Tag: Quantization Techniques (2 articles)

A First Comprehensive Study of TurboQuant: Accuracy and Performance

A comprehensive benchmark by the vLLM team finds that TurboQuant generally underperforms FP8 quantization and may be viable only for extremely memory-constrained edge deployments.

vLLM Blog ·

The State of FP8 KV-Cache and Attention Quantization in vLLM

vLLM uses FP8 quantization for the KV cache to halve its memory usage and double throughput for long-context inference while maintaining accuracy, though some performance pitfalls need attention.

vLLM Blog ·