The State of FP8 KV-Cache and Attention Quantization in vLLM
vLLM uses FP8 quantization of the KV cache to halve KV-cache memory usage relative to 16-bit storage and to as much as double throughput for long-context inference while maintaining accuracy, though several specific performance pitfalls need attention.
vLLM Blog ·