Memory Management - Tag

The State of FP8 KV-Cache and Attention Quantization in vLLM

vLLM uses FP8 quantization for the KV cache to halve its memory footprint and double throughput for long-context inference while maintaining accuracy, though some specific performance pitfalls need attention.

vLLM Blog

Tag: Memory Management (1 article)
