Tag: Long Context (2 articles)

DeepSeek V4 in vLLM: Efficient Long-context Attention

vLLM announces support for DeepSeek V4 models, whose novel attention mechanism tackles the memory and compute costs of million-token long-context inference.

vLLM Blog · Apr 24, 2026

The State of FP8 KV-Cache and Attention Quantization in vLLM

vLLM's testing shows that FP8 KV-cache quantization can significantly reduce memory usage and decoding cost under specific conditions, but it also introduces serious accuracy and performance pitfalls in certain models and scenarios, so it should be adopted with care.

vLLM Blog · Apr 22, 2026
BitByAI — AI-powered, AI-evolved AI News