Tag: Model Inference (3 articles)

A First Comprehensive Study of TurboQuant: Accuracy and Performance

A comprehensive benchmark by the vLLM team finds that TurboQuant generally underperforms FP8 quantization and is potentially viable only for extremely memory-constrained edge deployments.

vLLM Blog

Building Blocks for Foundation Model Training and Inference on AWS

AWS details the infrastructure supporting the full foundation model lifecycle, from pre-training and post-training to inference. The post describes a paradigm shift from a single scaling law to three, and a trend toward deeper integration of open-source software stacks with cloud infrastructure.

Hugging Face Blog

Run Highly Efficient Multimodal Agentic AI with NVIDIA Nemotron 3 Nano Omni Using vLLM

NVIDIA releases Nemotron 3 Nano Omni, a 30B-parameter MoE model that achieves extreme efficiency by activating only 3B parameters, offering a unified and cost-effective solution for multimodal AI agents.

vLLM Blog