Announcing VeRL-Omni: Easy, Fast, and Stable RL Training for Diffusion and Omni-Modality Models
VeRL-Omni is a reinforcement learning training framework for multimodal generative models. It addresses the engineering challenges of efficient, stable RL training on diffusion and omni-modality models and extends the LLM RL training paradigm to image, video, and audio generation.
vLLM Blog · May 14, 2026
Introducing NVIDIA Nemotron 3 Nano Omni: Long-Context Multimodal Intelligence for Documents, Audio and Video Agents
NVIDIA releases Nemotron 3 Nano Omni, a hybrid Mamba-Transformer model for long-context multimodal understanding of documents, audio, and video. It leads multiple benchmarks and offers an efficient new option for AI agents handling complex real-world tasks.
Hugging Face Blog · Apr 28, 2026
Gemma 4 VLA Demo on Jetson Orin Nano Super
An end-to-end multimodal agent demo running on NVIDIA Jetson Orin Nano Super, showcasing how the model autonomously decides when to use the camera and answers questions with visual context, a sign that powerful AI capabilities are reaching edge devices.
Hugging Face Blog · Apr 22, 2026
Training and Finetuning Multimodal Embedding & Reranker Models with Sentence Transformers
Hugging Face releases a new tutorial demonstrating how fine-tuning multimodal embedding models for a specific domain (such as visual document retrieval) can yield performance far surpassing general-purpose large models, even outperforming models with four times as many parameters.
Hugging Face Blog · Apr 16, 2026
ChatGPT voice mode is a weaker model
Simon Willison reveals a counterintuitive fact: ChatGPT's voice mode runs on an older, weaker GPT-4o-era model, creating a massive gap between user expectations and reality.
Simon Willison · Apr 10, 2026
Multimodal Embedding & Reranker Models with Sentence Transformers
Sentence Transformers v5.4 introduces native multimodal embedding support, enabling text, images, audio, and video to share a unified vector space for cross-modal retrieval.
Hugging Face Blog · Apr 9, 2026
Gemma 4: Byte for byte, the most capable open models
Google DeepMind's Gemma 4 models improve parameter efficiency and support multimodal inputs, marking a significant advance in small, capable models.
Simon Willison · Apr 3, 2026
Welcome Gemma 4: Frontier multimodal intelligence on device
Gemma 4 introduces enhanced multimodal capabilities, supporting image, text, and audio inputs, significantly improving model intelligence and deployment flexibility across devices.
Hugging Face Blog · Apr 2, 2026
Granite 4.0 3B Vision: Compact Multimodal Intelligence for Enterprise Documents
Granite 4.0 3B Vision is a compact multimodal model designed for enterprise documents, offering efficient information extraction and chart understanding.
Hugging Face Blog · Mar 31, 2026
Holotron-12B - High Throughput Computer Use Agent
Holotron-12B optimizes inference efficiency and long-context handling, targeting high-throughput computer-use agents.
Hugging Face Blog · Mar 17, 2026
Parsing the Unreadable: How LlamaParse Handles Legal Discovery Documents
LlamaParse uses multimodal models to understand not just text but also charts, images, and complex layouts, addressing the difficulty of parsing low-quality scanned documents in legal discovery.
LlamaIndex Blog ·