Task-Seeded Synthetic Q&A Generation for Nemotron Pretraining
NVIDIA introduced a task-seeded synthetic data generation pipeline that achieves double-digit benchmark improvements in Nemotron-3 Nano pretraining, pointing to a new way of using synthetic data.
Hugging Face Blog · Jun 4, 2026
How to Ground a Korean AI Agent in Real Demographics with Synthetic Personas
NVIDIA, in collaboration with Korean institutions, released a dataset of 6 million synthetic personas that grounds AI agents in authentic Korean demographics and cultural context rather than Western defaults.
Hugging Face Blog · Apr 21, 2026
Building a Fast Multilingual OCR Model with Synthetic Data
NVIDIA trained the Nemotron OCR v2 model on 12 million synthetic images. Across six languages, the model reaches a normalized edit distance (NED) as low as 0.035 and processes 34.7 pages per second on a single A100 GPU, suggesting that synthetic data can ease the multilingual data bottleneck in OCR.
Hugging Face Blog · Apr 18, 2026