
Building a Fast Multilingual OCR Model with Synthetic Data

Hugging Face Blog · Toolchain · Beginner · Impact: 7/10

NVIDIA trained the Nemotron OCR v2 model on 12 million synthetic images, achieving high accuracy (NED as low as 0.035) and high speed (34.7 pages/second on a single A100 GPU) across six languages, demonstrating that synthetic data is a key solution to the multilingual data bottleneck in OCR.
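The accuracy figure above is a normalized edit distance (NED): the character-level Levenshtein distance between the predicted and reference transcriptions, divided by the length of the longer string, so 0.0 means a perfect match. A minimal sketch of the metric (the exact normalization used in the evaluation is an assumption here):

```python
def ned(pred: str, ref: str) -> float:
    """Normalized edit distance: Levenshtein distance between the two
    strings divided by the longer string's length (0.0 = exact match)."""
    m, n = len(pred), len(ref)
    if max(m, n) == 0:
        return 0.0
    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == ref[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + cost) # substitution
        prev = cur
    return prev[n] / max(m, n)
```

Under this definition, a single-character error in a five-character word yields a NED of 0.2, which puts the reported 0.035 (roughly one error every 30 characters) and the v1 worst case of 0.92 in perspective.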

Key Points

  • The bottleneck is data, not architecture: v1 performed poorly on non-English languages (NED up to 0.92) because its training data barely covered multilingual character sets.
  • Synthetic data is the breakthrough: programmatically rendering text onto images provides both massive scale (12 million images) and perfectly accurate labels (bounding boxes, transcriptions, reading order).
  • Speed and accuracy together: accuracy gains come from the vast multilingual synthetic data, while speed gains come from a shared detection backbone that eliminates redundant computation.
  • The approach is generic and open: the data generation pipeline extends to any language with available fonts and source text, and both the model and dataset are open-sourced.
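The "perfect labels for free" point is the heart of the synthetic approach: because the generator places every line of text itself, the bounding boxes, transcriptions, and reading order are known exactly, with no human annotation. A minimal sketch of that labeling logic, assuming a fixed-size monospace font and simple top-to-bottom layout (the actual NVIDIA pipeline is not described in this detail and surely handles varied fonts, scripts, and layouts):

```python
# Hypothetical layout constants: per-character width/height, page margin,
# and vertical gap between lines, in pixels.
CHAR_W, CHAR_H, MARGIN, LINE_GAP = 10, 18, 20, 6

def synth_labels(lines):
    """Return ground-truth labels for a synthetic page: one entry per
    rendered text line, carrying its transcription, bounding box
    (x0, y0, x1, y1), and reading-order index."""
    labels = []
    y = MARGIN
    for order, text in enumerate(lines):
        bbox = (MARGIN, y, MARGIN + CHAR_W * len(text), y + CHAR_H)
        labels.append({"text": text, "bbox": bbox, "order": order})
        y += CHAR_H + LINE_GAP  # advance to the next line
    return labels

# Any script works as long as a font covers it, which is exactly why the
# pipeline scales to new languages given fonts and source text.
page = synth_labels(["Bonjour le monde", "こんにちは世界"])
```

A renderer would then draw each `text` at its `bbox` onto an image; the label dictionary ships with the image as exact ground truth, sidestepping the annotation bottleneck entirely.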

Analysis

The Cause: The Persistent Challenge of Multilingual OCR

Analysis generated by BitByAI
