Building a Fast Multilingual OCR Model with Synthetic Data
Hugging Face Blog · Toolchain · Beginner · Impact: 7/10
NVIDIA trained the Nemotron OCR v2 model on 12 million synthetic images, achieving high accuracy (NED as low as 0.035) and high speed (34.7 pages/second on a single A100 GPU) across six languages, demonstrating that synthetic data is a key solution to the multilingual data bottleneck in OCR.
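The NED figures quoted here are normalized edit distance: the Levenshtein distance between the predicted and reference transcriptions, divided by the length of the longer string, so 0 means a perfect read. A minimal sketch of the metric (the exact normalization NVIDIA uses is an assumption here):

```python
# Sketch of Normalized Edit Distance (NED). Assumes the common
# normalization by the longer string's length; the source does not
# specify NVIDIA's exact variant.
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def ned(pred: str, ref: str) -> float:
    """NED in [0, 1]; 0.0 is a perfect transcription."""
    if not pred and not ref:
        return 0.0
    return levenshtein(pred, ref) / max(len(pred), len(ref))
```

Under this definition, `ned("kitten", "sitting")` is 3/7, and a score of 0.035 corresponds to roughly one character-level error per 29 characters.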
Key Points
- The bottleneck is data, not architecture: v1 performed poorly on non-English languages (NED up to 0.92) because its training data lacked coverage of multilingual character sets.
- Synthetic data is the breakthrough: programmatically rendering text onto images provides both massive scale (12 million images) and perfectly accurate labels (bounding boxes, transcriptions, reading order).
- Achieving both speed and accuracy: Accuracy gains come from vast multilingual synthetic data; speed gains stem from a shared detection backbone architecture that eliminates redundant computation.
- The solution is generic and open: the data generation pipeline extends to any language with available fonts and source text, and both the model and the dataset are open-sourced.
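The "perfectly accurate labels" point can be illustrated with a toy layout engine: because the generator places every string itself, ground-truth bounding boxes and reading order fall out of the placement arithmetic with no human annotation. This is only a sketch with an invented fixed-metric font; a real pipeline would rasterize with actual fonts (e.g. via a renderer such as Pillow) and take glyph boxes from the renderer.

```python
# Toy synthetic-labeling sketch: lay text out top-to-bottom with a
# hypothetical monospace font metric, and emit exact ground truth.
# CHAR_W/CHAR_H/MARGIN/LEADING are invented constants, not from the post.
CHAR_W, CHAR_H, MARGIN, LEADING = 12, 20, 8, 4

def synthesize_labels(lines):
    """Return ground-truth records (text, bbox, reading order)
    for lines placed top-to-bottom on a synthetic page."""
    records = []
    y = MARGIN
    for order, text in enumerate(lines):
        x0, y0 = MARGIN, y
        x1, y1 = x0 + CHAR_W * len(text), y0 + CHAR_H
        records.append({"text": text,
                        "bbox": (x0, y0, x1, y1),
                        "reading_order": order})
        y = y1 + LEADING  # next line starts below this one
    return records
```

Because the layout is deterministic, every label is correct by construction, which is what lets the approach scale to millions of images and any script for which fonts and source text exist.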
Analysis
The Cause: The Persistent Challenge of Multilingual OCR
Analysis generated by BitByAI.