QIMMA قِمّة ⛰: A Quality-First Arabic LLM Leaderboard

Hugging Face Blog · Advanced Toolchain · Impact: 6/10

QIMMA's validate-then-evaluate approach uncovered systematic quality issues in mainstream Arabic benchmarks, signaling AI evaluation's shift from data volume to quality assurance.

Key Points

Mainstream Arabic benchmarks suffer from translation distortion and annotation errors, and many 'authoritative' datasets are of questionable quality.
QIMMA pioneered a 'clean-then-evaluate' pipeline, manually validating 109 subsets from 14 benchmarks, with 99% native Arabic content.
It is the first Arabic leaderboard to support code evaluation, filling the gap for HumanEval+ and MBPP+ in Arabic contexts.
Evaluation is shifting from 'language coverage' to 'measurement validity', offering lessons for Chinese AI globalization and domestic benchmarking.

Analysis

"Think translated benchmarks are good enough? The evaluation crisis in Arabic AI serves as a wake-up call."

Analysis generated by BitByAI

LLM Evaluation · Multilingual Models · Data Quality · AI Globalization · Arabic AI