WiAIR is dedicated to celebrating the remarkable contributions of female AI researchers from around the globe. Our goal is to empower early career researchers, especially women, to pursue their passion for AI and make an impact in this exciting field.
Neural MT metrics show the strongest alignment with downstream performance. But the proxy has limits: some specialized benchmarks, including MGSM and INCLUDE, show weaker or more variable correlations. Task-specific evaluation remains necessary. (4/5 🧵)
Translation quality is measured on 3 parallel corpora using 7 MT metrics, both lexical and neural, then systematically correlated with downstream benchmark scores across languages and tasks. (3/5 🧵)
The paper evaluates 14 LLMs across 5 model families on 9 multilingual benchmarks spanning knowledge, reading comprehension, NLI, commonsense & mathematical reasoning, truthfulness, and regional knowledge. (2/5 🧵)
✨ Can translation quality serve as a scalable proxy for multilingual LLM evaluation?
In our latest #WiAIR episode, we host Dr. Saadia Gabriel (@skgabrie.bsky.social) to discuss "Translation as a Scalable Proxy for Multilingual Evaluation". (1/5 🧵)