Consider the scenario:
LLaMA 65B scored 0.637 on HELM's MMLU
LLaMA 65B scored 0.488 on lm-eval-harness's MMLU
Same model. Same benchmark name. Different prompts, settings, extraction methods.
Which score is right? Both? Neither? We can't compare. 🤷
Launching Every Eval Ever: Toward a Common Language for AI Eval Reporting
A shared schema + crowdsourced repository so we can finally compare evals across frameworks and stop rerunning everything from scratch
A tale of broken AI evals 🧵
evalevalai.com/projects/eve...
What we built:
Metadata schema for cross-framework comparison
Validation via Hugging Face Jobs
Converters (Inspect AI, HELM, lm-eval-harness)
Community repo organized by benchmark/model/run
✨ Captures scores AND context: settings, prompts, example-level data
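To make "scores AND context" concrete, here is a minimal sketch of what such a record could look like. It is not the actual Every Eval Ever schema: `EvalRecord`, its field names, and `context_mismatches` are hypothetical, chosen only for illustration. The two scores are the LLaMA 65B numbers quoted at the top of the thread; the prompt and settings fields are left unset because the thread does not report them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EvalRecord:
    """Hypothetical record: a score plus the context needed to interpret it."""
    model: str
    benchmark: str
    framework: str                           # which harness produced the score
    score: float
    prompt_template: Optional[str] = None    # None means "not reported"
    num_fewshot: Optional[int] = None
    answer_extraction: Optional[str] = None


# Context fields that should match before two scores are compared directly.
CONTEXT_FIELDS = ("framework", "prompt_template", "num_fewshot", "answer_extraction")


def context_mismatches(a: EvalRecord, b: EvalRecord) -> list[str]:
    """Return context fields that differ or are unreported in either record."""
    issues = []
    for name in CONTEXT_FIELDS:
        va, vb = getattr(a, name), getattr(b, name)
        if va is None or vb is None or va != vb:
            issues.append(name)
    return issues


# Scores from the thread; everything else about the runs is unknown here.
helm = EvalRecord("LLaMA 65B", "MMLU", "HELM", 0.637)
harness = EvalRecord("LLaMA 65B", "MMLU", "lm-eval-harness", 0.488)

gap = round(helm.score - harness.score, 3)
mismatches = context_mismatches(helm, harness)
print(f"score gap: {gap}, unverified context: {mismatches}")
```

With only the bare scores, every context field is flagged, which is the point: the gap between the two numbers can't be attributed to anything until the settings are recorded alongside them.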
Thankful to our partners for the feedback: CAISI, AIEleuther, Huggingface, NomaSecurity, TrustibleAI, InspectAI, Meridian, AVERI, CIP, Stanford HELM, Weizenbaum, Evidence Prime, MIT, TUM, IBM Research
This has real costs!
Signal buried in noise: we can't tell if differences reflect model capability or just setup
Evaluation debt piles up silently across the ecosystem
Redundant re-runs of expensive evaluations
That's where Every Eval Ever comes in
⏳ 9 more days! We extended the submission deadline for the EvalEval Workshop @ ACL 2026.
If your work touches AI evaluation, submit!
We welcome:
Regular papers
ARR submissions
Non-archival work
Position papers
Extended abstracts
Deadline: March 19
evalevalai.com/events/2026-...
How can you help?
We are launching a shared task alongside our workshop at @aclmeeting.bsky.social
Two tracks: public + proprietary eval data
Co-authorship for qualifying contributors
Workshop at ACL 2026 (San Diego)
Deadline: May 1, 2026
3 days left!
Writing, finished, or just submitted a paper?
Commit it to the EvalEval workshop at ACL 2026 in San Diego!
evalevalai.com/events/2026-...
(including ARR submissions, non-archival work, position papers, and extended abstracts!)
Submission Deadline: March 19th, 2026 AoE