What we built:
Metadata schema for cross-framework comparison
Validation via Hugging Face Jobs
Converters (Inspect AI, HELM, lm-eval-harness)
Community repo organized by benchmark/model/run
Captures scores AND context: settings, prompts, example-level data
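To make "scores AND context" concrete, here is a minimal sketch of what one record might look like. The class name, field names, and the `prompt_template` values are hypothetical placeholders, not the project's actual schema or the real settings of either run; only the two MMLU scores are the ones quoted in this thread.

```python
from dataclasses import dataclass, field


@dataclass
class EvalRecord:
    """Hypothetical record shape; field names are illustrative only."""
    framework: str
    benchmark: str
    model: str
    score: float
    # Context that travels with the score: prompts, settings, extraction method.
    settings: dict = field(default_factory=dict)


def directly_comparable(a: EvalRecord, b: EvalRecord) -> bool:
    """Treat two scores as comparable only if benchmark, model, and settings match."""
    return (a.benchmark, a.model, a.settings) == (b.benchmark, b.model, b.settings)


# Placeholder settings: stand-ins for whatever each framework actually used.
helm = EvalRecord("HELM", "MMLU", "LLaMA 65B", 0.637,
                  {"prompt_template": "placeholder-A"})
harness = EvalRecord("lm-eval-harness", "MMLU", "LLaMA 65B", 0.488,
                     {"prompt_template": "placeholder-B"})

print(directly_comparable(helm, harness))
```

With the context stored next to the score, the mismatch in setup is visible in the record itself rather than hidden behind a shared benchmark name.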
Consider this scenario:
LLaMA 65B scored 0.637 on HELM's MMLU
LLaMA 65B scored 0.488 on lm-eval-harness's MMLU
Same model. Same benchmark name. Different prompts, settings, extraction methods.
Which score is right? Both? Neither? We can't compare. 🤷
This has real costs!
Signal buried in noise: we can't tell if differences reflect model capability or just setup
Evaluation debt piles up silently across the ecosystem
Redundant re-runs of expensive evaluations
That's where Every Eval Ever comes in
Thankful to our partners for the feedback: CAISI, AIEleuther, Huggingface, NomaSecurity, TrustibleAI, InspectAI, Meridian, AVERI, CIP, Stanford HELM, Weizenbaum, Evidence Prime, MIT, TUM, IBM Research
⏳ 9 more days! We extended the submission deadline for the EvalEval Workshop @ ACL 2026.
If your work touches AI evaluation, submit!
We welcome:
Regular papers
ARR submissions
Non-archival work
Position papers
Extended abstracts
Deadline: March 19
evalevalai.com/events/2026-...
Launching Every Eval Ever: Toward a Common Language for AI Eval Reporting
A shared schema + crowdsourced repository so we can finally compare evals across frameworks and stop rerunning everything from scratch
A tale of broken AI evals 🧵
evalevalai.com/projects/eve...