We are a community of researchers developing scientifically grounded research outputs and robust deployment infrastructure for broader impact evaluations.
https://evalevalai.com/
EvalEval Coalition
3 days left!
Writing, have written, or just submitted a paper?
Commit it to the EvalEval workshop at ACL 2026 in San Diego!
evalevalai.com/events/2026-...
(including ARR Submissions, non-archival, positions, and extended abstracts!)
Submission Deadline: March 19th, 2026 AoE
⏳ 9 more days! We extended the submission deadline for the EvalEval Workshop @ ACL 2026.
If your work touches AI evaluation, submit!
We welcome:
Regular papers
ARR submissions
Non-archival work
Position papers
Extended abstracts
Deadline: March 19
evalevalai.com/events/2026-...
Read the full announcement: evalevalai.com/infrastructu...
Shared Task: evalevalai.com/events/share...
Project Webpage: evalevalai.com/projects/eve...
#AIEvaluation #EvalEval
Thankful to our partners for the feedback: CAISI, AIEleuther, Huggingface, NomaSecurity, TrustibleAI, InspectAI, Meridian, AVERI, CIP, Stanford HELM, Weizenbaum, Evidence Prime, MIT, TUM, IBM Research
How can you help?
We are launching a shared task alongside our workshop at @aclmeeting.bsky.social
Two tracks: public + proprietary eval data
Co-authorship for qualifying contributors
Workshop at ACL 2026 (San Diego)
Deadline: May 1, 2026
What we built:
Metadata schema for cross-framework comparison
Validation via Hugging Face Jobs
Converters (Inspect AI, HELM, lm-eval-harness)
Community repo organized by benchmark/model/run
✨ Captures scores AND context: settings, prompts, example-level data
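To make the list above concrete, here is a minimal sketch of what a single record might look like if it stored a score together with its context and lived in a benchmark/model/run layout. Every field name, the `EvalRecord` class, and the `run-001` identifier are assumptions made for illustration; this is not the project's actual schema.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import PurePosixPath


@dataclass
class EvalRecord:
    # Hypothetical fields: illustrative only, not the real Every Eval Ever schema.
    benchmark: str
    model: str
    run_id: str
    framework: str  # e.g. "helm" or "lm-eval-harness"
    score: float
    settings: dict = field(default_factory=dict)  # generation / few-shot settings
    prompt_template: str = ""
    examples: list = field(default_factory=list)  # example-level data

    def repo_path(self) -> PurePosixPath:
        # Mirrors a repository organized by benchmark/model/run.
        return PurePosixPath(self.benchmark) / self.model / f"{self.run_id}.json"

    def to_json(self) -> str:
        # Serialize the score and its context together in one document.
        return json.dumps(asdict(self), sort_keys=True)


record = EvalRecord(
    benchmark="mmlu",
    model="llama-65b",
    run_id="run-001",  # hypothetical run identifier
    framework="helm",
    score=0.637,
)
print(record.repo_path())  # mmlu/llama-65b/run-001.json
```

Keeping settings, prompts, and example-level data in the same record as the score is what would let a later reader check whether two numbers were produced under the same conditions.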
This has real costs!
Signal buried in noise: we can't tell whether differences reflect model capability or just setup
Evaluation debt piles up silently across the ecosystem
Redundant re-runs of expensive evaluations
That's where Every Eval Ever comes in
Consider this scenario:
LLaMA 65B scored 0.637 on HELM's MMLU
LLaMA 65B scored 0.488 on lm-eval-harness's MMLU
Same model. Same benchmark name. Different prompts, settings, extraction methods.
Which score is right? Both? Neither? We can't compare. 🤷
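One way to picture the problem is a small check that refuses to compare two scores until their recorded context matches. This is only a sketch: the two scores come from the scenario above, while the `context` keys and the placeholder values (`template-A`, `method-B`, and so on) are hypothetical stand-ins, since the actual prompts and extraction methods are not given here.

```python
def context_diff(a: dict, b: dict) -> dict:
    """Return the context keys whose values differ between two runs."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


def comparable(run_a: dict, run_b: dict) -> bool:
    """Two runs are treated as comparable only if their context is identical."""
    return not context_diff(run_a["context"], run_b["context"])


# Scores are from the scenario; context values are placeholders, not real settings.
helm_run = {
    "framework": "helm",
    "score": 0.637,
    "context": {"prompt": "template-A", "extraction": "method-A"},
}
harness_run = {
    "framework": "lm-eval-harness",
    "score": 0.488,
    "context": {"prompt": "template-B", "extraction": "method-B"},
}

diff = context_diff(helm_run["context"], harness_run["context"])
if diff:
    print("Not directly comparable; differing context:", sorted(diff))
else:
    print("Score gap:", round(helm_run["score"] - harness_run["score"], 3))
```

Without the context stored next to each score, a check like this has nothing to inspect, and the two numbers can only be compared blindly.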
Launching Every Eval Ever: Toward a Common Language for AI Eval Reporting
A shared schema + crowdsourced repository so we can finally compare evals across frameworks and stop rerunning everything from scratch
A tale of broken AI evals 🧵
evalevalai.com/projects/eve...