We are a researcher community developing scientifically grounded research outputs and robust deployment infrastructure for broader impact evaluations. https://evalevalai.com/

EvalEval Coalition

3 days left! 📃 Writing, wrote, or just submitted a paper? Commit it to the EvalEval workshop at ACL 2026 in San Diego! evalevalai.com/events/2026-... (including ARR Submissions, non-archival, positions, and extended abstracts!) Submission Deadline: March 19th, 2026 AoE

⏳ 9 more days! We extended the submission deadline for the EvalEval Workshop @ ACL 2026. If your work touches AI evaluation, submit! We welcome: ✅ Regular papers ✅ ARR submissions ✅ Non-archival work ✅ Position papers ✅ Extended abstracts 📅 Deadline: March 19 🌐 evalevalai.com/events/2026-...

Read the full announcement: evalevalai.com/infrastructu... Shared Task: evalevalai.com/events/share... Project Webpage: evalevalai.com/projects/eve... #AIEvaluation #EvalEval

Thankful to our partners for the feedback: CAISI, AIEleuther, Huggingface, NomaSecurity, TrustibleAI, InspectAI, Meridian, AVERI, CIP, Stanford HELM, Weizenbaum, Evidence Prime, MIT, TUM, IBM Research 🤝

How can you help? We are launching a shared task alongside our workshop at @aclmeeting.bsky.social → Two tracks: public + proprietary eval data → Co-authorship for qualifying contributors → Workshop at ACL 2026 (San Diego) → Deadline: May 1, 2026 📅

What we built: 📋 Metadata schema for cross-framework comparison 🔧 Validation via Hugging Face Jobs 🔌 Converters (Inspect AI, HELM, lm-eval-harness) 📊 Community repo organized by benchmark/model/run ✨ Captures scores AND context: settings, prompts, example-level data
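As a rough illustration of what "scores AND context" could mean in practice, here is a minimal Python sketch. It is not the actual Every Eval Ever schema: the `EvalRecord` class, its field names, the two helper functions, and the settings values are all hypothetical. Only the two MMLU scores come from the thread.

```python
from dataclasses import dataclass, field


@dataclass
class EvalRecord:
    """Hypothetical record: one score plus the context that produced it."""
    model: str
    benchmark: str
    framework: str
    score: float
    # Context such as prompt template or answer-extraction method.
    settings: dict = field(default_factory=dict)


def differing_settings(a: EvalRecord, b: EvalRecord) -> list:
    """Return the sorted setting keys whose values differ between two records."""
    keys = set(a.settings) | set(b.settings)
    return sorted(k for k in keys if a.settings.get(k) != b.settings.get(k))


def comparable(a: EvalRecord, b: EvalRecord) -> bool:
    """Treat two scores as comparable only if model, benchmark, and settings match."""
    return (
        a.model == b.model
        and a.benchmark == b.benchmark
        and not differing_settings(a, b)
    )


# Scores are from the thread; the settings values are made-up placeholders.
helm = EvalRecord(
    "LLaMA 65B", "MMLU", "HELM", 0.637,
    {"prompt_template": "template-A", "extraction": "method-A"},
)
harness = EvalRecord(
    "LLaMA 65B", "MMLU", "lm-eval-harness", 0.488,
    {"prompt_template": "template-B", "extraction": "method-B"},
)

print(comparable(helm, harness))
print(differing_settings(helm, harness))
```

With the context stored next to each score, the gap between the two numbers can at least be traced to specific setup differences instead of being read as a capability difference.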

This has real costs! 🔬 Signal buried in noise: we can't tell if differences reflect model capability or just setup 📦 Evaluation debt piles up silently across the ecosystem 🔎 Redundant re-runs of expensive evaluations 🌟 That's where Every Eval Ever comes in

🤔 Consider the scenario: LLaMA 65B scored 0.637 on HELM's MMLU. LLaMA 65B scored 0.488 on lm-eval-harness's MMLU. Same model. Same benchmark name. Different prompts, settings, extraction methods. 💡 Which score is right? Both? Neither? We can't compare. 🤷

🚀 Launching Every Eval Ever: Toward a Common Language for AI Eval Reporting 🚀 A shared schema + crowdsourced repository so we can finally compare evals across frameworks and stop rerunning everything from scratch 🔧 A tale of broken AI evals 🧵👇 evalevalai.com/projects/eve...
