What we built:
📋 Metadata schema for cross-framework comparison
🔧 Validation via Hugging Face Jobs
🔌 Converters (Inspect AI, HELM, lm-eval-harness)
📊 Community repo organized by benchmark/model/run
✨ Captures scores AND context: settings, prompts, example-level data

🤔 Consider the scenario:
LLaMA 65B scored 0.637 on HELM's MMLU
LLaMA 65B scored 0.488 on lm-eval-harness's MMLU
Same model. Same benchmark name. Different prompts, settings, extraction methods.
💡 Which score is right? Both? Neither? We can't compare. 🤷

This has real costs!
🔬 Signal buried in noise: we can't tell if differences reflect model capability or just setup
📦 Evaluation debt piles up silently across the ecosystem
🔎 Redundant re-runs of expensive evaluations
🌟 That's where Every Eval Ever comes in

Thankful to our partners for the feedback: CAISI, AIEleuther, Huggingface, NomaSecurity, TrustibleAI, InspectAI, Meridian, AVERI, CIP, Stanford HELM, Weizenbaum, Evidence Prime, MIT, TUM, IBM Research 🤝

⏳ 9 more days! We extended the submission deadline for the EvalEval Workshop @ ACL 2026. If your work touches AI evaluation, submit!
We welcome:
✅ Regular papers
✅ ARR submissions
✅ Non-archival work
✅ Position papers
✅ Extended abstracts
📅 Deadline: March 19
🌐 evalevalai.com/events/2026-...

🚀 Launching Every Eval Ever: Toward a Common Language for AI Eval Reporting 🚀
A shared schema + crowdsourced repository so we can finally compare evals across frameworks and stop rerunning everything from scratch 🔧
A tale of broken AI evals 🧵👇
evalevalai.com/projects/eve...