We are a researcher community developing scientifically grounded research outputs and robust deployment infrastructure for broader impact evaluations. https://evalevalai.com/
EvalEval Coalition








3 days left! 📃 Writing, wrote, or just submitted a paper? Commit it to the EvalEval workshop at ACL 2026 in San Diego! evalevalai.com/events/2026-... (including ARR submissions, non-archival, positions, and extended abstracts!) Submission Deadline: March 19th, 2026 AoE
⏳ 9 more days! We extended the submission deadline for the EvalEval Workshop @ ACL 2026. If your work touches AI evaluation, submit! We welcome: βœ… Regular papers βœ… ARR submissions βœ… Non-archival work βœ… Position papers βœ… Extended abstracts πŸ“… Deadline: March 19 🌐 evalevalai.com/events/2026-...
Read the full announcement: evalevalai.com/infrastructu... Shared Task: evalevalai.com/events/share... Project Webpage: evalevalai.com/projects/eve... #AIEvaluation #EvalEval
Thankful to our partners for the feedback: CAISI, AIEleuther, Huggingface, NomaSecurity, TrustibleAI, InspectAI, Meridian, AVERI, CIP, Stanford HELM, Weizenbaum, Evidence Prime, MIT, TUM, IBM Research 🤝
How can you help? We are launching a shared task alongside our workshop at @aclmeeting.bsky.social → Two tracks: public + proprietary eval data → Co-authorship for qualifying contributors → Workshop at ACL 2026 (San Diego) → Deadline: May 1, 2026 📅
What we built: 📋 Metadata schema for cross-framework comparison 🔧 Validation via Hugging Face Jobs 🔌 Converters (Inspect AI, HELM, lm-eval-harness) 📊 Community repo organized by benchmark/model/run ✨ Captures scores AND context: settings, prompts, example-level data
This has real costs! 🔬 Signal buried in noise: can't tell if differences reflect model capability or just setup 📦 Evaluation debt piles up silently across the ecosystem 🔎 Redundant re-runs of expensive evaluations 🌟 That's where Every Eval Ever comes in
🤔 Consider the scenario: LLaMA 65B scored 0.637 on HELM's MMLU. LLaMA 65B scored 0.488 on lm-eval-harness's MMLU. Same model. Same benchmark name. Different prompts, settings, extraction methods. 💡 Which score is right? Both? Neither? We can't compare. 🤷
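To make the scenario concrete, here is a minimal Python sketch of why two scores with the same model and benchmark name may still not be comparable. The `EvalRecord` class, its field names, and the `context` values are hypothetical placeholders for illustration. They are not the Every Eval Ever schema and not the actual HELM or lm-eval-harness settings. Only the two scores come from the post above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EvalRecord:
    # Hypothetical record layout, not the Every Eval Ever schema.
    model: str
    benchmark: str
    framework: str
    score: float
    # Evaluation context: prompts, settings, extraction method, etc.
    context: dict = field(default_factory=dict)


def context_differences(a: EvalRecord, b: EvalRecord) -> list:
    """Return the sorted context keys whose values differ between two records."""
    keys = sorted(set(a.context) | set(b.context))
    return [k for k in keys if a.context.get(k) != b.context.get(k)]


def directly_comparable(a: EvalRecord, b: EvalRecord) -> bool:
    """Scores are comparable only if model, benchmark, and context all match."""
    same_target = a.model == b.model and a.benchmark == b.benchmark
    return same_target and not context_differences(a, b)


# Scores are from the post; context values are made-up placeholders.
helm = EvalRecord(
    "LLaMA 65B", "MMLU", "HELM", 0.637,
    {"prompt_template": "placeholder-A", "answer_extraction": "placeholder-A"},
)
harness = EvalRecord(
    "LLaMA 65B", "MMLU", "lm-eval-harness", 0.488,
    {"prompt_template": "placeholder-B", "answer_extraction": "placeholder-B"},
)

if not directly_comparable(helm, harness):
    print("Not directly comparable; differing context:",
          context_differences(helm, harness))
```

The point of the sketch is that the comparison is only possible when the context travels with the score. A bare number per benchmark name cannot reveal the mismatch.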
🚀 Launching Every Eval Ever: Toward a Common Language for AI Eval Reporting 🚀 A shared schema + crowdsourced repository so we can finally compare evals across frameworks and stop rerunning everything from scratch 🔧 A tale of broken AI evals 🧵👇 evalevalai.com/projects/eve...