What we built:
📋 Metadata schema for cross-framework comparison
🔧 Validation via Hugging Face Jobs
🔌 Converters (Inspect AI, HELM, lm-eval-harness)
📊 Community repo organized by benchmark/model/run
✨ Captures scores AND context: settings, prompts, example-level data

🤔 Consider the scenario:
LLaMA 65B scored 0.637 on HELM's MMLU
LLaMA 65B scored 0.488 on lm-eval-harness's MMLU
Same model. Same benchmark name. Different prompts, settings, extraction methods.
💡 Which score is right? Both? Neither? We can't compare. 🤷

This has real costs!
🔬 Signal buried in noise: we can't tell if differences reflect model capability or just setup
📦 Evaluation debt piles up silently across the ecosystem
🔎 Redundant re-runs of expensive evaluations
🌟 That's where Every Eval Ever comes in

Thanks to our partners for their feedback: CAISI, EleutherAI, Hugging Face, NomaSecurity, TrustibleAI, InspectAI, Meridian, AVERI, CIP, Stanford HELM, Weizenbaum, Evidence Prime, MIT, TUM, IBM Research 🤝

โณ 9 more days! We extended the submission deadline for the EvalEval Workshop @ ACL 2026. If your work touches AI evaluation, submit! We welcome: โœ… Regular papers โœ… ARR submissions โœ… Non-archival work โœ… Position papers โœ… Extended abstracts ๐Ÿ“… Deadline: March 19 ๐ŸŒ evalevalai.com/events/2026-...
🚀 Launching Every Eval Ever: Toward a Common Language for AI Eval Reporting 🚀
A shared schema + crowdsourced repository so we can finally compare evals across frameworks and stop rerunning everything from scratch 🔧
A tale of broken AI evals 🧵👇
evalevalai.com/projects/eve...