🤔 Consider the scenario:
LLaMA 65B scored 0.637 on HELM's MMLU
LLaMA 65B scored 0.488 on lm-eval-harness's MMLU
Same model. Same benchmark name. Different prompts, settings, extraction methods.
💡 Which score is right? Both? Neither? We can't compare. 🤷

🚀 Launching Every Eval Ever: Toward a Common Language for AI Eval Reporting 🚀
A shared schema + crowdsourced repository so we can finally compare evals across frameworks and stop rerunning everything from scratch 🔧
A tale of broken AI evals 🧵👇
evalevalai.com/projects/eve...

What we built:
📋 Metadata schema for cross-framework comparison
🔧 Validation via Hugging Face Jobs
🔌 Converters (Inspect AI, HELM, lm-eval-harness)
📊 Community repo organized by benchmark/model/run
✨ Captures scores AND context: settings, prompts, example-level data

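To make "scores AND context" concrete, here is a minimal sketch in Python. It is not the Every Eval Ever schema: the `EvalRecord` class, its field names, and the `comparable` helper are hypothetical, chosen only to illustrate the idea. The two scores are the LLaMA 65B MMLU numbers quoted in this thread; the prompt, few-shot, and extraction settings behind them are not given here, so they are left as `None` ("not reported").

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EvalRecord:
    """Hypothetical record: one score plus the context that produced it."""
    model: str
    benchmark: str
    framework: str
    score: float
    # None means "not reported" (an assumption of this sketch).
    prompt_template: Optional[str] = None
    num_fewshot: Optional[int] = None
    answer_extraction: Optional[str] = None

    def context(self):
        return (self.prompt_template, self.num_fewshot, self.answer_extraction)


def comparable(a: EvalRecord, b: EvalRecord) -> bool:
    """Scores are comparable only if the full context is reported and identical."""
    if (a.model, a.benchmark) != (b.model, b.benchmark):
        return False
    if None in a.context() or None in b.context():
        return False
    return a.context() == b.context()


helm = EvalRecord("LLaMA 65B", "MMLU", "HELM", 0.637)
harness = EvalRecord("LLaMA 65B", "MMLU", "lm-eval-harness", 0.488)

# Gap between the two reported scores, rounded for display.
gap = round(helm.score - harness.score, 3)
print(comparable(helm, harness), gap)
```

With the context missing, the sketch refuses to call the two numbers comparable, which is the situation a shared metadata schema is meant to fix.
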
Thankful to our partners for the feedback: CAISI, AIEleuther, Huggingface, NomaSecurity, TrustibleAI, InspectAI, Meridian, AVERI, CIP, Stanford HELM, Weizenbaum, Evidence Prime, MIT, TUM, IBM Research 🤝

This has real costs!
🔬 Signal buried in noise: can't tell if differences reflect model capability or just setup
📦 Evaluation debt piles up silently across the ecosystem
🔎 Redundant re-runs of expensive evaluations
🌟 That's where Every Eval Ever comes in

โณ 9 more days! We extended the submission deadline for the EvalEval Workshop @ ACL 2026. If your work touches AI evaluation, submit! We welcome: โœ… Regular papers โœ… ARR submissions โœ… Non-archival work โœ… Position papers โœ… Extended abstracts ๐Ÿ“… Deadline: March 19 ๐ŸŒ evalevalai.com/events/2026-...
How can you help? We are launching a shared task alongside our workshop at @aclmeeting.bsky.social
→ Two tracks: public + proprietary eval data
→ Co-authorship for qualifying contributors
→ Workshop at ACL 2026 (San Diego)
→ Deadline: May 1, 2026 📅

3 days left! 📃
Writing, wrote, or just submitted a paper? Commit it to the EvalEval workshop at ACL 2026 in San Diego!
evalevalai.com/events/2026-...
(including ARR submissions, non-archival, positions, and extended abstracts!)
Submission Deadline: March 19th, 2026 AoE