Breakthrough AI to solve the world's biggest problems.
Join us: http://allenai.org/careers
Get our newsletter: https://share.hsforms.com/1uJkWs5aDRHWhiky3aHooIg3ioxm
Ai2





olmo-eval builds on our OLMES project, which made benchmark scores comparable & reproducible by standardizing how models are evaluated. But a final score is only part of the story—olmo-eval works across the intermediate experiments teams compare throughout model development.
23h
If you find yourself asking "how does this model checkpoint differ from the last, and where did it improve/regress?", that's what olmo-eval is for. We're releasing it openly so the community can build on it. 💻 Code: buff.ly/veAANKX 📝 Blog: buff.ly/64B7dPh
After training a model with a new intervention, olmo-eval lets you line two model checkpoints up question by question—holding everything else fixed. The comparison view makes it easier to see real gains & regressions.
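A minimal sketch of what a question-by-question comparison between two checkpoints could look like. This is not olmo-eval's actual API: `QuestionResult` and `compare_checkpoints` are hypothetical names, and the sketch assumes each question is graded as simply correct or incorrect.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuestionResult:
    """Hypothetical per-question record; not olmo-eval's real schema."""
    question_id: str
    correct: bool


def compare_checkpoints(baseline, candidate):
    """Pair results by question ID and sort them into gains, regressions, unchanged.

    Only questions present in both runs are compared, so the comparison
    holds the question set fixed.
    """
    base = {r.question_id: r.correct for r in baseline}
    cand = {r.question_id: r.correct for r in candidate}
    shared = sorted(base.keys() & cand.keys())
    return {
        # Wrong before, right now.
        "gains": [q for q in shared if not base[q] and cand[q]],
        # Right before, wrong now.
        "regressions": [q for q in shared if base[q] and not cand[q]],
        # Same outcome in both runs.
        "unchanged": [q for q in shared if base[q] == cand[q]],
    }


# Toy data standing in for two checkpoints evaluated on the same questions.
baseline = [
    QuestionResult("q1", True),
    QuestionResult("q2", False),
    QuestionResult("q3", True),
]
candidate = [
    QuestionResult("q1", True),
    QuestionResult("q2", True),
    QuestionResult("q3", False),
]
diff = compare_checkpoints(baseline, candidate)
```

Both toy checkpoints score 2 out of 3 here, yet `diff` shows one gain (`q2`) and one regression (`q3`). That is the kind of change an aggregate score alone would hide.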
In olmo-eval, every component is swappable: the model being evaluated, its tools, LLM-as-a-judge graders, & more. You can change one without touching the rest. Benchmark results land in a uniform schema, so checkpoints stay comparable across a long project.
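One way to picture swappable components feeding a uniform schema, as a rough sketch only. Every name below (`Model`, `Grader`, `ResultRow`, `run_benchmark`, and the toy implementations) is an assumption made for illustration and is not taken from the olmo-eval codebase.

```python
from dataclasses import dataclass, asdict
from typing import Protocol


class Model(Protocol):
    """Hypothetical interface for the model being evaluated."""
    def generate(self, prompt: str) -> str: ...


class Grader(Protocol):
    """Hypothetical interface for a grader (exact match, LLM-as-a-judge, ...)."""
    def score(self, prediction: str, reference: str) -> float: ...


@dataclass(frozen=True)
class ResultRow:
    """Hypothetical uniform row: every benchmark run emits the same fields."""
    checkpoint: str
    benchmark: str
    question_id: str
    prediction: str
    score: float


class UppercaseModel:
    """Toy stand-in for a real model."""
    def generate(self, prompt: str) -> str:
        return prompt.upper()


class ExactMatchGrader:
    """Toy grader; could be swapped for another grader without touching the model."""
    def score(self, prediction: str, reference: str) -> float:
        return 1.0 if prediction.strip() == reference.strip() else 0.0


def run_benchmark(checkpoint, benchmark, model: Model, grader: Grader, items):
    """Run (question_id, prompt, reference) items through any model/grader pair."""
    rows = []
    for question_id, prompt, reference in items:
        prediction = model.generate(prompt)
        rows.append(
            ResultRow(checkpoint, benchmark, question_id, prediction,
                      grader.score(prediction, reference))
        )
    return rows


items = [("q1", "hello", "HELLO"), ("q2", "abc", "xyz")]
rows = run_benchmark("ckpt-1", "toy", UppercaseModel(), ExactMatchGrader(), items)
records = [asdict(row) for row in rows]  # plain dicts, same keys for every run
```

Because `run_benchmark` depends only on the two interfaces, replacing the grader or the model leaves the rest of the code untouched, and every run yields rows with identical fields.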
23h
Ai2
Building an LLM means evaluating it over & over as it changes. Tweak a hyperparameter or scale the model up, & every new checkpoint sends you back through the same benchmarking loop. We're releasing olmo-eval, a workbench built for this kind of iterative model development. 🧵
23h