Breakthrough AI to solve the world's biggest problems.
Ai2
olmo-eval builds on our OLMES project, which made benchmark scores comparable & reproducible by standardizing how models are evaluated.
But a final score is only part of the story: olmo-eval also covers the intermediate experiments that teams compare throughout model development.
If you find yourself asking "how does this model checkpoint differ from the last, and where did it improve or regress?", that's what olmo-eval is for.
We're releasing it openly so the community can build on it.
💻 Code: buff.ly/veAANKX
📝 Blog: buff.ly/64B7dPh
After training a model with a new intervention, olmo-eval lets you line up two model checkpoints question by question while holding everything else fixed.
The comparison view makes it easier to see real gains & regressions.
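To make the question-by-question idea concrete, here is a minimal sketch in plain Python. It is not olmo-eval's API: the function name `compare_checkpoints` and the per-question correctness dictionaries are hypothetical, and it assumes both checkpoints were scored on the same questions with the same settings.

```python
from typing import Dict, List


def compare_checkpoints(
    before: Dict[str, bool], after: Dict[str, bool]
) -> Dict[str, List[str]]:
    """Bucket shared question IDs by how correctness changed between checkpoints."""
    # Only compare questions that both checkpoints answered.
    shared = sorted(before.keys() & after.keys())
    return {
        "gains": [q for q in shared if not before[q] and after[q]],
        "regressions": [q for q in shared if before[q] and not after[q]],
        "unchanged": [q for q in shared if before[q] == after[q]],
    }


# Hypothetical per-question results (True = answered correctly).
ckpt_a = {"q1": True, "q2": False, "q3": True, "q4": False}
ckpt_b = {"q1": True, "q2": True, "q3": False, "q4": False}

diff = compare_checkpoints(ckpt_a, ckpt_b)
print(diff)
```

In this toy data both checkpoints get 2 of 4 questions right, so the aggregate score hides the fact that one question improved and another regressed. That is the kind of difference a per-question view surfaces.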
In olmo-eval, every component is swappable: the model being evaluated, its tools, LLM-as-a-judge graders, & more. You can change one without touching the rest.
Benchmark results land in a uniform schema, so checkpoints stay comparable across a long project.
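As a rough illustration of swappable components and a uniform result schema, the sketch below uses plain Python callables and a dataclass. Every name here (`Result`, `evaluate`, `exact_match`, the field names) is an assumption made for illustration and does not reflect olmo-eval's actual interfaces or schema.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical component types: a model maps a prompt to an answer,
# a grader maps (prediction, reference) to a score.
Model = Callable[[str], str]
Grader = Callable[[str, str], float]


@dataclass(frozen=True)
class Result:
    """One row per question; the same fields for every checkpoint and benchmark."""
    checkpoint: str
    benchmark: str
    question_id: str
    score: float


def exact_match(prediction: str, reference: str) -> float:
    return 1.0 if prediction.strip() == reference.strip() else 0.0


def evaluate(
    checkpoint: str,
    benchmark: str,
    model: Model,
    grader: Grader,
    questions: List[Tuple[str, str, str]],  # (question_id, prompt, reference)
) -> List[Result]:
    # The model and grader are passed in, so either can be swapped
    # without changing this loop or the shape of the output.
    return [
        Result(checkpoint, benchmark, qid, grader(model(prompt), reference))
        for qid, prompt, reference in questions
    ]


# Toy stand-ins for a real model and benchmark.
questions = [("q1", "2 + 2 = ?", "4"), ("q2", "3 + 3 = ?", "6")]
always_four: Model = lambda prompt: "4"

results = evaluate("step-1000", "toy-arithmetic", always_four, exact_match, questions)
print(results)
```

Because every run emits the same record shape, results from different checkpoints can be concatenated and compared later without per-run conversion code.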
Building an LLM means evaluating it over & over as it changes. Tweak a hyperparameter or scale the model up, & every new checkpoint sends you back through the same benchmarking loop.
We're releasing olmo-eval, a workbench built for this kind of iterative model development. 🧵