Breakthrough AI to solve the world's biggest problems. | Join us: http://allenai.org/careers | Get our newsletter: https://share.hsforms.com/1uJkWs5aDRHWhiky3aHooIg3ioxm

Ai2

In olmo-eval, every component is swappable: the model being evaluated, its tools, LLM-as-a-judge graders, & more. You can change one without touching the rest. Benchmark results land in a uniform schema, so checkpoints stay comparable across a long project.
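
To make the "swap one piece, keep the schema" idea concrete, here's a minimal sketch in plain Python. It is not olmo-eval's API: `Result`, `evaluate`, both graders, and the toy model are hypothetical names made up for illustration. The grader changes while the model and the result record stay the same.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class Result:
    """Hypothetical uniform record: every run produces the same fields."""
    checkpoint: str
    benchmark: str
    metric: str
    score: float


def evaluate(
    checkpoint: str,
    benchmark: str,
    model: Callable[[str], str],
    grader: Callable[[str, str], float],
    examples: Sequence[Tuple[str, str]],
) -> Result:
    # The model and grader are plain callables, so either can be replaced
    # without touching the other or the result schema.
    scores = [grader(model(prompt), reference) for prompt, reference in examples]
    return Result(checkpoint, benchmark, "accuracy", sum(scores) / len(scores))


def exact_match(prediction: str, reference: str) -> float:
    return float(prediction.strip() == reference.strip())


def lenient_match(prediction: str, reference: str) -> float:
    # Same interface as exact_match, but ignores letter case.
    return float(prediction.strip().lower() == reference.strip().lower())


# Toy stand-ins (assumptions): a lookup-table "model" and two examples.
examples = [("2+2", "4"), ("capital of France", "Paris")]
answers = {"2+2": "4", "capital of France": "paris"}


def toy_model(prompt: str) -> str:
    return answers[prompt]


strict = evaluate("step-1000", "toy-qa", toy_model, exact_match, examples)
lenient = evaluate("step-1000", "toy-qa", toy_model, lenient_match, examples)
print(strict)
print(lenient)
```

Both runs return the same `Result` shape, so they can sit side by side in one table even though the grader differs.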

Building an LLM means evaluating it over & over as it changes. Tweak a hyperparameter or scale the model up, & every new checkpoint sends you back through the same benchmarking loop. We're releasing olmo-eval, a workbench built for this kind of iterative model development. 🧵

If you find yourself asking "how does this model checkpoint differ from the last, and where did it improve/regress?", that's what olmo-eval is for. We're releasing it openly so the community can build on it. 💻 Code: buff.ly/veAANKX 📝 Blog: buff.ly/64B7dPh
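
A rough sketch of that checkpoint-to-checkpoint question, assuming each checkpoint's results are already reduced to a `{benchmark: score}` mapping. `diff_checkpoints`, the benchmark names, and the scores are all made up for illustration; none of this comes from olmo-eval.

```python
from typing import Dict, Tuple


def diff_checkpoints(
    old: Dict[str, float],
    new: Dict[str, float],
    tolerance: float = 0.0,
) -> Dict[str, Tuple[float, str]]:
    """Label each shared benchmark as improved, regressed, or unchanged."""
    report = {}
    # Only compare benchmarks that both checkpoints were evaluated on.
    for benchmark in sorted(old.keys() & new.keys()):
        delta = new[benchmark] - old[benchmark]
        if delta > tolerance:
            status = "improved"
        elif delta < -tolerance:
            status = "regressed"
        else:
            status = "unchanged"
        report[benchmark] = (delta, status)
    return report


# Illustrative numbers only, not real results.
previous = {"math": 0.50, "code": 0.75, "qa": 0.25}
current = {"math": 0.75, "code": 0.50, "qa": 0.25}

report = diff_checkpoints(previous, current)
for benchmark, (delta, status) in report.items():
    print(f"{benchmark}: {delta:+.2f} ({status})")
```

In practice a nonzero `tolerance` helps keep run-to-run noise from being reported as a regression.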

olmo-eval builds on our OLMES project, which made benchmark scores comparable & reproducible by standardizing how models are evaluated. But a final score is only part of the story—olmo-eval works across the intermediate experiments teams compare throughout model development.