Breakthrough AI to solve the world's biggest problems. › Join us: http://allenai.org/careers › Get our newsletter: https://share.hsforms.com/1uJkWs5aDRHWhiky3aHooIg3ioxm

Ai2

olmo-eval builds on our OLMES project, which made benchmark scores comparable & reproducible by standardizing how models are evaluated. But a final score is only part of the story—olmo-eval works across the intermediate experiments teams compare throughout model development.

23h

If you find yourself asking "how does this model checkpoint differ from the last, and where did it improve/regress?", that's what olmo-eval is for. We're releasing it openly so the community can build on it. 💻 Code: buff.ly/veAANKX 📝 Blog: buff.ly/64B7dPh

After you train a model with a new intervention, olmo-eval lets you line up two checkpoints question by question while holding everything else fixed. The comparison view makes real gains & regressions easier to see.
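A question-by-question comparison can be sketched in a few lines of Python. This is only an illustration of the idea, not olmo-eval's API: the function `compare_checkpoints`, the result dictionaries, and the question IDs are all hypothetical, and it assumes each checkpoint's results reduce to a correct/incorrect flag per question.

```python
from typing import Dict, List


def compare_checkpoints(
    old: Dict[str, bool], new: Dict[str, bool]
) -> Dict[str, List[str]]:
    """Pair per-question correctness from two checkpoints.

    Hypothetical helper: keys are question IDs, values are whether the
    checkpoint answered that question correctly. Only questions present
    in both runs are compared, so everything else is held fixed.
    """
    report: Dict[str, List[str]] = {"gained": [], "regressed": [], "unchanged": []}
    for qid in sorted(old.keys() & new.keys()):
        if new[qid] and not old[qid]:
            report["gained"].append(qid)      # wrong before, right now
        elif old[qid] and not new[qid]:
            report["regressed"].append(qid)   # right before, wrong now
        else:
            report["unchanged"].append(qid)   # same outcome in both runs
    return report


# Toy data, invented for illustration only.
old_results = {"q1": True, "q2": False, "q3": True}
new_results = {"q1": True, "q2": True, "q3": False}
report = compare_checkpoints(old_results, new_results)
print(report)
```

Both toy checkpoints score 2 out of 3 here, so the aggregate is identical even though one question was gained and another regressed. That is the kind of difference a paired view surfaces and a final score hides.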

In olmo-eval, every component is swappable: the model being evaluated, its tools, LLM-as-a-judge graders, & more. You can change one without touching the rest. Benchmark results land in a uniform schema, so checkpoints stay comparable across a long project.
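One way to picture swappable components writing to a uniform schema is the sketch below. It is an assumption-laden illustration, not olmo-eval's actual design: `Record`, `evaluate`, `exact_match`, `contains_match`, and `toy_model` are hypothetical names, and the model and grader are reduced to plain callables.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Record:
    """Hypothetical uniform result row shared by every run."""
    checkpoint: str
    benchmark: str
    question_id: str
    response: str
    score: float


Model = Callable[[str], str]           # prompt -> response
Grader = Callable[[str, str], float]   # (response, reference) -> score
Question = Tuple[str, str, str]        # (question_id, prompt, reference)


def exact_match(response: str, reference: str) -> float:
    """Strict grader: full credit only for an exact string match."""
    return 1.0 if response.strip() == reference.strip() else 0.0


def contains_match(response: str, reference: str) -> float:
    """Lenient grader: full credit if the reference appears anywhere."""
    return 1.0 if reference.strip() in response else 0.0


def evaluate(
    checkpoint: str,
    benchmark: str,
    questions: List[Question],
    model: Model,
    grader: Grader,
) -> List[Record]:
    """Run one model with one grader; the output schema never changes."""
    records = []
    for qid, prompt, reference in questions:
        response = model(prompt)
        records.append(
            Record(checkpoint, benchmark, qid, response, grader(response, reference))
        )
    return records


def toy_model(prompt: str) -> str:
    """Stand-in for a real checkpoint; always gives the same answer."""
    return "The answer is 4"


questions = [("q1", "What is 2 + 2?", "4")]
strict = evaluate("step-1000", "toy-arithmetic", questions, toy_model, exact_match)
lenient = evaluate("step-1000", "toy-arithmetic", questions, toy_model, contains_match)
```

Swapping the grader changes the score but not the shape of the output, so rows from different runs can still be joined on `checkpoint`, `benchmark`, and `question_id`.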

23h

Ai2

Building an LLM means evaluating it over & over as it changes. Tweak a hyperparameter or scale the model up, & every new checkpoint sends you back through the same benchmarking loop. We're releasing olmo-eval, a workbench built for this kind of iterative model development. 🧵

23h