Breakthrough AI to solve the world's biggest problems.
› Join us: http://allenai.org/careers
› Get our newsletter: https://share.hsforms.com/1uJkWs5aDRHWhiky3aHooIg3ioxm
Ai2
In olmo-eval, every component is swappable: the model being evaluated, its tools, LLM-as-a-judge graders, & more. You can change one without touching the rest.
Benchmark results land in a uniform schema, so checkpoints stay comparable across a long project.
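To make the "swappable components, uniform schema" idea concrete, here is a minimal sketch in plain Python. It is not olmo-eval's actual API: the names `Model`, `Judge`, `Result`, `EchoModel`, `ExactMatchJudge`, and `evaluate` are hypothetical, chosen only to illustrate how a model and a grader can sit behind small interfaces while every run emits the same record shape.

```python
from dataclasses import dataclass
from typing import Protocol


class Model(Protocol):
    """Hypothetical interface: anything that turns a prompt into text."""
    def generate(self, prompt: str) -> str: ...


class Judge(Protocol):
    """Hypothetical interface: anything that grades a response in [0, 1]."""
    def score(self, response: str, reference: str) -> float: ...


@dataclass(frozen=True)
class Result:
    """One uniform record per (checkpoint, benchmark, metric)."""
    checkpoint: str
    benchmark: str
    metric: str
    value: float


class EchoModel:
    """Toy stand-in for a model: returns the prompt unchanged."""
    def generate(self, prompt: str) -> str:
        return prompt


class ExactMatchJudge:
    """Toy grader: 1.0 on an exact string match, else 0.0."""
    def score(self, response: str, reference: str) -> float:
        return 1.0 if response.strip() == reference.strip() else 0.0


def evaluate(checkpoint: str, model: Model, judge: Judge,
             benchmark: str, examples: list[tuple[str, str]]) -> Result:
    """Run one benchmark; the model and judge can be swapped independently."""
    scores = [judge.score(model.generate(prompt), ref) for prompt, ref in examples]
    return Result(checkpoint, benchmark, "accuracy", sum(scores) / len(scores))


# Toy data, for illustration only.
examples = [("hello", "hello"), ("2+2", "4")]
result = evaluate("step-1000", EchoModel(), ExactMatchJudge(), "toy", examples)
```

Swapping in a different judge or model changes only the argument passed to `evaluate`; the `Result` fields stay the same, which is what keeps records from different checkpoints comparable.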
Building an LLM means evaluating it over & over as it changes. Tweak a hyperparameter or scale the model up, & every new checkpoint sends you back through the same benchmarking loop.
We're releasing olmo-eval, a workbench built for this kind of iterative model development. 🧵
If you find yourself asking "how does this model checkpoint differ from the last, and where did it improve/regress?", that's what olmo-eval is for.
We're releasing it openly so the community can build on it.
💻 Code: buff.ly/veAANKX
📝 Blog: buff.ly/64B7dPh
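The checkpoint-to-checkpoint question above can be sketched as a small diff over result records. This is an illustration under assumptions, not olmo-eval's implementation: the record fields, the `diff_checkpoints` helper, the `tolerance` parameter, and the "higher is better" convention are all hypothetical, and the scores are toy values.

```python
def diff_checkpoints(old, new, tolerance=0.01):
    """Compare two lists of result records that share one schema.

    Each record is a dict with 'benchmark', 'metric', and 'value' keys
    (an assumed schema). Assumes higher values are better.
    """
    old_by_key = {(r["benchmark"], r["metric"]): r["value"] for r in old}
    rows = []
    for r in new:
        key = (r["benchmark"], r["metric"])
        if key not in old_by_key:
            continue  # nothing to compare against in the older checkpoint
        delta = r["value"] - old_by_key[key]
        if delta > tolerance:
            status = "improved"
        elif delta < -tolerance:
            status = "regressed"
        else:
            status = "unchanged"
        rows.append((key[0], key[1], delta, status))
    return rows


# Toy records for two checkpoints, for illustration only.
old = [
    {"checkpoint": "step-1000", "benchmark": "task_a", "metric": "accuracy", "value": 0.5},
    {"checkpoint": "step-1000", "benchmark": "task_b", "metric": "accuracy", "value": 0.75},
]
new = [
    {"checkpoint": "step-2000", "benchmark": "task_a", "metric": "accuracy", "value": 0.75},
    {"checkpoint": "step-2000", "benchmark": "task_b", "metric": "accuracy", "value": 0.5},
]
report = diff_checkpoints(old, new)
```

Because both checkpoints use the same record shape, the comparison needs no per-benchmark special cases; a real workflow would likely also account for run-to-run noise before calling a change a regression.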
olmo-eval builds on our OLMES project, which made benchmark scores comparable & reproducible by standardizing how models are evaluated.
But a final score is only part of the story. olmo-eval also works across the intermediate experiments that teams compare throughout model development.