olmo-eval builds on our OLMES project, which made benchmark scores comparable and reproducible by standardizing how models are evaluated.
But a final score is only part of the story: olmo-eval also works across the intermediate experiments that teams compare throughout model development.