Building an LLM means evaluating it over & over as it changes. Tweak a hyperparameter or scale the model up, & every new checkpoint sends you back through the same benchmarking loop.
We're releasing olmo-eval, a workbench built for this kind of iterative model development. 🧵