If you find yourself asking "how does this model checkpoint differ from the last, and where did it improve/regress?", that's what olmo-eval is for.
We're releasing it openly so the community can build on it.
Code: buff.ly/veAANKX
Blog: buff.ly/64B7dPh