Inlay

That makes it hard to tell whether wins come from the model design or just from more data/compute or favorable benchmarks. We fix this by holding the pretraining dataset and compute constant and standardizing evaluation across tasks,