Check out our newest paper!
As always, it was super fun working on this with @prasannamayil.bsky.social
Thaddäus Wiedemer
New preprint out! 🎉
How does LLM training loss translate to downstream performance?
We show that the choice of pretraining data and tokenizer shapes loss-to-loss scaling, while architecture and other factors play a surprisingly minor role!
brendel-group.github.io/llm-line/ 🧵1/8