New preprint out! 🎉 How does LLM training loss translate to downstream performance? We show that pretraining data and tokenizer shape loss-to-loss scaling, while architecture and other factors play a surprisingly minor role! brendel-group.github.io/llm-line/ 🧵1/8
Compute-to-train loss scaling laws guide LLM pretraining, but how do training/val losses map to downstream task loss? What factors shape these laws? We analyze loss-to-loss scaling laws, extending prior work beyond a single architectural setting to a number of configurations. 2/8
We systematically vary pretraining data, tokenizer, architecture (Llama vs. Mamba), model size, context length, and optimizer settings, and evaluate over 6000 model checkpoints to uncover the true drivers of loss-to-loss scaling laws. 3/8
📊 Key finding: The choice of pretraining data and tokenizer has the largest impact on scaling trends. Even switching from Llama (Transformer) to Mamba (State-Space Model) barely changes loss-to-loss relationships! 4/8
📉 In contrast, architecture, model size, context length, and optimizer settings have negligible impact. This suggests architectures can be freely optimized for efficiency, while data curation is the real key to strong generalization. 5/8
🔍 Our work refines the understanding of scaling laws beyond compute-based models, showing that loss-to-loss trends are shaped by training data, not model structure. The implications? Better dataset curation can unlock better generalization. 6/8
Further, our results suggest that for a given pretraining dataset, breaking past current loss-to-loss trends requires radically new architectures or loss functions. Existing models all behave strikingly alike. 7/8
Work co-led with @thwiedemer.bsky.social, in collaboration with Sayak Mallick, Matthias Bethge and @wielandbrendel.bsky.social. Website: brendel-group.github.io/llm-line/ Preprint: arxiv.org/abs/2502.12120 Code: github.com/brendel-grou... 8/8
Feb 18, 2025
Prasanna Mayilvahanan
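For readers who want to experiment with the idea of a loss-to-loss relationship, here is a minimal sketch of fitting one. It assumes a plain power law, downstream loss = K * (train loss)^kappa, with no irreducible-loss offsets. That is a simplification chosen for illustration and not necessarily the functional form used in the preprint. The function name `fit_power_law` and all numbers are hypothetical and synthetic, not results from the paper.

```python
import math

def fit_power_law(x_losses, y_losses):
    """Fit y = K * x**kappa by ordinary least squares in log-log space.

    Assumption: a pure power law without offsets (illustrative only).
    """
    log_x = [math.log(x) for x in x_losses]
    log_y = [math.log(y) for y in y_losses]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    # Slope of the regression line in log-log space is the exponent kappa.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    var = sum((a - mean_x) ** 2 for a in log_x)
    kappa = cov / var
    # The intercept in log space gives log(K).
    K = math.exp(mean_y - kappa * mean_x)
    return K, kappa

# Synthetic "checkpoints": made-up train losses and downstream losses
# generated from a known power law, so the fit can be checked.
train_losses = [3.2, 3.0, 2.8, 2.6, 2.4]
downstream_losses = [1.5 * loss ** 0.8 for loss in train_losses]

K, kappa = fit_power_law(train_losses, downstream_losses)
print(f"K = {K:.3f}, kappa = {kappa:.3f}")
```

Under this sketch, the thread's claim would correspond to checkpoints from different architectures landing on roughly the same fitted curve, while a change of pretraining data or tokenizer would shift K or kappa.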
CuratedThoughts: Data Curation for RL Datasets 🚀 Since DeepSeek-R1 introduced reasoning-based RL, datasets like Open-R1 & OpenThoughts emerged for fine-tuning & GRPO. Our deep dive found major flaws: 25% of OpenThoughts had to be removed through data curation. Here's why 👇🧵
Feb 17, 2025
Andreas Hochlehnert