ProfileReplies

🧵[3/n] 📉 Even Gradients Are Sparse in RL 📉 🧠 In PRIME, 72% of parameters never receive any gradient at any point in training. ↔️ Others do receive gradients, but those gradients cancel out over time. 🎯 It’s not just the updates that are sparse: the gradients themselves are sparse.
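
To make the two cases in this post concrete (parameters that never see a gradient vs. parameters whose gradients cancel), here is a minimal sketch. It assumes plain SGD with a constant learning rate, so the net update is proportional to the sum of per-step gradients; adaptive optimizers would behave differently. The function name `classify_parameters`, the tolerance, and the gradient values are hypothetical, not taken from the thread.

```python
def classify_parameters(grad_history, tol=1e-12):
    """Count parameters by how their gradients behave over training.

    grad_history: list of per-step gradient lists, all the same length.
    Assumption (not from the thread): plain SGD with a constant learning
    rate, so the net update is proportional to the sum of the gradients.
    """
    n_params = len(grad_history[0])
    counts = {"never": 0, "cancelled": 0, "updated": 0}
    for i in range(n_params):
        steps = [step[i] for step in grad_history]
        if all(abs(g) <= tol for g in steps):
            counts["never"] += 1        # no gradient at any step
        elif abs(sum(steps)) <= tol:
            counts["cancelled"] += 1    # nonzero gradients that sum to zero
        else:
            counts["updated"] += 1      # nonzero net update
    return counts


# Made-up gradients for 4 parameters over 3 steps.
grad_history = [
    [0.0, 0.50, 0.2, 0.0],
    [0.0, -0.25, 0.1, 0.0],
    [0.0, -0.25, 0.3, 0.0],
]
counts = classify_parameters(grad_history)
print(counts)
```

Both "never" and "cancelled" parameters end up unchanged, which is why looking only at the final weights cannot tell the two cases apart.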
🧵[2/n] 💡 SFT Updates Are Dense 💡 Unlike RL, Supervised Fine-Tuning (SFT) updates are much denser 🧠 📊 Sparsity is low: at most 15.31% of parameters remain untouched.
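
As a rough illustration of the quantity discussed here, update sparsity can be read as the fraction of parameters left untouched by fine-tuning. The sketch below assumes that reading; the helper `update_sparsity`, the `tol` parameter, and the toy weights are hypothetical and only show how such a number could be computed.

```python
def update_sparsity(base, tuned, tol=0.0):
    """Fraction of parameters whose value is unchanged after fine-tuning.

    tol is an assumed tolerance; tol=0.0 counts only exact equality.
    """
    if len(base) != len(tuned):
        raise ValueError("parameter vectors must have the same length")
    if not base:
        raise ValueError("parameter vectors must be non-empty")
    unchanged = sum(1 for b, t in zip(base, tuned) if abs(t - b) <= tol)
    return unchanged / len(base)


# Made-up weights for 8 parameters, before and after fine-tuning.
base = [0.10, -0.20, 0.30, 0.40, -0.50, 0.60, 0.70, -0.80]
rl_tuned = [0.10, -0.20, 0.35, 0.40, -0.50, 0.60, 0.65, -0.80]   # 2 changed
sft_tuned = [0.12, -0.21, 0.35, 0.40, -0.45, 0.61, 0.65, -0.79]  # 7 changed

rl_sparsity = update_sparsity(base, rl_tuned)
sft_sparsity = update_sparsity(base, sft_tuned)
print(rl_sparsity, sft_sparsity)
```

In this toy setup the RL-style update is sparse (most weights untouched) and the SFT-style update is dense, mirroring the direction of the contrast described in the post, not its actual numbers.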
🧵[5/n] 📊 🧪 Training the Subnetwork Reproduces the Full Model 1️⃣ When trained in isolation, the sparse subnetwork recovers almost exactly the same weights as the full model. 2️⃣ It achieves comparable (or better) end-task performance. 3️⃣ 🧮 Even the training loss converges more smoothly.
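
One way to picture "training the subnetwork in isolation" is gradient masking: only parameters inside the mask are allowed to move. The sketch below assumes that procedure and uses a separable toy loss where the match with full training holds trivially, so it illustrates the mechanics only, not the finding. The `sgd` helper, the `target` vector, and the hyperparameters are all hypothetical.

```python
def sgd(weights, grad_fn, steps, lr, mask=None):
    """Plain SGD; if mask is given, only parameters with mask[i] True move."""
    w = list(weights)
    for _ in range(steps):
        g = grad_fn(w)
        for i in range(len(w)):
            if mask is None or mask[i]:
                w[i] -= lr * g[i]
    return w


# Toy separable loss: sum_i (w_i - target_i)^2. Parameters whose target
# equals the base value get zero gradient and never move.
target = [0.0, 1.0, 0.0, -1.0]


def grad_fn(w):
    return [2.0 * (wi - ti) for wi, ti in zip(w, target)]


base = [0.0, 0.0, 0.0, 0.0]

# 1) Train everything, then read off which parameters actually changed.
full = sgd(base, grad_fn, steps=50, lr=0.1)
mask = [f != b for f, b in zip(full, base)]

# 2) Retrain from the same base, updating only the masked subnetwork.
sub = sgd(base, grad_fn, steps=50, lr=0.1, mask=mask)
print(mask, sub == full)
```

In a real network the parameters interact, so agreement between the two runs is an empirical result rather than something guaranteed by construction as it is here.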
🧵[8/n] To the best of our knowledge, this is the first mechanistic evidence of a contrast between learning from in-distribution (on-policy) data and off-distribution (off-policy) data.
May 21, 2025
🧵[6/n] 🌐 The Subnetwork Is General 🔁 Subnetworks trained with different seeds, datasets, or even algorithms show nontrivial overlap. 🧩 This suggests the subnetwork is a generalizable structure tied to the base model. 🧠 A shared backbone seems to emerge, no matter how you train it.
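
The overlap between two subnetworks can be quantified in several ways; the thread does not say which metric is used. As a sketch under that caveat, the code below compares two boolean update masks with Jaccard similarity and per-mask coverage. The function `mask_overlap` and the example masks are hypothetical.

```python
def mask_overlap(mask_a, mask_b):
    """Compare two boolean update masks over the same parameters.

    The metrics (Jaccard similarity and coverage) are an assumed choice;
    the thread does not specify how overlap is measured.
    """
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same length")
    a = {i for i, m in enumerate(mask_a) if m}
    b = {i for i, m in enumerate(mask_b) if m}
    inter = len(a & b)
    union = len(a | b)
    return {
        "jaccard": inter / union if union else 1.0,
        "a_covered_by_b": inter / len(a) if a else 1.0,
        "b_covered_by_a": inter / len(b) if b else 1.0,
    }


# Made-up masks: which of 8 parameters were updated in two training runs.
seed_1 = [True, True, False, False, True, False, True, False]
seed_2 = [True, False, False, False, True, True, True, False]

overlap = mask_overlap(seed_1, seed_2)
print(overlap)
```

To judge whether an overlap is "nontrivial", it would also need to be compared against what two random masks of the same density would share by chance.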