🧵[3/n] 📉 Even Gradients Are Sparse in RL 📉 🧠 In PRIME, 72% of parameters never receive any gradient at all! ↔️ Others do, but their gradients cancel out over time. 🎯 It's not just the updates that are sparse: the gradients themselves are sparse too.
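A minimal sketch of the two cases in this post, assuming plain SGD with a constant learning rate so that a parameter's net update is proportional to the sum of its per-step gradients (adaptive optimizers such as Adam would not cancel this cleanly). The function name `classify_parameters` and the toy gradients are hypothetical, not taken from the thread.

```python
def classify_parameters(grads_per_step, tol=1e-12):
    """Split parameter indices into three groups.

    never     : gradient is exactly zero at every step
    cancelled : some non-zero gradients, but they sum to ~0 over training
    updated   : gradients sum to something non-zero
    """
    n_params = len(grads_per_step[0])
    never, cancelled, updated = [], [], []
    for i in range(n_params):
        column = [step[i] for step in grads_per_step]
        if all(g == 0.0 for g in column):
            never.append(i)
        elif abs(sum(column)) <= tol:
            cancelled.append(i)
        else:
            updated.append(i)
    return never, cancelled, updated


# Toy example: 2 training steps, 3 parameters (made-up numbers).
steps = [
    [0.0, 0.5, 0.25],
    [0.0, -0.5, 0.25],
]
never, cancelled, updated = classify_parameters(steps)
```

Here parameter 0 never receives a gradient, parameter 1 receives gradients that cancel, and parameter 2 is actually moved.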

🧵[2/n] 💡 SFT Updates Are Dense 💡 Unlike RL, Supervised Fine-Tuning (SFT) produces much denser updates 🧠 📊 Sparsity is low: at most 15.31% of parameters remain untouched.
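For readers who want to see what "sparsity" means here, a rough sketch: it is the fraction of parameters whose value is the same before and after fine-tuning. The helper `update_sparsity`, the tolerance `tol`, and the toy weights are illustrative assumptions, not the thread's actual measurement code.

```python
def update_sparsity(base, tuned, tol=0.0):
    """Fraction of parameters left unchanged (within tol) by fine-tuning."""
    if len(base) != len(tuned):
        raise ValueError("base and tuned must have the same length")
    unchanged = sum(1 for b, t in zip(base, tuned) if abs(t - b) <= tol)
    return unchanged / len(base)


# Toy flattened weights (made-up numbers): only one of four parameters moved.
base = [0.5, -1.0, 0.25, 2.0]
tuned = [0.5, -1.0, 0.30, 2.0]
sparsity = update_sparsity(base, tuned)
```

A sparse run leaves this number high; a dense run drives it toward zero.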

🧵[5/n]📊 🧪 Training the Subnetwork Reproduces the Full Model 1️⃣ When trained in isolation, the sparse subnetwork recovers almost exactly the same weights as the full model 2️⃣ It achieves comparable (or better) end-task performance 3️⃣ 🧮 Even the training loss converges more smoothly
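One way to picture "training the subnetwork in isolation" is to freeze every parameter outside a fixed mask and apply the optimizer step only inside it. The sketch below assumes plain SGD and a boolean mask; `masked_sgd_step` and the numbers are hypothetical, and the actual experiments may have been set up differently.

```python
def masked_sgd_step(params, grads, mask, lr=0.1):
    """One SGD step that only moves parameters where mask is True."""
    return [p - lr * g if m else p for p, g, m in zip(params, grads, mask)]


# Toy example: the middle parameter is outside the subnetwork and stays frozen.
params = [1.0, 2.0, 3.0]
grads = [1.0, 1.0, 1.0]
mask = [True, False, True]
new_params = masked_sgd_step(params, grads, mask, lr=0.5)
```

Comparing the weights from such a masked run against a full fine-tuning run is the kind of check the first point describes.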

🧵[8/n] To the best of our knowledge, this is the first mechanistic evidence of a contrast between learning from in-distribution (on-policy) data and off-distribution (off-policy) data.

May 21, 2025

🧵[6/n] 🌐 The Subnetwork Is General 🔁 Subnetworks trained with different seeds, datasets, or even algorithms show nontrivial overlap 🧩 This suggests the subnetwork is a generalizable structure tied to the base model 🧠 A shared backbone seems to emerge, no matter how you train it
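A hedged sketch of how "nontrivial overlap" could be checked: measure what share of one run's updated parameters also appears in another run's, and compare it with the overlap expected if the second set were picked at random. The function names and the index sets are illustrative assumptions; the thread does not say which overlap measure was used.

```python
def overlap(updated_a, updated_b):
    """Share of run A's updated parameter indices that run B also updated."""
    a, b = set(updated_a), set(updated_b)
    return len(a & b) / len(a)


def chance_overlap(updated_b, n_params):
    """Expected overlap if run B's indices were chosen uniformly at random."""
    return len(set(updated_b)) / n_params


# Toy example (made-up indices): 10 parameters, two runs that each touch 4.
run_a = [0, 1, 2, 3]
run_b = [2, 3, 4, 5]
observed = overlap(run_a, run_b)
baseline = chance_overlap(run_b, n_params=10)
```

An observed overlap clearly above the random baseline is what would count as nontrivial under this measure.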