🧵[5/n]📊
🧪 Training the Subnetwork Reproduces Full Model
1️⃣ When trained in isolation, the sparse subnetwork recovers almost the exact same weights as the full model
2️⃣ achieves comparable (or better) end-task performance
3️⃣ 🧮 Even the training loss converges more smoothly