Why you should probe more than just the final layer of your Vision Transformer to maximize performance. 🧵👇
1/ Task-relevant info is distributed across the entire hierarchy, not just the final layer. We propose Attentive Multi-Layer Fusion to unlock this potential.
3/ The impact:
✅ Consistent gains across 20 datasets.
✅ +5.54 pp avg. improvement over standard linear probes.
✅ Works across model scales (Small to Large) and training objectives (CLIP, DINOv2, Supervised).
4/ Interpretation: Attention heatmaps reveal that specialized tasks (medical, satellite) rely heavily on intermediate layers, while natural images favor later layers.
5/ Best part? It's parameter-efficient and keeps the backbone frozen! ❄️
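For the curious, here is a minimal sketch of what attention-weighted fusion over layers could look like. The thread does not spell out the actual mechanism, so everything here is an assumption made for illustration: the function names `softmax` and `attentive_fusion`, a single learned query vector, and using each layer's feature vector directly as both key and value. The random vectors stand in for per-layer features from a frozen backbone; in a real probe only the query and a linear head (omitted here) would be trained.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_fusion(layer_feats, query):
    """Fuse one feature vector per layer into a single vector.

    layer_feats: list of per-layer feature vectors, all of length len(query)
    query: hypothetical learned query vector (the trainable part)
    Returns (fused_vector, per_layer_attention_weights).
    """
    dim = len(query)
    # Scaled dot-product score between the query and each layer's features.
    scores = [
        sum(q * f for q, f in zip(query, feats)) / math.sqrt(dim)
        for feats in layer_feats
    ]
    weights = softmax(scores)
    # Weighted sum of the layer features, one coordinate at a time.
    fused = [
        sum(w * feats[i] for w, feats in zip(weights, layer_feats))
        for i in range(dim)
    ]
    return fused, weights


random.seed(0)
num_layers, dim = 12, 8
# Stand-in for frozen-backbone features (e.g. one CLS vector per layer).
layer_feats = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_layers)]
query = [random.gauss(0, 1) for _ in range(dim)]

fused, weights = attentive_fusion(layer_feats, query)
print("fused dim:", len(fused))
print("most-attended layer index:", weights.index(max(weights)))
```

The per-layer `weights` are the kind of quantity an attention heatmap over layers would visualize: a task that leans on intermediate layers would put more mass on the middle indices.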