
Why you should probe more than just the final layer of your Vision Transformer to maximize performance. 🧵👇

1/ Task-relevant info is distributed across the entire hierarchy, not just the final layer. We propose Attentive Multi-Layer Fusion to unlock this potential.

3/ The impact:
✅ Consistent gains across 20 datasets.
✅ +5.54 pp avg. improvement over standard linear probes.
✅ Works across model scales (Small to Large) and training objectives (CLIP, DINOv2, Supervised).

4/ Interpretation: Attention heatmaps reveal that specialized tasks (medical, satellite) rely heavily on intermediate layers, while natural images favor later layers.

5/ Best part? It's parameter-efficient and keeps the backbone frozen! ❄️
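
For the curious, here is a rough sketch of the idea in 1/: fuse one feature vector per layer using attention weights over the layers. The thread does not spell out the mechanism, so everything here is an assumption for illustration: the function name `attentive_layer_fusion`, the single query vector, the scaled dot-product scoring, and the random stand-in features. In a real setup the per-layer features would come from the frozen backbone, and only the query plus a small classifier head would be trained.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_layer_fusion(layer_feats, query):
    """Fuse per-layer feature vectors into one vector (hypothetical sketch).

    layer_feats: list of L vectors, one per backbone layer (e.g. CLS tokens).
    query: a vector of the same dimension; assumed to be the learnable part.
    Returns the fused vector and the attention weight given to each layer.
    """
    dim = len(query)
    scale = math.sqrt(dim)
    # Score each layer by scaled dot product with the query.
    scores = [sum(q * f for q, f in zip(query, feat)) / scale
              for feat in layer_feats]
    weights = softmax(scores)
    # Weighted sum of the layer features.
    fused = [sum(w * feat[i] for w, feat in zip(weights, layer_feats))
             for i in range(dim)]
    return fused, weights


# Stand-in data: 12 layers, 8-dim features (random, not real ViT outputs).
random.seed(0)
num_layers, dim = 12, 8
layer_feats = [[random.gauss(0, 1) for _ in range(dim)]
               for _ in range(num_layers)]
query = [random.gauss(0, 1) for _ in range(dim)]

fused, weights = attentive_layer_fusion(layer_feats, query)
```

The `weights` list holds one value per layer, which is the kind of quantity a per-layer attention heatmap like the one in 4/ could visualize. The fused vector would then go into a linear classifier, in place of the final-layer feature that a standard linear probe uses.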