It was super fun to take our first step in interpreting multimodal LLMs, working closely with the brilliant @alexpietroserra.bsky.social and @EmanuelePanizon.

🎯 Key finding: In these models, the hidden representations of images and text form disjoint clusters, and communication between the two modalities is mediated by the special token <end-of-image>!

✅ This shows that, starting from the mid-layers, a single token effectively summarizes all 1024 image tokens!

❌ This does not occur in models fine-tuned for visual understanding (such as Pixtral).