Cambrian-S is a valuable first step in defining what “supersensing” might mean for video models. Our results simply highlight how subtle benchmark design choices can be exploited — and how we can improve them together. 📄 arxiv.org/abs/2511.16655 🔗 github.com/bethgelab/s...

This indicates that the tailored Cambrian-S inference strategy may rely on benchmark-specific shortcuts (e.g. rooms are never revisited), rather than building a persistent, spatial world model over time.

For VSI-Super-Counting (VSC), we run a sanity check:
🔁 VSC-Repeat: we concatenate each video with itself 1-5×
✅ Unique object count stays the same
❌ Cambrian-S accuracy drops from 42% → 0%
A genuine supersensing system should be robust here.
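
A minimal sketch of the VSC-Repeat idea, not the authors' actual code: a video is modeled as a plain list of frames, and the ground truth as a set of object IDs visible in each frame. The names `repeat_video` and `unique_object_count`, the toy labels, and the reading of "1-5×" as the total number of copies are all assumptions made for illustration.

```python
from typing import Hashable, List, Sequence, Set


def repeat_video(frames: Sequence, copies: int) -> List:
    """Concatenate a video with itself so it contains `copies` back-to-back copies.

    `frames` is any sequence of frames (hypothetical representation);
    `copies=1` returns the video unchanged.
    """
    if copies < 1:
        raise ValueError("copies must be >= 1")
    return list(frames) * copies


def unique_object_count(ids_per_frame: Sequence[Set[Hashable]]) -> int:
    """Ground-truth count: distinct object IDs seen anywhere in the video."""
    seen: Set[Hashable] = set()
    for ids in ids_per_frame:
        seen |= ids
    return len(seen)


# Toy video: 4 frames with per-frame object annotations (made-up labels).
frames = ["f0", "f1", "f2", "f3"]
labels = [{"chair_0"}, {"chair_0", "chair_1"}, {"table_0"}, set()]

base_count = unique_object_count(labels)

for copies in range(1, 6):
    repeated_frames = repeat_video(frames, copies)
    repeated_labels = repeat_video(labels, copies)
    # The video gets longer, but the set of distinct objects is unchanged.
    assert len(repeated_frames) == len(frames) * copies
    assert unique_object_count(repeated_labels) == base_count
```

The point of the check is that repetition changes only the video length, so any model whose predicted count grows with the number of copies is counting re-observations rather than distinct objects.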