Excited about this new work from @haoyuhe.bsky.social. TL;DR: Diffusion language models treat learning and inference differently, which lowers performance. RL can be used to overcome this issue for certain problems.
Andreas Geiger
We introduce a simple baseline called NoSense, an image-only (SigLIP) model that discards almost all temporal structure.
Surprisingly, it reaches 95% accuracy on VSI-Super-Recall (VSR), even on 4-hour videos.
This suggests VSR can be solved without true spatial supersensing.
For VSI-Super-Counting (VSC), we run a sanity check:
VSC-Repeat: we concatenate each video with itself 1-5×
Unique object count stays the same
Cambrian-S accuracy drops from 42% → 0%
A genuine supersensing system should be robust here.
This indicates that the tailored Cambrian-S inference strategy may rely on benchmark-specific shortcuts (e.g. rooms are never revisited), rather than building a persistent, spatial world model over time.
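The idea behind the repeat check can be sketched with a toy example. This is not the paper's code or the Cambrian-S inference strategy: the `Frame` representation, the helper names, and the `no_revisit_count` shortcut are all hypothetical, and `k` is assumed to mean the total number of times the video plays.

```python
from typing import List, Set

# Hypothetical toy representation: each frame is the set of object IDs visible in it.
Frame = Set[str]


def repeat_video(frames: List[Frame], k: int) -> List[Frame]:
    """Concatenate a video with itself so that it plays k times in total."""
    return frames * k


def true_unique_count(frames: List[Frame]) -> int:
    """Ground truth: number of distinct objects across the whole video."""
    return len(set().union(*frames)) if frames else 0


def no_revisit_count(frames: List[Frame]) -> int:
    """Illustrative shortcut counter (an assumption, not a real method):
    it treats every object that was absent from the previous frame as new,
    which is only correct if nothing is ever seen again later."""
    total = 0
    prev: Frame = set()
    for frame in frames:
        total += len(frame - prev)  # objects not visible in the previous frame
        prev = frame
    return total


# Toy video, made up for illustration only.
video = [{"chair"}, {"chair", "lamp"}, {"sofa"}]
repeated = repeat_video(video, 2)

print(true_unique_count(video), true_unique_count(repeated))
print(no_revisit_count(video), no_revisit_count(repeated))
```

In this toy setting the ground-truth count is unchanged by repetition, while the shortcut counter is right on the original video and overcounts on the repeated one. That is the kind of failure a repeat test is designed to expose.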
🚨 New Paper: "Solving Spatial Supersensing Without Spatial Supersensing"
Huge credit to the Cambrian-S team for tackling one of the hardest open problems in video understanding: spatial supersensing. In our paper, we take a closer look at their benchmarks & methods
Presenting A Sober Look at Progress in LM Reasoning at @colmweb.org today 🇨🇦 #COLM2025
Today
11:00 AM – 1:00 PM
Room 710 - Poster #31
We find that many “reasoning” gains fall within variance and show how to make evaluation reproducible again.
bethgelab.github.io/sober-reasoning
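A minimal sketch of what "within variance" can mean in practice: compare the mean gain against the spread across random seeds. The function name, the one-standard-deviation threshold, and the scores below are assumptions made up for illustration; this is not the evaluation protocol from the paper.

```python
from statistics import mean, stdev
from typing import Sequence


def gain_within_seed_noise(
    baseline: Sequence[float],
    candidate: Sequence[float],
    num_std: float = 1.0,
) -> bool:
    """Return True if the mean gain is no larger than num_std times the
    larger of the two per-seed standard deviations (a simple heuristic)."""
    gain = mean(candidate) - mean(baseline)
    noise = max(stdev(baseline), stdev(candidate))
    return abs(gain) <= num_std * noise


# Made-up accuracies over three seeds each, for illustration only.
baseline_scores = [40.0, 44.0, 42.0]
candidate_scores = [42.0, 45.0, 42.0]

print(gain_within_seed_noise(baseline_scores, candidate_scores))
```

With these toy numbers the one-point mean gain is smaller than the seed-to-seed spread, so the check flags it as indistinguishable from noise. A proper comparison would use more seeds and a real statistical test.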
Cambrian-S is a valuable first step in defining what “supersensing” might mean for video models. Our results simply highlight how subtle benchmark design choices can be exploited, and how we can improve them together.
arxiv.org/abs/2511.16655
github.com/bethgelab/s...