Excited about this new work from @haoyuhe.bsky.social. TL;DR: Diffusion language models treat learning and inference differently, which lowers performance. RL can be used to overcome this issue for certain problems.

9mo

Andreas Geiger

We introduce a simple baseline called NoSense, an image-only (SigLIP) model that discards almost all temporal structure. Surprisingly, it reaches 95% accuracy on VSI-Super-Recall (VSR), even on 4-hour videos. This suggests VSR can be solved without true spatial supersensing.

6mo

For VSI-Super-Counting (VSC), we run a sanity check:
🔁 VSC-Repeat: we concatenate each video with itself 1-5×
✅ Unique object count stays the same
❌ Cambrian-S accuracy drops from 42% → 0%
A genuine supersensing system should be robust here.

This indicates that the tailored Cambrian-S inference strategy may rely on benchmark-specific shortcuts (e.g. rooms are never revisited), rather than building a persistent, spatial world model over time.

6mo
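
The VSC-Repeat check described above can be sketched in a few lines. This is only an illustrative toy, not the authors' code: videos are modelled as lists of frames, each frame is a list of object IDs, and the helper names `repeat_video` and `unique_object_count` are hypothetical. It also assumes "1-5×" means the video appears one to five times in the concatenated clip.

```python
from typing import List, Sequence

# Assumption: a "frame" is just the list of object IDs visible in it.
Frame = List[str]


def repeat_video(frames: Sequence[Frame], times: int) -> List[Frame]:
    """Concatenate a video with itself so that it plays `times` times in total."""
    if times < 1:
        raise ValueError("times must be >= 1")
    return list(frames) * times


def unique_object_count(frames: Sequence[Frame]) -> int:
    """Ground-truth label: number of distinct object IDs across all frames."""
    return len({obj for frame in frames for obj in frame})


# Toy video with three distinct objects (hypothetical data).
video = [["chair_1"], ["chair_1", "lamp_1"], ["table_1"]]

for k in range(1, 6):
    repeated = repeat_video(video, k)
    # The clip gets longer, but the correct answer does not change.
    assert len(repeated) == k * len(video)
    assert unique_object_count(repeated) == unique_object_count(video)
```

A counter that treats every re-appearance as a new object would fail this invariant, which is the kind of shortcut the check is meant to expose.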

🚨 New Paper: "Solving Spatial Supersensing Without Spatial Supersensing" Huge credit to the Cambrian-S team for tackling one of the hardest open problems in video understanding: spatial supersensing. In our paper, we take a closer look at their benchmarks & methods 👇

6mo

Presenting "A Sober Look at Progress in LM Reasoning" at @colmweb.org today 🇨🇦 #COLM2025
📅 Today
🕔 11:00 AM – 1:00 PM
📍 Room 710 - Poster #31
We find that many “reasoning” gains fall within variance and show how to make evaluation reproducible again.
📘 bethgelab.github.io/sober-reasoning

8mo

Cambrian-S is a valuable first step in defining what “supersensing” might mean for video models. Our results simply highlight how subtle benchmark design choices can be exploited — and how we can improve them together. 📄 arxiv.org/abs/2511.16655 🔗 github.com/bethgelab/s...

6mo

Andreas Hochlehnert