We introduce a simple baseline called NoSense, an image-only (SigLIP) model that discards almost all temporal structure. Surprisingly, it reaches 95% accuracy on VSI-Super-Recall (VSR), even on 4-hour videos. This suggests VSR can be solved without true spatial supersensing.
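
To make "image-only" and "discards almost all temporal structure" concrete, the following is a minimal sketch of an order-agnostic frame scorer. It is not the NoSense implementation, whose details are not given here. It assumes each frame has already been embedded by an image encoder such as SigLIP, and it stands in random vectors for those embeddings. The names `cosine`, `top_k_frames`, and the choice `k=3` are hypothetical and chosen only for illustration.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_frames(frame_embeddings, query_embedding, k=3):
    """Score every frame independently and keep the k best matches.

    No timestamps or frame indices are used, so the result depends only
    on the set of frames, not on the order in which they arrive.
    """
    scored = [(cosine(e, query_embedding), tuple(e)) for e in frame_embeddings]
    scored.sort(reverse=True)  # highest similarity first
    return scored[:k]


# Toy data: random vectors stand in for per-frame image embeddings (assumption).
rng = random.Random(0)
frames = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(50)]
query = list(frames[17])  # pretend the query matches frame 17 exactly

kept = top_k_frames(frames, query)

# Shuffling the frames destroys temporal order but leaves the result unchanged.
shuffled = list(frames)
rng.shuffle(shuffled)
kept_shuffled = top_k_frames(shuffled, query)
```

Because each frame is scored in isolation, `kept` and `kept_shuffled` are identical. A benchmark on which a procedure of this kind scores highly is one where frame order carries little of the information needed to answer, which is the concern raised above.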