We introduce NoSense, a simple baseline built on an image-only (SigLIP) model that discards almost all temporal structure.
Surprisingly, it reaches 95% accuracy on VSI-Super-Recall (VSR), even on 4-hour videos.
This suggests VSR can be solved without true spatial supersensing.
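
To illustrate what discarding temporal structure can mean in practice, the following is a minimal sketch of an order-agnostic frame-scoring baseline. It is not the NoSense implementation, whose details are not given here. It assumes that per-frame image embeddings and per-candidate text embeddings have already been computed by some image-text encoder such as SigLIP; the function names (`cosine`, `score_candidates`, `pick_answer`) and the toy vectors are hypothetical. Each candidate answer is scored by its best-matching frame, so the result does not depend on the order of the frames.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def score_candidates(frame_embs, candidate_embs):
    """Score each candidate by its best-matching frame.

    Taking the max over frames treats the video as an unordered bag of
    images, so frame order has no effect on the scores.
    """
    return [max(cosine(f, c) for f in frame_embs) for c in candidate_embs]


def pick_answer(frame_embs, candidate_embs):
    """Return the index of the highest-scoring candidate."""
    scores = score_candidates(frame_embs, candidate_embs)
    return max(range(len(scores)), key=scores.__getitem__)


# Toy vectors standing in for encoder outputs (assumed, not real embeddings).
frames = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
candidates = [[0.0, 0.0, 1.0], [0.0, 1.0, 0.1]]

choice = pick_answer(frames, candidates)
shuffled_choice = pick_answer(list(reversed(frames)), candidates)
```

In this toy setup, `choice` and `shuffled_choice` agree because the max over frames is invariant to permutation. A baseline of this kind has no access to when or where events occurred relative to one another, which is why high accuracy from such a model would indicate that a benchmark does not strictly require temporal or spatial reasoning.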