This is also the first paper of my PhD. Huge thanks to my amazing co-authors: @thwiedemer.bsky.social, Fanfei Li, @thokle.bsky.social, @prasannamayil.bsky.social, Matthias Bethge, Felix Wichmann, Ryan Cotterell, and @wielandbrendel.bsky.social
Is the problem simply bad image generation? We provide models with ground-truth visual chains of thought (oracle intermediate states) and instruct them to use these visuals in their reasoning. Performance improves only in some tasks, and often remains at chance.
Overall, our results point to a dual failure of machine mental imagery: models struggle both to generate and to interpret visual states as actionable evidence for sequential decision-making.
Can AI reason by “imagining” — not just by seeing or reading? We introduce Mentis Oculi, a benchmark for machine mental imagery: multi-step visual puzzles that require maintaining and updating visual states over time. 📄 arxiv.org/abs/2602.02465 🌐 jana-z.github.io/mentis-oculi/ 🧵⬇️
Across all tasks, state-of-the-art multimodal models often perform at or near chance, even at relatively low difficulty. Performance degrades rapidly as soon as reasoning requires sequential visual state updates, even without long-horizon planning or complex rules.
Same tasks, different representation: When visual states are transcribed into text, many models can solve problems they fail in the visual setting. This suggests the bottleneck is not logic, but reasoning in the visual domain itself.
Zooming in on Rush Hour, we compare reasoning paradigms ranging from text-only MLLMs to models with latent or explicit visual reasoning. None of these paradigms reliably outperform the others, indicating that making visual reasoning more explicit does not solve the problem.
What does Mentis Oculi test? A collection of visual reasoning tasks (e.g. Rush Hour, Sliding Puzzle) designed to probe whether models can mentally transform visual states across multiple steps. Each puzzle is specified by a single image, but solving it requires a visual rollout.
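To make "rollout" concrete, here is a minimal sketch of a Rush Hour-style state update loop on a text grid. Everything in it is an illustrative assumption: the 4x4 board, the `move` and `rollout` helpers, and the letter encoding are hypothetical and are not the benchmark's actual format or code. The point is only that each move depends on the state produced by the previous one.

```python
from typing import List, Tuple

Board = List[str]  # hypothetical encoding: '.' is empty, each letter is one car


def move(board: Board, car: str, delta: int) -> Board:
    """Slide `car` by `delta` cells along its own axis; raise ValueError if blocked."""
    grid = [list(row) for row in board]
    cells = [(r, c) for r, row in enumerate(grid)
             for c, ch in enumerate(row) if ch == car]
    if not cells:
        raise ValueError(f"car {car!r} is not on the board")
    # A car lying within a single row slides horizontally, otherwise vertically.
    horizontal = len({r for r, _ in cells}) == 1
    step = 1 if delta > 0 else -1
    dr, dc = (0, step) if horizontal else (step, 0)
    for _ in range(abs(delta)):
        targets = [(r + dr, c + dc) for r, c in cells]
        for r, c in targets:
            if not (0 <= r < len(grid) and 0 <= c < len(grid[0])):
                raise ValueError("move leaves the board")
            if grid[r][c] not in (".", car):
                raise ValueError("move is blocked")
        for r, c in cells:
            grid[r][c] = "."
        for r, c in targets:
            grid[r][c] = car
        cells = targets
    return ["".join(row) for row in grid]


def rollout(board: Board, moves: List[Tuple[str, int]]) -> List[Board]:
    """Apply moves in order and keep every intermediate state."""
    states = [board]
    for car, delta in moves:
        board = move(board, car, delta)
        states.append(board)
    return states


start = [
    "..B.",
    "AAB.",
    "....",
    "....",
]

# Car A cannot reach the right edge until B has moved out of the way.
states = rollout(start, [("B", 2), ("A", 2)])
for state in states:
    print("\n".join(state), end="\n\n")
```

In this toy case, moving A first raises an error because B blocks it, so the second step is only valid given the updated board. A model solving the puzzle from a single image has to track that intermediate state itself.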