This is also the first paper of my PhD. Huge thanks to my amazing co-authors: @thwiedemer.bsky.social, Fanfei Li, @thokle.bsky.social, @prasannamayil.bsky.social, Matthias Bethge, Felix Wichmann, Ryan Cotterell, and @wielandbrendel.bsky.social

Is the problem simply bad image generation? We provide models with ground-truth visual chains of thought (oracle intermediate states) and instruct them to use these visuals in their reasoning. Performance improves only in some tasks, and often remains at chance.

Overall, our results point to a dual failure of machine mental imagery: models struggle both to generate and to interpret visual states as actionable evidence for sequential decision-making.

Can AI reason by “imagining” — not just by seeing or reading? We introduce Mentis Oculi, a benchmark for machine mental imagery: multi-step visual puzzles that require maintaining and updating visual states over time. 📄 arxiv.org/abs/2602.02465 🌐 jana-z.github.io/mentis-oculi/ 🧵⬇️

4mo

Across all tasks, state-of-the-art multimodal models often perform at or near chance, even at relatively low difficulty. Performance degrades rapidly as soon as reasoning requires sequential visual state updates, which points to those updates, rather than long-horizon planning or complex rules, as the source of difficulty.

Same tasks, different representation: When visual states are transcribed into text, many models can solve problems they fail in the visual setting. This suggests the bottleneck is not logic, but reasoning in the visual domain itself.

Zooming in on Rush Hour, we compare reasoning paradigms ranging from text-only MLLMs to models with latent or explicit visual reasoning. None of these paradigms reliably outperform the others, indicating that making visual reasoning more explicit does not solve the problem.

What does Mentis Oculi test? A collection of visual reasoning tasks (e.g. Rush Hour, Sliding Puzzle) designed to probe whether models can mentally transform visual states across multiple steps. Each puzzle is specified by a single image, but solving it requires a visual rollout.
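To make "visual rollout" concrete, here is a minimal sketch of one for a sliding puzzle: start from a single board and apply moves one at a time, keeping every intermediate state. The grid encoding (0 for the empty cell), the convention that a move names the direction the blank travels, and the helper names `step`, `rollout`, and `to_text` are all illustrative assumptions, not the benchmark's actual format or code.

```python
from typing import List, Sequence, Tuple

# Assumed encoding: a tuple of rows, with 0 marking the empty cell.
Grid = Tuple[Tuple[int, ...], ...]

# Assumed convention: a move names the direction the blank travels.
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def find_blank(grid: Grid) -> Tuple[int, int]:
    """Return (row, col) of the empty cell."""
    for r, row in enumerate(grid):
        for c, value in enumerate(row):
            if value == 0:
                return r, c
    raise ValueError("grid has no blank cell")


def step(grid: Grid, move: str) -> Grid:
    """Apply one move and return the next board state."""
    r, c = find_blank(grid)
    dr, dc = MOVES[move]
    nr, nc = r + dr, c + dc
    if not (0 <= nr < len(grid) and 0 <= nc < len(grid[0])):
        raise ValueError(f"move {move!r} leaves the board")
    cells = [list(row) for row in grid]
    cells[r][c], cells[nr][nc] = cells[nr][nc], cells[r][c]
    return tuple(tuple(row) for row in cells)


def rollout(grid: Grid, moves: Sequence[str]) -> List[Grid]:
    """Return the initial state followed by every intermediate state."""
    states = [grid]
    for move in moves:
        states.append(step(states[-1], move))
    return states


def to_text(grid: Grid) -> str:
    """Transcribe a board state into plain text, with '_' for the blank."""
    return "\n".join(
        " ".join("_" if v == 0 else str(v) for v in row) for row in grid
    )


start: Grid = ((1, 2, 3), (4, 0, 5), (7, 8, 6))
states = rollout(start, ["right", "down"])
print(to_text(states[-1]))
```

Each entry in `states` is one of the intermediate states a solver has to track, and `to_text` is one possible way to transcribe such a state into text.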
