Same tasks, different representation:
When visual states are transcribed into text, many models can solve problems they fail in the visual setting.
This suggests the bottleneck is not logic, but reasoning in the visual domain itself.
Overall, our results point to a dual failure of machine mental imagery:
models struggle both to generate and to interpret visual states as actionable evidence for sequential decision-making.