Zooming in on Rush Hour, we compare reasoning paradigms ranging from text-only MLLMs to models with latent or explicit visual reasoning.
None of these paradigms reliably outperforms the others, indicating that making visual reasoning more explicit does not, on its own, solve the problem.