Inlay

Zooming in on Rush Hour, we compare reasoning paradigms ranging from text-only MLLMs to models with latent or explicit visual reasoning. None of these paradigms reliably outperforms the others, indicating that making visual reasoning more explicit does not, on its own, solve the problem.