Across all tasks, state-of-the-art multimodal models often perform at or near chance, even at relatively low difficulty.
Performance degrades rapidly as soon as reasoning requires sequential visual state updates, rather than long-horizon planning or complex rules.