Is the problem simply bad image generation?
We provide models with ground-truth visual chains of thought (oracle intermediate states) and instruct them to use these visuals in their reasoning.
Performance improves on only some tasks and often remains at chance level.
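One way to make "remains at chance" concrete is a one-sided exact binomial test of a model's accuracy against the chance rate of the task. The sketch below is illustrative only and is not the evaluation procedure described here: the function names, the 0.05 threshold, and the assumption that each task is multiple-choice with a known chance rate are all hypothetical.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials with success rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def p_value_above_chance(correct: int, total: int, chance: float) -> float:
    """One-sided exact binomial p-value: P(X >= correct) if the model were guessing."""
    return sum(binom_pmf(k, total, chance) for k in range(correct, total + 1))


def at_chance(correct: int, total: int, chance: float, alpha: float = 0.05) -> bool:
    """True when accuracy is not significantly above chance (alpha is an assumed threshold)."""
    return p_value_above_chance(correct, total, chance) >= alpha


# Hypothetical example: a 4-way multiple-choice task (chance = 0.25) with 100 items.
print(at_chance(27, 100, 0.25))  # 27% accuracy
print(at_chance(60, 100, 0.25))  # 60% accuracy
```

Under these assumptions, a task would count as "at chance" whenever the oracle-visual accuracy cannot be distinguished from guessing at the chosen significance level.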