Inlay

Is the problem simply bad image generation? We provide models with ground-truth visual chains of thought (oracle intermediate states) and instruct them to use these visuals in their reasoning. Performance improves on only some tasks and often remains at chance level.