🚀 We're hiring! The @ellisinsttue.bsky.social leads the AI development for Germany’s new open-source nationwide Adaptive Intelligent System learning platform for schools (as part of a consortium led by Assecor & KI macht Schule, and mandated by the FWU).
👉 Apply now: forms.gle/XmLkwEDD45fY...
This work was a great collaboration; special shout-out to @jana-z.bsky.social for leading this project and submitting the first paper of her PhD!
How useful are self-generated 'mental images' (visual aids) in MLLM/UMM reasoning?
Turns out: currently not very. Visualizations have small errors that compound in multi-step problems, and models often ignore correct visual aids in their decision making.
The fact that we don't see strong benefits from even ground-truth visuals suggests that information in the visual and textual domains is somewhat misaligned, potentially because models are not trained for similar tasks.
This work is motivated by the same intuition as my work on video models last fall: Can media generation capabilities be useful beyond just generating nice visuals?
For real-world, embodied applications, being able to visualize the outcome of an action seems useful.
Whether self-generated visuals can at some point serve a function similar to mental imagery in human thought remains to be seen.
For now, MentisOculi provides a small suite of tasks to study this topic.