[3/4] Do VLMs actually ground in the figure?
Fine-tuning Qwen3.5-9B on MQUD makes generated questions more grounded in the figure and more specific to the paper's scientific content.
Yating Wu
What does a scientific figure make you wonder?
We introduce MQUD: multimodal Questions Under Discussion for scientific figures.
With 1,250 author-annotated questions over 245 figures from 56 papers, MQUD asks what scientific question a figure raises in context.
I made a starter pack of ML/AI people at @utaustin.bsky.social. Please distribute, and feel free to self-nominate!
go.bsky.app/QLQznZg
Check out our paper for more results and analysis!
arxiv.org/abs/2504.09373
github.com/AlliteraryAl...
This was a fun collaboration with @yatingwu.bsky.social @asher-zheng.bsky.social @manyawadhwa.bsky.social @gregdnlp.bsky.social @jessyjli.bsky.social
As large language models become increasingly capable at various writing tasks, their weakness at generating unique and creative content becomes a major liability. Although LLMs have the ability to gen...