Post
Multimodal LLMs can read text in images, but why do they often perform worse than when the same text is given as tokens? Our work studies the modality gap of models perceiving text as pixels and shows how to close it. 📄 arxiv.org/abs/2603.09095 🧵👇 #NLProc #LLM #ComputerVision
3mo
Kaiser Sun