Multimodal LLMs can read text in images, so why do they often perform worse than when the same text is given as tokens? Our work studies this modality gap in models that perceive text as pixels and shows how to close it.
arxiv.org/abs/2603.09095
🧵 #NLProc #LLM #ComputerVision