Inlay

Microsoft Research's Lens: detailed captions beat raw training scale for image generation quality. MAI-Image-2.5 ranked 3rd on Arena.ai, behind only OpenAI's ChatGPT Images 2.0. Annotation investment can substitute for GPU hours.

Microsoft Research presents Lens, a text-to-image model with just 3.8 billion parameters that matches much larger rivals on benchmarks at a fraction of the training cost. The secret sauce: 800 million detailed image captions generated by GPT-4.1, used in place of vague web alt-text. Code and weights are o