Inlay

Re-ran this eval against Opus 4.8, Gemini 3.5 Flash, and GPT 5.5. Opus 4.8 is a modest improvement over the previously tested Opus models, but Gemini 3.5 Flash is the real standout! simonpcouch.github.io/bluffbench/

Introducing bluffbench, a new tool to evaluate how well LLMs actually see data plots. When we secretly transform the data with #RStats before plotting, LLMs can miss that the plot contradicts what they expected. bluffbench helps us measure this "blind spot" in AI coding agents. Learn more: posit.co/blog/introdu...

Data science agents need to accurately read plots even when the content contradicts their expectations. Our testing shows today's LLMs still struggle here.