Re-ran this eval against Opus 4.8, Gemini 3.5 Flash, and GPT 5.5. Opus 4.8 is a modest improvement over the previously tested Opus models, but Gemini 3.5 Flash is the real standout!
simonpcouch.github.io/bluffbench/
Simon P. Couch
Introducing bluffbench, a new tool to evaluate how well LLMs actually see data plots.
When we trick LLMs with secret #RStats transformations, they can miss the visual contradiction.
bluffbench helps us measure this "blind spot" in AI coding agents. Learn more: posit.co/blog/introdu...
Data science agents need to accurately read plots even when the content contradicts their expectations. Our testing shows today's LLMs still struggle here.