Re-ran this eval against Opus 4.8, Gemini 3.5 Flash, and GPT 5.5. Opus 4.8 is a modest improvement over the previously tested Opus models, but Gemini 3.5 Flash is the real stand-out! simonpcouch.github.io/bluffbench/
14d
Simon P. Couch
6mo
Introducing bluffbench, a new tool to evaluate how well LLMs actually see data plots. When we trick LLMs with secret #RStats transformations, they can miss the visual contradiction. bluffbench helps us measure this "blind spot" in AI coding agents. Learn more: posit.co/blog/introdu...
posit.co
Data science agents need to accurately read plots even when the content contradicts their expectations. Our testing shows today's LLMs still struggle here.
When plotting, LLMs see what they expect to see - Posit