Introducing bluffbench, a new tool to evaluate how well LLMs actually see data plots.
When we secretly transform the data behind a plot in #RStats, LLMs can miss that what the plot shows contradicts what they expected.
bluffbench helps us measure this "blind spot" in AI coding agents. Learn more: posit.co/blog/introdu...
Data science agents need to accurately read plots even when the content contradicts their expectations. Our testing shows today's LLMs still struggle here.