Good work from @hayoungjung.bsky.social and @manoelhortaribeiro.bsky.social
Scientific AI agents are actively being deployed to synthesize clinical conclusions, but their factual accuracy remains remarkably low.
#MedSky
🔗 Direct link: arxiv.org/pdf/2606.11337
Scott McGrath
First paper of my PhD with my amazing advisors!
There’s been a ton of hype and media coverage on OpenEvidence as an “AI co-pilot for clinicians”… and our long-horizon benchmark puts them to the test!! Our results suggest they are far from reliable for downstream use.
New preprint!
We introduce a new benchmark, SciConBench, with 9.11k scientific questions derived from Cochrane Systematic Reviews.
We find evidence that frontier AI agents **cannot** synthesize scientific conclusions well.
A thread 🧵
w/ @hayoungjung.bsky.social & others!