Inlay

Good work from @hayoungjung.bsky.social and @manoelhortaribeiro.bsky.social. Scientific AI agents are actively being deployed to synthesize clinical conclusions, but their factual accuracy remains remarkably low. #MedSky 🔗 Direct link: arxiv.org/pdf/2606.11337

First paper of my PhD with my amazing advisors! There’s been a ton of hype and media coverage on OpenEvidence as an “AI co-pilot for clinicians”… and our long-horizon benchmark puts them to the test!! Our results suggest they are far from reliable for downstream use.

New preprint! We introduce a new benchmark, SciConBench, with 9.11k scientific questions derived from Cochrane Systematic Reviews. We find evidence that frontier AI agents **cannot** synthesize scientific conclusions well. A thread 🧵 w/ @hayoungjung.bsky.social & others!