Post
Good work from @hayoungjung.bsky.social and @manoelhortaribeiro.bsky.social. Scientific AI agents are actively being deployed to synthesize clinical conclusions, but their factual accuracy remains remarkably low. #MedSky 🔗 Direct link: arxiv.org/pdf/2606.11337
2d
arxiv.org
Scott McGrath
First paper of my PhD with my amazing advisors! There’s been a ton of hype and media coverage on OpenEvidence as an “AI co-pilot for clinicians”… and our long-horizon benchmark puts them to the test!! Our results suggest they are far from reliable for downstream use.
2d
New preprint! We introduce a new benchmark, SciConBench, with 9.11k scientific questions derived from Cochrane Systematic Reviews. We find evidence that frontier AI agents **cannot** synthesize scientific conclusions well. A thread 🧵 w/ @hayoungjung.bsky.social & others!
2d
Manoel Horta Ribeiro
Hayoung Jung