First paper of my PhD with my amazing advisors!
There’s been a ton of hype and media coverage on OpenEvidence as an “AI co-pilot for clinicians”… and our long-horizon benchmark puts it to the test!! Our results suggest it is far from reliable for downstream use.
Hayoung Jung
New preprint!
We introduce a new benchmark, SciConBench, with 9.11k scientific questions derived from Cochrane Systematic Reviews.
We find evidence that frontier AI agents **cannot** synthesize scientific conclusions well.
A thread 🧵
w/ @hayoungjung.bsky.social & others!