//
sign in
Post
by @danabra.mov
PostEmbed
by @danabra.mov
Record
by @jimpick.com
Record
by @atsui.org
+ new component
Post
First paper of my PhD with my amazing advisors! There’s been a ton of hype and media coverage on OpenEvidence as an “AI co-pilot for clinicians”… and our long-horizon benchmark puts them to the test!! Our results suggest they are far from reliable for downstream use.
5d
Hayoung Jung
New preprint! We introduce a new benchmark, SciConBench, with 9.11k scientific questions derived from Cochrane Systematic Reviews. We find evidence that frontier AI agents **cannot** synthesize scientific conclusions well. A thread 🧵 w/ @hayoungjung.bsky.social & others!
5d
Manoel Horta Ribeiro