We generate code from a model, run it, and evaluate the result in two ways. Processing tasks: we compare key variable values. Visualizations: we use a VLM judge (well correlated w/ pro astronomers) that compares a visualization’s scientific utility to that of the ground truth.
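
A minimal sketch of what the processing-task check could look like, not the benchmark's actual harness. It assumes the generated code is a Python snippet, that "key variables" are top-level names in its namespace, and that numeric values are compared with a relative tolerance. The helper names `run_and_capture` and `compare_key_variables` and the `rel_tol` default are hypothetical.

```python
import math
from numbers import Real


def run_and_capture(source, names):
    """Execute generated code and return the requested variables.

    Returns None if the code raises, which this sketch counts as a crash.
    Only run untrusted code like this inside a sandbox.
    """
    namespace = {}
    try:
        exec(source, namespace)
    except Exception:
        return None
    return {name: namespace[name] for name in names if name in namespace}


def _values_match(actual, expected, rel_tol):
    # Booleans are Real in Python, so compare them exactly first.
    if isinstance(actual, bool) or isinstance(expected, bool):
        return actual == expected
    # Real numbers: tolerate small floating-point differences.
    if isinstance(actual, Real) and isinstance(expected, Real):
        return math.isclose(actual, expected, rel_tol=rel_tol)
    # Sequences: same length and element-wise match.
    if isinstance(actual, (list, tuple)) and isinstance(expected, (list, tuple)):
        return len(actual) == len(expected) and all(
            _values_match(a, e, rel_tol) for a, e in zip(actual, expected)
        )
    # Anything else falls back to plain equality.
    return actual == expected


def compare_key_variables(generated, reference, rel_tol=1e-6):
    """Map each reference variable name to True if the generated value matches."""
    return {
        name: name in generated and _values_match(generated[name], expected, rel_tol)
        for name, expected in reference.items()
    }


# Hypothetical example: a model-written snippet and ground-truth values.
generated_source = "flux = [1.0, 2.0, 4.0]\nmean_flux = sum(flux) / len(flux)\n"
reference_values = {"flux": [1.0, 2.0, 4.0], "mean_flux": 7.0 / 3.0}

captured = run_and_capture(generated_source, reference_values)
report = compare_key_variables(captured, reference_values)
```

Here `report` marks each key variable as matching or not, and a `None` from `run_and_capture` separates crashes from wrong answers. Real astronomy workflows would likely need array-aware comparison (e.g. for NumPy arrays), which this sketch leaves out.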
How good are LLMs at 🔭 scientific computing and visualization 🔭? AstroVisBench tests how well LLMs implement scientific workflows in astronomy and visualize results. SOTA models like Gemini 2.5 Pro & Claude Opus 4 match ground truth scientific utility only 16% of the time. 🧵
We think this dataset is a great target for AI for science efforts. It zeroes in on an important part of the scientific workflow that is achievable near term, and it aims to produce tools used by astronomers rather than to replace them or automate all of science.
My amazing co-authors: Syed Murtaza Husain, Stella Offner, @stephajuneau.bsky.social, Paul Torrey, Adam Bolton, Juan Frias, @niall2.bsky.social, @gregdnlp.bsky.social, and @jessyjli.bsky.social. Full support from @nsfsimonscosmicai.bsky.social. 🌐: astrovisbench.github.io 📄: arxiv.org/abs/2505.20538
Even the best LLMs struggle to execute scientific workflows. SOTA models including Gemini 2.5 Pro, Claude Opus 4, o3-mini and QwQ crash 30-60% of the time and produce error-free visualizations in fewer than 16% of cases.