We have a model generate code, run that code, and evaluate the result according to task type:
Processing tasks: we compare the values of key variables produced by the generated code with their ground-truth values.
Visualizations: we use a vision-language model (VLM) judge, whose ratings correlate well with those of professional astronomers, to compare a visualization’s scientific utility with that of the ground truth.
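
For processing tasks, the key-variable comparison could look roughly like the sketch below. This is a minimal illustration, not the actual evaluation harness: the function names (`values_match`, `compare_key_variables`), the tolerances, and the example variable names are all assumptions, and the real check may handle arrays, tables, or units differently.

```python
import math
from typing import Any, Dict, Iterable, Mapping


def values_match(got: Any, expected: Any,
                 rel_tol: float = 1e-6, abs_tol: float = 1e-9) -> bool:
    """Hypothetical helper: numbers agree within a tolerance, sequences
    agree element by element, everything else must be exactly equal."""
    # bool is a subclass of int in Python, so compare booleans exactly.
    if isinstance(got, bool) or isinstance(expected, bool):
        return got == expected
    if isinstance(got, (int, float)) and isinstance(expected, (int, float)):
        return math.isclose(got, expected, rel_tol=rel_tol, abs_tol=abs_tol)
    if isinstance(got, (list, tuple)) and isinstance(expected, (list, tuple)):
        return len(got) == len(expected) and all(
            values_match(g, e, rel_tol, abs_tol) for g, e in zip(got, expected)
        )
    return got == expected


def compare_key_variables(generated: Mapping[str, Any],
                          reference: Mapping[str, Any],
                          keys: Iterable[str]) -> Dict[str, bool]:
    """Report, per key variable, whether the generated value matches the
    reference value. A variable missing from either side is a mismatch."""
    return {
        k: k in generated and k in reference
        and values_match(generated[k], reference[k])
        for k in keys
    }


# Illustrative values only; these variable names are made up.
generated = {"n_sources": 42, "mean_flux": 1.2300000001, "bands": ["g", "r"]}
reference = {"n_sources": 42, "mean_flux": 1.23, "bands": ["g", "r", "i"]}
result = compare_key_variables(generated, reference,
                               ["n_sources", "mean_flux", "bands"])
print(result)
```

Using a tolerance rather than exact equality avoids penalizing generated code for harmless floating-point differences, while a missing or structurally different variable still counts as a failure.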