Inlay

We generate code from a model, run it, and evaluate the result according to task type. For processing tasks, we compare the values of key variables against the ground truth. For visualizations, we use a VLM judge, whose ratings correlate well with those of professional astronomers, to compare the scientific utility of the generated visualization to that of the ground truth.
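
The processing-task check could be sketched roughly as below. This is only an illustration under stated assumptions, not the actual evaluation harness: the function names (`run_and_collect`, `values_match`, `score_processing_task`), the relative tolerance, and the choice to score a task as the fraction of matching key variables are all hypothetical. It also runs the generated code with a bare `exec`, whereas a real harness would presumably sandbox it.

```python
import math
from numbers import Real

# Sentinel marking a key variable the generated code never defined (hypothetical helper).
_MISSING = object()


def run_and_collect(code, keys):
    """Execute generated code in a fresh namespace and return the requested variables."""
    namespace = {}
    exec(code, namespace)  # assumption: no sandboxing in this sketch
    return {k: namespace.get(k, _MISSING) for k in keys}


def values_match(got, expected, rel_tol=1e-6):
    """Compare one key variable to its ground-truth value, with a float tolerance."""
    if got is _MISSING:
        return False
    # Booleans are compared exactly, not as numbers.
    if isinstance(got, bool) or isinstance(expected, bool):
        return got == expected
    if isinstance(got, Real) and isinstance(expected, Real):
        return math.isclose(got, expected, rel_tol=rel_tol)
    # Sequences are compared element by element.
    if isinstance(got, (list, tuple)) and isinstance(expected, (list, tuple)):
        return len(got) == len(expected) and all(
            values_match(g, e, rel_tol) for g, e in zip(got, expected)
        )
    return got == expected


def score_processing_task(code, ground_truth, rel_tol=1e-6):
    """Return the fraction of key variables that match the ground truth (0.0 if the code fails)."""
    if not ground_truth:
        raise ValueError("ground_truth must contain at least one key variable")
    try:
        got = run_and_collect(code, ground_truth.keys())
    except Exception:
        return 0.0  # code that does not run gets no credit in this sketch
    matches = [values_match(got[k], v, rel_tol) for k, v in ground_truth.items()]
    return sum(matches) / len(matches)


# Hypothetical generated code and ground-truth values for a toy processing task.
generated = "flux = [1.0, 2.0, 3.0]\nmean_flux = sum(flux) / len(flux)\nn = len(flux)"
score = score_processing_task(generated, {"mean_flux": 2.0, "n": 3})
```

Scoring by fraction of matching variables is one possible design; an all-or-nothing pass/fail per task would be an equally plausible reading of "compare key variable values".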