The paper includes quite a bit of math on efficiently estimating Hessians for CLIP, which turned out to be harder than we initially thought because the InfoNCE loss is cross-modal and contrastive. Anton, Rui, and Martin did a heckuva job getting this done in a rigorous way!