@jeankaddour.bsky.social Nice work! You mention in the paper that TPO is orthogonal to approaches such as MaxRL. Have you tried both together by chance?
I can also recommend @openstreetmap.bsky.social for that. At least in Germany, but it should work everywhere.
I agree with your points. But I would have assumed that the inherent noise in our data would easily allow one to show that not even coarse alignment holds. The different abstraction levels of image and text in something like WIT could help (fewer degrees of freedom) or completely destroy alignment (noise).
Unfortunately, Claude did not, in fact, learn a lesson.
To me, it is a pleasant surprise that the coarse alignment is still there. I wonder how much stronger image descriptions could improve this.
Congrats! Any information on how the model was trained? Distilled?