Dabhi and Gill et al., "2D-LFM: Lifting Foundation Model without 3D Supervision"
A plain transformer fails by construction at lifting 2D landmarks to 3D: the architecture is permutation-equivariant, so without positional information it treats the landmarks as an unordered set. Fix: inject positional encoding at multiple layers.
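
A minimal NumPy sketch of the equivariance point, not the paper's model: a single-head self-attention layer with random weights. The sizes, the random "landmark" embeddings, and the per-slot encoding `pos` are all hypothetical stand-ins. It checks only one layer and does not reproduce the multi-layer injection.

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """Single-head self-attention over tokens x of shape (n, d)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
n, d = 5, 8                        # assumed sizes: 5 landmarks, 8-dim tokens
x = rng.normal(size=(n, d))        # stand-in for embedded 2D landmarks
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))
pos = rng.normal(size=(n, d))      # stand-in for a per-slot positional encoding
perm = np.array([2, 0, 4, 1, 3])   # a fixed, non-identity reordering

# Without positional encoding, permuting the input only permutes the output.
out = self_attention(x, wq, wk, wv)
out_perm = self_attention(x[perm], wq, wk, wv)
equivariant_without_pe = bool(np.allclose(out_perm, out[perm]))

# With an encoding tied to the slot index, the same reordering changes the result.
out_pe = self_attention(x + pos, wq, wk, wv)
out_pe_perm = self_attention(x[perm] + pos, wq, wk, wv)
equivariant_with_pe = bool(np.allclose(out_pe_perm, out_pe[perm]))

print(equivariant_without_pe, equivariant_with_pe)
```

The first flag shows that the bare layer cannot distinguish landmark orderings; the second shows that adding a slot-dependent encoding breaks that symmetry.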