Zang et al., "World Tracing: Generative Pixel-Aligned Geometry Beyond the Visible"
A Diffusion Transformer that estimates multiple layers of depth, so it recovers occluded parts of the scene as well as the visible ones.
metricscenes.github.io
haoz19.github.io/world-tracin...
zlab-princeton.github.io/i1/
Zeng et al., "i1: A Simple and Fully Open Recipe for Strong Text-to-Image Models"
A fully reproducible recipe, with code, weights, and everything else, for a truly open text-to-image model. A LOT of interesting findings.
2dlfm.github.io
Dabhi and Gill et al., "2D-LFM: Lifting Foundation Model without 3D Supervision"
Simply using transformers to lift 2D landmarks to 3D fails by construction, because the architecture is permutation equivariant. The fix is to inject positional encoding at multiple layers.
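To make the permutation-equivariance point concrete, here is a minimal sketch in plain Python. It is not the 2D-LFM architecture: the weights, the five toy landmarks, and helper names such as `self_attention` and `add_pos` are all hypothetical. It adds the positional term only at the input, whereas the paper is described as injecting it at multiple layers. It only checks that shuffling the input tokens shuffles the output the same way, and that a per-slot positional term breaks that symmetry.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens, wq, wk, wv):
    """Single-head self-attention with no positional information."""
    q, k, v = matmul(tokens, wq), matmul(tokens, wk), matmul(tokens, wv)
    scale = math.sqrt(len(wq[0]))
    scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k] for qi in q]
    return matmul([softmax(r) for r in scores], v)


def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Toy setup (all hypothetical): 5 landmarks in 2D, a 4-dim attention space, 3D outputs.
wq, wk, wv = rand_matrix(2, 4), rand_matrix(2, 4), rand_matrix(2, 3)
landmarks = rand_matrix(5, 2)
perm = [3, 0, 4, 1, 2]
shuffled = [landmarks[i] for i in perm]

# Without positional encoding: shuffling the inputs just shuffles the outputs.
out = self_attention(landmarks, wq, wk, wv)
out_shuffled = self_attention(shuffled, wq, wk, wv)
equiv_gap = max_abs_diff(out_shuffled, [out[i] for i in perm])

# With a per-slot positional encoding added to the input, the symmetry breaks.
pos = rand_matrix(5, 2)


def add_pos(tokens):
    return [[t + p for t, p in zip(tok, enc)] for tok, enc in zip(tokens, pos)]


out_pe = self_attention(add_pos(landmarks), wq, wk, wv)
out_pe_shuffled = self_attention(add_pos(shuffled), wq, wk, wv)
pe_gap = max_abs_diff(out_pe_shuffled, [out_pe[i] for i in perm])

print(f"gap without positional encoding: {equiv_gap:.2e}")
print(f"gap with positional encoding:    {pe_gap:.2e}")
```

The first gap should sit at floating-point noise and the second should not. That is the sense in which a plain attention layer cannot tell which landmark is which unless something positional is injected.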
Xiangli et al., "Honey, I Shrunk the Arc de Triomphe!"
Metric depth estimators aren't actually metric. With curated, scaled data, they can be adapted to do better.
arxiv.org/abs/2606.05328
Esmati and Nath et al., "The Invisible Hand of Physics: When Video Diffusion Models Know More Than They Show"
You can use inversion to retrieve feature representations for a video, and these can be linearly decoded into physical plausibility, provided you use enough inversion steps rather than shortcuts.
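"Linearly decoded" here means a linear probe: a single linear layer trained on frozen features. The sketch below illustrates the idea on synthetic data only. The feature vectors stand in for whatever inversion would return, the labels stand in for "physically plausible or not", and `DIM`, `make_example`, and `train_probe` are hypothetical names, not anything from the paper.

```python
import math
import random

rng = random.Random(0)
DIM = 8  # hypothetical feature dimension

# Hidden direction that determines the synthetic "plausible" label.
true_w = [rng.gauss(0.0, 1.0) for _ in range(DIM)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def make_example():
    """Draw a synthetic feature vector and label, skipping points near the boundary."""
    while True:
        x = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
        score = dot(true_w, x)
        if abs(score) > 0.3:
            return x, 1.0 if score > 0 else 0.0


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(data, lr=0.5, epochs=300):
    """Fit a logistic-regression probe (no bias term) with full-batch gradient descent."""
    w = [0.0] * DIM
    for _ in range(epochs):
        grad = [0.0] * DIM
        for x, y in data:
            err = sigmoid(dot(w, x)) - y
            for j in range(DIM):
                grad[j] += err * x[j]
        w = [wj - lr * gj / len(data) for wj, gj in zip(w, grad)]
    return w


def accuracy(w, data):
    hits = sum((dot(w, x) > 0) == (y == 1.0) for x, y in data)
    return hits / len(data)


train_set = [make_example() for _ in range(300)]
test_set = [make_example() for _ in range(100)]
probe = train_probe(train_set)
test_acc = accuracy(probe, test_set)
print(f"held-out probe accuracy: {test_acc:.2f}")
```

High held-out accuracy from a probe this simple is what suggests the information is already linearly present in the features. On real features, whether that holds is exactly the empirical question.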
Modern video diffusion models generate increasingly realistic and temporally coherent videos, motivating their use as candidate world simulators. Yet it remains unclear whether these models internally...