Can we extend the power of world models beyond just online model-based learning? Absolutely!
We believe the true potential of world models lies in enabling agents to reason at test time.
Introducing DINO-WM: World Models on Pre-trained Visual Features for Zero-shot Planning.
Overall, DINO-WM takes a step toward bridging the gap between task-agnostic world modeling and downstream reasoning and control, offering promising prospects for generic world models in real-world applications.
Huge thanks to all my collaborators who made this project possible: @hengkaipan.bsky.social, @yann-lecun.bsky.social, @lerrelpinto.com
We have open-sourced our code and data. For more details, check out the paper and website:
Website: dino-wm.github.io
arXiv: arxiv.org/abs/2411.04983
DINO-WM consists of:
1️⃣ An out-of-the-box DINOv2 model as the observation model.
2️⃣ A causal ViT as the predictor.
3️⃣ An optional decoder, used only for visualization.
DINO-WM plans entirely in latent space, without the need to reconstruct pixel images.
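To make the three-part layout above concrete, here is a minimal sketch of how an encoder, a causal predictor, and an optional decoder could fit together. Everything in it is an illustrative assumption rather than the released code: the `LatentWorldModel` class, its field names, and the toy `toy_encode` / `toy_predict` functions are hypothetical stand-ins for the frozen DINOv2 features and the learned ViT predictor.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence

Latent = List[float]
Action = List[float]


@dataclass
class LatentWorldModel:
    """Hypothetical container mirroring the three components listed above."""

    encode: Callable[[Sequence[float]], Latent]        # stand-in for frozen pre-trained features
    predict: Callable[[List[Latent], Action], Latent]  # stand-in for the causal predictor
    decode: Optional[Callable[[Latent], Sequence[float]]] = None  # visualization only

    def rollout(self, observation: Sequence[float], actions: List[Action]) -> List[Latent]:
        """Encode the observation once, then predict forward purely in latent space."""
        history = [self.encode(observation)]
        for action in actions:
            # Causal: each step sees only past latents plus the current action.
            history.append(self.predict(history, action))
        return history


# Toy stand-ins so the sketch runs; a real system would plug in learned networks.
def toy_encode(obs: Sequence[float]) -> Latent:
    return [float(x) for x in obs]


def toy_predict(history: List[Latent], action: Action) -> Latent:
    return [z + a for z, a in zip(history[-1], action)]


# No decoder is supplied: the rollout never reconstructs pixels.
model = LatentWorldModel(encode=toy_encode, predict=toy_predict)
trajectory = model.rollout([0.0, 0.0], [[1.0, 0.0], [0.0, 2.0]])
```

The point of the sketch is that `decode` can stay `None` and the rollout still works, which is what planning without pixel reconstruction amounts to.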
Unlike previous works that couple world model learning with behavior learning, we train a dynamics-only model and infer actions only at test time. This allows zero-shot goal-reaching by reasoning through the dynamics: no expert demonstrations, no rewards, no online interactions.
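As a rough illustration of inferring actions at test time, the sketch below searches for an action sequence whose predicted final latent lands close to a goal latent. It makes several assumptions the thread does not state: the optimizer is a cross-entropy-method (CEM) search, the cost is squared distance in latent space, and `predict` is a toy additive dynamics function standing in for a learned model. All function names and hyperparameters are hypothetical.

```python
import random
import statistics


def predict(latent, action):
    """Toy dynamics stand-in: the latent simply moves by the action."""
    return [z + a for z, a in zip(latent, action)]


def rollout_cost(start, goal, actions):
    """Squared distance between the final predicted latent and the goal latent."""
    latent = start
    for action in actions:
        latent = predict(latent, action)
    return sum((z - g) ** 2 for z, g in zip(latent, goal))


def plan_cem(start, goal, horizon=3, samples=200, elites=20, iterations=30, seed=0):
    """Assumed CEM planner: sample action sequences, keep the best, refit, repeat."""
    rng = random.Random(seed)
    dim = len(start)
    mean = [[0.0] * dim for _ in range(horizon)]
    std = [[1.0] * dim for _ in range(horizon)]
    for _ in range(iterations):
        candidates = [
            [[rng.gauss(mean[t][d], std[t][d]) for d in range(dim)] for t in range(horizon)]
            for _ in range(samples)
        ]
        # Rank candidates by how close the imagined rollout ends to the goal.
        candidates.sort(key=lambda seq: rollout_cost(start, goal, seq))
        best = candidates[:elites]
        # Refit an independent Gaussian per time step and action dimension.
        for t in range(horizon):
            for d in range(dim):
                values = [seq[t][d] for seq in best]
                mean[t][d] = statistics.fmean(values)
                std[t][d] = statistics.pstdev(values) + 1e-6
    return mean


start_latent, goal_latent = [0.0, 0.0], [2.0, -1.0]
plan = plan_cem(start_latent, goal_latent)
final_cost = rollout_cost(start_latent, goal_latent, plan)
```

Nothing here uses demonstrations, rewards, or environment interaction during planning: the only signals are the dynamics model and a goal, which is the sense in which this kind of goal-reaching is zero-shot.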
The object and spatial understanding priors of DINOv2 features enable robust scene understanding, essential for navigation and manipulation tasks. With this prior, DINO-WM outperforms state-of-the-art world models by 45% in downstream task performance on our hardest tasks.