
DINO-WM consists of three components: 1️⃣ An out-of-the-box DINOv2 model as the observation model. 2️⃣ A causal ViT as the predictor. 3️⃣ An optional decoder, used only for visualization. DINO-WM plans entirely in latent space, so it never needs to reconstruct pixel images.
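
To make "planning entirely in latent space" concrete, here is a toy sketch in plain Python. It is not the DINO-WM implementation: `encode` is a hand-written stand-in for the frozen DINOv2 observation model, `predict` is a stand-in for the causal ViT predictor, and the random-shooting planner in `plan` is an assumed, deliberately simple optimizer chosen for illustration. All function names and parameters are hypothetical. What it does show is that no decoder is called anywhere: candidate action sequences are rolled out and scored using latents only.

```python
import random


def encode(obs):
    """Stand-in for the frozen observation model (DINOv2 in DINO-WM).

    Assumption: a fixed linear map from a 2-D observation to a 2-D latent.
    """
    x, y = obs
    return (2.0 * x, 2.0 * y)


def predict(z, action):
    """Stand-in for the learned predictor (a causal ViT in DINO-WM).

    Assumption: toy dynamics in which the action shifts the latent directly.
    """
    return (z[0] + 2.0 * action[0], z[1] + 2.0 * action[1])


def latent_cost(z, z_goal):
    """Squared distance between two latents; pixels are never involved."""
    return (z[0] - z_goal[0]) ** 2 + (z[1] - z_goal[1]) ** 2


def plan(obs, goal_obs, horizon=5, n_samples=500, seed=0):
    """Random-shooting planner (an illustrative choice, not taken from the source).

    Samples action sequences, rolls each one out with `predict`, and keeps
    the sequence whose final latent is closest to the goal latent.
    """
    rng = random.Random(seed)
    z0, z_goal = encode(obs), encode(goal_obs)
    best_actions, best_cost = None, float("inf")
    for _ in range(n_samples):
        actions = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(horizon)]
        z = z0
        for a in actions:  # rollout happens purely in latent space
            z = predict(z, a)
        cost = latent_cost(z, z_goal)
        if cost < best_cost:
            best_actions, best_cost = actions, cost
    return best_actions, best_cost


if __name__ == "__main__":
    actions, cost = plan(obs=(0.0, 0.0), goal_obs=(1.0, 1.0))
    print("best latent cost:", cost)
```

In this sketch a decoder would only be needed to render the predicted latents as images for inspection, which matches its optional role in the description above.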