Large-scale visual pretraining is useful but NOT enough! It is not tailored to the dynamics of the environment, and it retains many low-level details that are irrelevant to planning. For example, in DINOv2 feature space, latent trajectories are curved and L2 distances do not reflect geodesic distances.
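To make the "curved trajectory" point concrete, here is a minimal sketch of one way to measure it. It compares the straight-line L2 distance between the endpoints of a trajectory with the length of the path actually traveled. Using path length along the trajectory as a stand-in for geodesic distance is an assumption here, and the two trajectories are synthetic 2-D toys, not real DINOv2 features. The helper names `path_length` and `straightness` are hypothetical.

```python
import numpy as np

def path_length(z):
    """Sum of L2 distances between consecutive latent states; z has shape (T, D)."""
    return float(np.linalg.norm(np.diff(z, axis=0), axis=1).sum())

def straightness(z):
    """Endpoint L2 distance divided by path length; 1.0 means a straight line."""
    chord = float(np.linalg.norm(z[-1] - z[0]))
    return chord / path_length(z)

# Synthetic stand-ins for latent trajectories (assumption: NOT real DINOv2 features).
t = np.linspace(0.0, np.pi, 200)
straight = np.stack([t, np.zeros_like(t)], axis=1)   # line segment of length pi
curved = np.stack([np.cos(t), np.sin(t)], axis=1)    # half circle of radius 1

print("straight:", straightness(straight))
print("curved:  ", straightness(curved))
```

On the straight segment the ratio is 1, so L2 distance matches the distance traveled. On the half circle it drops to roughly 2/π, so the endpoint L2 distance underestimates how far apart the two states are along the trajectory. The same ratio could be computed on encoder features of real rollouts to check how curved a given latent space is.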