Esmati and Nath et al., "The Invisible Hand of Physics: When Video Diffusion Models Know More Than They Show"
You can use inversion to retrieve feature representations for a video, and these can be linearly decoded into physical plausibility, provided you run enough inversion steps rather than taking shortcuts.
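
To make the "linearly decoded" part concrete, here is a minimal sketch of a linear probe: logistic regression trained on per-video feature vectors to predict a binary plausible/implausible label. Everything in it is an assumption for illustration and not the paper's code. The function names are hypothetical, and the features are synthetic stand-ins for what inversion would actually return.

```python
import math
import random


def make_synthetic_features(n, dim, seed):
    """Stand-in for inverted video features (hypothetical, NOT real model output).

    Each class is shifted along one fixed direction, so the label is
    linearly decodable by construction.
    """
    rng = random.Random(seed)
    direction = [1.0 if j % 2 == 0 else -1.0 for j in range(dim)]
    features, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)  # 1 = physically plausible, 0 = implausible
        shift = 1.0 if y == 1 else -1.0
        features.append([rng.gauss(0.0, 1.0) + shift * d for d in direction])
        labels.append(y)
    return features, labels


def train_linear_probe(features, labels, lr=0.1, epochs=200):
    """Fit logistic regression (a linear probe) with full-batch gradient descent."""
    dim = len(features[0])
    n = len(features)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # predicted P(plausible)
            err = p - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(w, b, features, labels):
    """Fraction of examples whose label matches the sign of the linear score."""
    correct = 0
    for x, y in zip(features, labels):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        correct += int((z > 0) == (y == 1))
    return correct / len(labels)


if __name__ == "__main__":
    train_x, train_y = make_synthetic_features(200, 8, seed=0)
    test_x, test_y = make_synthetic_features(100, 8, seed=1)
    w, b = train_linear_probe(train_x, train_y)
    print("held-out probe accuracy:", probe_accuracy(w, b, test_x, test_y))
```

With real features, one would presumably repeat this for features obtained with different numbers of inversion steps and compare held-out accuracy. The note's claim is that the probe only works well when enough steps are used.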