Curious whether video generation models (like #SORA) qualify as world models? We conduct a systematic study to answer this question by investigating whether a video gen model is able to learn physical laws. There are three key messages to take home: 1⃣ The model generalises perfectly for in-distribution data, but...
10 comments

The video was created by @raylu_THU, who consistently provides insightful discussions for analyzing experiments. Great Work!

The huggingface paper page:

Very cool paper. I guess one reason behind color > size > velocity > shape is that, in your dataset, the color attribute affects lots of pixels and so influences the L2 diffusion loss a lot.

Yeah, we do have a similar hypothesis; check the Open Discussion section on our project page for details.
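
To make the pixel-count argument above concrete, here is a toy sketch in plain Python. It is not the paper's setup: the 32x32 canvas, the `render` and `l2_distance` helpers, and the chosen sizes and colors are all assumptions made for illustration, and plain pixel-space mean squared error stands in for the diffusion loss. Velocity is left out because it needs more than one frame.

```python
def render(shape, half_extent, color, side=32):
    """Draw one filled object centred on a black side x side RGB canvas."""
    c = side // 2
    image = []
    for y in range(side):
        row = []
        for x in range(side):
            dx, dy = x - c, y - c
            if shape == "circle":
                inside = dx * dx + dy * dy <= half_extent * half_extent
            else:  # "square"
                inside = abs(dx) <= half_extent and abs(dy) <= half_extent
            row.append(color if inside else (0.0, 0.0, 0.0))
        image.append(row)
    return image


def l2_distance(a, b):
    """Mean squared error over all pixels and channels."""
    total, count = 0.0, 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            for ca, cb in zip(pa, pb):
                total += (ca - cb) ** 2
                count += 1
    return total / count


# Hypothetical attribute values, chosen only for this illustration.
RED, BLUE = (1.0, 0.0, 0.0), (0.0, 0.0, 1.0)
base = render("circle", 8, RED)

# Change exactly one attribute at a time relative to the base image.
color_change = l2_distance(base, render("circle", 8, BLUE))
size_change = l2_distance(base, render("circle", 9, RED))
shape_change = l2_distance(base, render("square", 7, RED))

print(f"color: {color_change:.4f}  size: {size_change:.4f}  shape: {shape_change:.4f}")
```

In this toy case the color swap changes every pixel inside the object, while the size and shape changes only touch pixels near its boundary, so the color error comes out largest. The exact numbers depend entirely on the values picked here and say nothing about the ranking reported in the paper.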

The study shows we're not there yet. These models don’t grasp the ‘rules’ of physics as a true world model would. But early text models had similar limits – stuck in mimicry until they broke through to real generalization. There’s a good chance future video models will follow the same path.

Were the models also given task prompts as text to 'explain' the task to the model? Large image generation models have demonstrated language task ability so it's possible latent understanding / steer-ability exists in video generation models as well.

Actually not, as each model is often trained for one task. However, we did try using internal states of a physical event (e.g., a language description of size and velocity) as a prompt to the model. This often gave worse OOD generalization.

Thrilled to see this work come to life after 7-8 months of deep thinking about #SORA and its connection to physical laws. This paper has been my most demanding yet rewarding project. Proud to have been part of this journey! Check out our findings via the video, website and paper.

The paper explores whether video generation models can discover fundamental physical laws by merely observing visual data, without any human priors. This is an important question as video generation is seen as a promising path towards building scalable world models that can accurately simulate the physical world. The researchers' analysis revealed two key insights about the generalization mechanisms of the video generation models: 1. The models rely more on memorization and case-based imitation, rather than abstracting universal physical rules. 2. The models prioritize certain attributes (color > size > velocity > shape) when referencing training data during generalization, which may explain their difficulties in maintaining object consistency. full paper:

Extremely reminiscent of this earlier work, which observed many of the same issues in a simpler setting and domain.

