Curious whether video generation models (like #SORA) qualify as world models? We conduct a systematic study to answer this question by investigating whether a video generation model is able to learn physical laws. There are three key messages to take home: 1️⃣ The model generalises perfectly for in-distribution data, but...

606,519 views • 1 year ago • via X (Twitter)

10 comments

Bingyi Kang • 1 year ago

The video was created by @raylu_THU, who consistently provides insightful discussions for analyzing experiments. Great Work!

Bingyi Kang • 1 year ago

The Hugging Face paper page:

Tianyuan Zhang • 1 year ago

Very cool paper. I guess one reason behind color > size > velocity > shape is that, in your dataset, the color attribute affects lots of pixels and so influences the L2 diffusion loss a lot.

Bingyi Kang • 1 year ago

Yeah, we do have a similar hypothesis. Check the Open Discussion section on our project page for details.

Tommy Nic • 1 year ago

The study shows we're not there yet. These models don’t grasp the ‘rules’ of physics as a true world model would. But early text models had similar limits – stuck in mimicry until they broke through to real generalization. There’s a good chance future video models will follow the same path.

iandanforth 🦋 @iandanforth.bsky.social • 1 year ago

Were the models also given task prompts as text to 'explain' the task to the model? Large image generation models have demonstrated language task ability so it's possible latent understanding / steer-ability exists in video generation models as well.

Bingyi Kang • 1 year ago

Actually not, as each model is often trained for one task. However, we did try using the internal states of a physical event (e.g., a language description of size and velocity) as a prompt to the model. This often gives worse OOD generalization.

Yang Yue • 1 year ago

Thrilled to see this work come to life after 7-8 months of deep thinking about #SORA and its connection to physical laws. This paper has been my most demanding yet rewarding project. Proud to have been part of this journey! Check out our findings via the video, website and paper.

BensenHsu • 1 year ago

The paper explores whether video generation models can discover fundamental physical laws by merely observing visual data, without any human priors. This is an important question, as video generation is seen as a promising path towards building scalable world models that can accurately simulate the physical world. The researchers' analysis revealed two key insights about the generalization mechanisms of video generation models: 1. The models rely more on memorization and case-based imitation than on abstracting universal physical rules. 2. The models prioritize certain attributes (color > size > velocity > shape) when referencing training data during generalization, which may explain their difficulties in maintaining object consistency. Full paper:

david glukhov • 1 year ago

Extremely reminiscent of this earlier work which observed many of the same issues in a simpler setting and domain
