Curious whether video generation models (like #SORA) qualify as world models? We conduct a systematic study to answer this question by investigating whether a video gen model is able to learn physical laws. There are three key messages to take home: 1⃣The model generalises perfectly for in-distribution data, but...

606,600 views • 1 year ago • via X (Twitter)

10 comments

Bingyi Kang • 1 year ago

The video was created by @raylu_THU, who consistently provides insightful discussions for analyzing experiments. Great Work!

Bingyi Kang • 1 year ago

The Hugging Face paper page:

Tianyuan Zhang • 1 year ago

Very cool paper. I guess one reason behind color > size > velocity > shape is that, in your dataset, the color attribute affects lots of pixels and influences the L2 diffusion loss a lot.
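
The pixel-count argument in this comment can be illustrated with a toy example. The sketch below is not the paper's setup: it assumes a single centered object on a black 64x64 frame and uses plain pixel-space mean squared error as a stand-in for the L2 diffusion loss (the real loss is computed on the model's denoising target). The helper names `render` and `l2` are hypothetical.

```python
import numpy as np

def render(shape, color, size=64, radius=12):
    """Render one RGB frame with a single centered object on a black background."""
    yy, xx = np.mgrid[:size, :size]
    cy = cx = size // 2
    if shape == "circle":
        mask = (yy - cy) ** 2 + (xx - cx) ** 2 <= radius ** 2
    elif shape == "square":
        mask = (np.abs(yy - cy) <= radius) & (np.abs(xx - cx) <= radius)
    else:
        raise ValueError(f"unknown shape: {shape}")
    frame = np.zeros((size, size, 3), dtype=np.float64)
    frame[mask] = color  # paint every object pixel with the RGB color
    return frame

def l2(a, b):
    """Mean squared error over all pixels and channels (toy stand-in for the loss)."""
    return float(np.mean((a - b) ** 2))

RED = (1.0, 0.0, 0.0)
BLUE = (0.0, 0.0, 1.0)

reference = render("circle", RED)
color_error = l2(reference, render("circle", BLUE))   # same shape, wrong color
shape_error = l2(reference, render("square", RED))    # same color, wrong shape

print(f"color mismatch L2: {color_error:.4f}")
print(f"shape mismatch L2: {shape_error:.4f}")
```

In this toy setting a color mismatch changes every object pixel, while a shape mismatch only changes the pixels near the boundary, so the color error is the larger of the two. Whether the same effect drives the ordering observed in the paper is the commenter's hypothesis, not something this sketch establishes.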

Bingyi Kang • 1 year ago

Yeah, we do have a similar hypothesis. Check the Open Discussion section on our project page for details.

Tommy Nic • 1 year ago

The study shows we're not there yet. These models don’t grasp the ‘rules’ of physics as a true world model would. But early text models had similar limits – stuck in mimicry until they broke through to real generalization. There’s a good chance future video models will follow the same path.

iandanforth 🦋 @iandanforth.bsky.social • 1 year ago

Were the models also given task prompts as text to 'explain' the task to the model? Large image generation models have demonstrated language task ability so it's possible latent understanding / steer-ability exists in video generation models as well.

Bingyi Kang • 1 year ago

Actually not, as each model is often trained for one task. However, we did try using the internal states of a physical event (e.g., a language description of size and velocity) as a prompt to the model. These prompts often give worse OOD generalization.

Yang Yue • 1 year ago

Thrilled to see this work come to life after 7-8 months of deep thinking about #SORA and its connection to physical laws. This paper has been my most demanding yet rewarding project. Proud to have been part of this journey! Check out our findings via the video, website and paper.

BensenHsu • 1 year ago

The paper explores whether video generation models can discover fundamental physical laws by merely observing visual data, without any human priors. This is an important question as video generation is seen as a promising path towards building scalable world models that can accurately simulate the physical world. The researchers' analysis revealed two key insights about the generalization mechanisms of the video generation models: 1. The models rely more on memorization and case-based imitation, rather than abstracting universal physical rules. 2. The models prioritize certain attributes (color > size > velocity > shape) when referencing training data during generalization, which may explain their difficulties in maintaining object consistency. full paper:

david glukhov • 1 year ago

Extremely reminiscent of this earlier work which observed many of the same issues in a simpler setting and domain

Related videos

Small Language Models (SLM) are the future of AI. "Small" (SLM) instead of "Large" (LLM). These small models are highly specialized models with superhuman abilities on specific tasks.

Here are two techniques to build these models:
• Spectrum
• Model Merging

I give you a short introduction in the attached video, but here is a quick summary:

Spectrum helps us identify the most relevant layers to solve one specific task. We can ignore everything else and focus on fine-tuning these layers. Using Spectrum, we can fine-tune models in a heartbeat.

Model Merging combines multiple models into a unique, much better model than any of the individual input models. You can also combine models specialized in different tasks and get a model with multiple abilities.

This is the state of the art of productizing models. It's what Arcee.ai's platform does behind the scenes. Arcee collaborated with me on this post and is sponsoring it.

There are three main steps to produce a model for your particular use case:
1. You create a dataset by uploading your data.
2. You train a model. At this step, Arcee uses Spectrum and Model Merging to produce a highly specialized model for your task.
3. You can deploy that model to any environment you want.

Three important notes:
• Training process is 2x faster and 2x cheaper than regular fine-tuning.
• Resultant models are smaller and have higher accuracy.
• They create these specialized models from open-source models.

Check this site so you can fully appreciate how this works:

If you want to fine-tune an open-source model, consider Arcee's platform. This is the state of the art.

Santiago

164,162 views • 1 year ago

Google presents Still-Moving: Customized Video Generation without Customized Video Data

Customizing text-to-image (T2I) models has seen tremendous progress recently, particularly in areas such as personalization, stylization, and conditional generation. However, expanding this progress to video generation is still in its infancy, primarily due to the lack of customized video data. In this work, we introduce Still-Moving, a novel generic framework for customizing a text-to-video (T2V) model, without requiring any customized video data. The framework applies to the prominent T2V design where the video model is built over a text-to-image (T2I) model (e.g., via inflation). We assume access to a customized version of the T2I model, trained only on still image data (e.g., using DreamBooth or StyleDrop). Naively plugging in the weights of the customized T2I model into the T2V model often leads to significant artifacts or insufficient adherence to the customization data.

To overcome this issue, we train lightweight Spatial Adapters that adjust the features produced by the injected T2I layers. Importantly, our adapters are trained on "frozen videos" (i.e., repeated images), constructed from image samples generated by the customized T2I model. This training is facilitated by a novel Motion Adapter module, which allows us to train on such static videos while preserving the motion prior of the video model. At test time, we remove the Motion Adapter modules and leave in only the trained Spatial Adapters. This restores the motion prior of the T2V model while adhering to the spatial prior of the customized T2I model.

We demonstrate the effectiveness of our approach on diverse tasks including personalized, stylized, and conditional generation. In all evaluated scenarios, our method seamlessly integrates the spatial prior of the customized T2I model with a motion prior supplied by the T2V model.

AK

40,474 views • 2 years ago

Tencent presents GameGen-O: Open-world Video Game Generation

We introduce GameGen-O, the first diffusion transformer model tailored for the generation of open-world video games. This model facilitates high-quality, open-domain generation by simulating a wide array of game engine features, such as innovative characters, dynamic environments, complex actions, and diverse events. Additionally, it provides interactive controllability, thus allowing for gameplay simulation.

The development of GameGen-O involves a comprehensive data collection and processing effort from scratch. We collect and build the first Open-World Video Game Dataset (OGameData), amassing extensive data from over a hundred next-generation open-world games and employing a proprietary data pipeline for efficient sorting, scoring, filtering, and decoupled captioning. This robust and extensive OGameData forms the foundation of our model's training process.

GameGen-O undergoes a two-stage training process, consisting of foundation model pretraining and instruction tuning. In the first phase, the model is pre-trained on OGameData via text-to-video and video continuation, endowing GameGen-O with the capability for open-domain video game generation. In the second phase, the pre-trained model is frozen and fine-tuned using a trainable InstructNet, which enables the production of subsequent frames based on multimodal structural instructions. This whole training process gives the model the ability to generate and interactively control content.

In summary, GameGen-O represents a notable initial step forward in the realm of open-world video game generation via generative models. It underscores the potential of generative models to serve as an alternative to rendering techniques, efficiently combining creative generation with interactive capabilities.

AK

366,948 views • 1 year ago