We introduce W.A.L.T, a diffusion model for photorealistic video generation. Our model is a transformer trained on image and video generation in a shared latent space. 🧵👇

431,142 views • 2 years ago • via X (Twitter)

10 Comments

Agrim Gupta • 2 years ago

2/ website: Our approach has two key design decisions. First, we use a causal encoder to compress images and videos in a shared latent space.
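A causal encoder is one in which the latent for a frame depends only on that frame and earlier ones. The sketch below illustrates why that property lets images and videos share a latent space: a single image is treated as a one-frame video, and its latent matches the latent of the first frame of any video that starts with the same image. This is a toy, one-dimensional illustration only. The function name `causal_temporal_conv`, the kernel weights, the zero padding, and the scalar "frames" are all assumptions made for the example and are not taken from the thread.

```python
def causal_temporal_conv(frames, kernel):
    """Convolve a sequence of per-frame values along time, using only past frames.

    Assumption: each frame is reduced to one scalar to keep the example small.
    """
    k = len(kernel)
    # Left-pad with zeros so that output[t] depends only on frames[0..t].
    padded = [0.0] * (k - 1) + list(frames)
    return [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(frames))
    ]


kernel = [0.25, 0.25, 0.5]     # hypothetical weights
image = [4.0]                  # a single image, i.e. a one-frame video
video = [4.0, 2.0, 6.0, 8.0]   # a video whose first frame is that image

image_latent = causal_temporal_conv(image, kernel)
video_latent = causal_temporal_conv(video, kernel)

# The first frame is encoded identically in both cases.
assert image_latent[0] == video_latent[0]
```

With a non-causal (centered) kernel, the first latent of the video would also depend on later frames, so it would no longer match the image latent.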

Agrim Gupta • 2 years ago

3/ Second, for memory and training efficiency, we use a window attention based transformer architecture for joint spatial and temporal generative modeling in latent space.
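The efficiency argument for window attention can be made concrete by counting attention pairs. The sketch below partitions a latent grid of tokens into non-overlapping windows and counts query-key pairs when attention is restricted to each window. The grid size, the window shapes, and the helper names `partition_windows` and `attention_pairs` are hypothetical and chosen for illustration; the thread does not state the sizes the model uses.

```python
from itertools import product


def partition_windows(grid, window):
    """Group (t, y, x) token coordinates into non-overlapping windows."""
    (T, H, W), (wt, wh, ww) = grid, window
    assert T % wt == 0 and H % wh == 0 and W % ww == 0, "window must tile the grid"
    windows = {}
    for t, y, x in product(range(T), range(H), range(W)):
        key = (t // wt, y // wh, x // ww)  # index of the window holding this token
        windows.setdefault(key, []).append((t, y, x))
    return list(windows.values())


def attention_pairs(windows):
    """Count query-key pairs when each token attends only within its window."""
    return sum(len(w) ** 2 for w in windows)


grid = (4, 8, 8)  # hypothetical latent: 4 frames of 8x8 tokens, 256 tokens total

full = attention_pairs(partition_windows(grid, grid))               # one global window
spatial = attention_pairs(partition_windows(grid, (1, 8, 8)))       # one window per frame
spatiotemporal = attention_pairs(partition_windows(grid, (4, 2, 2)))  # small patch across all frames

print(full, spatial, spatiotemporal)
```

In this toy setting, global attention needs 65,536 pairs, per-frame windows need 16,384, and the spatiotemporal windows need 4,096. Alternating window shapes across layers is one way to let information flow both within a frame and across time while keeping each layer cheap.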

Agrim Gupta • 2 years ago

4/ Our model can generate photorealistic, temporally consistent motion from natural language prompts.

Agrim Gupta • 2 years ago

5/ We can also use our model to animate any image.

Agrim Gupta • 2 years ago

6/ Finally, our model can be used to generate videos with consistent 3D camera motion.

Agrim Gupta • 2 years ago

7/ This work was done at @StanfordAILab, @StanfordSVL, @GoogleAI, @Google with amazing collaborators @LijunYu0, @kihyuk_sohn, @laoreja001, @MeeraHahn, @drfeifei, @irrfaan, @roadjiang, @jlezama

TomLikesRobots🤖 • 2 years ago

Results look great - coherent and not much warping. Can inference run on consumer hardware? Will the code and weights be released?

Emad • 2 years ago

Great job, should scale very nicely 👀

Dave Lalande • 2 years ago

The end of Hollywood. It can't come fast enough.

Justin Halford • 2 years ago

Really incredible coherency. The scale to minutes of video and pairing with audio seems quite believable with leaps like this. Kudos to your team.

Related Videos

Google presents Still-Moving: Customized Video Generation without Customized Video Data

Customizing text-to-image (T2I) models has seen tremendous progress recently, particularly in areas such as personalization, stylization, and conditional generation. However, expanding this progress to video generation is still in its infancy, primarily due to the lack of customized video data. In this work, we introduce Still-Moving, a novel generic framework for customizing a text-to-video (T2V) model, without requiring any customized video data. The framework applies to the prominent T2V design where the video model is built over a text-to-image (T2I) model (e.g., via inflation). We assume access to a customized version of the T2I model, trained only on still image data (e.g., using DreamBooth or StyleDrop). Naively plugging in the weights of the customized T2I model into the T2V model often leads to significant artifacts or insufficient adherence to the customization data.

To overcome this issue, we train lightweight Spatial Adapters that adjust the features produced by the injected T2I layers. Importantly, our adapters are trained on "frozen videos" (i.e., repeated images), constructed from image samples generated by the customized T2I model. This training is facilitated by a novel Motion Adapter module, which allows us to train on such static videos while preserving the motion prior of the video model. At test time, we remove the Motion Adapter modules and leave in only the trained Spatial Adapters. This restores the motion prior of the T2V model while adhering to the spatial prior of the customized T2I model.

We demonstrate the effectiveness of our approach on diverse tasks including personalized, stylized, and conditional generation. In all evaluated scenarios, our method seamlessly integrates the spatial prior of the customized T2I model with a motion prior supplied by the T2V model.

AK

40,474 views • 2 years ago

Tencent presents GameGen-O: Open-world Video Game Generation

We introduce GameGen-O, the first diffusion transformer model tailored for the generation of open-world video games. This model facilitates high-quality, open-domain generation by simulating a wide array of game engine features, such as innovative characters, dynamic environments, complex actions, and diverse events. Additionally, it provides interactive controllability, thus allowing for gameplay simulation.

The development of GameGen-O involves a comprehensive data collection and processing effort from scratch. We collect and build the first Open-World Video Game Dataset (OGameData), amassing extensive data from over a hundred next-generation open-world games and employing a proprietary data pipeline for efficient sorting, scoring, filtering, and decoupled captioning. This robust and extensive OGameData forms the foundation of our model's training process.

GameGen-O undergoes a two-stage training process, consisting of foundation model pretraining and instruction tuning. In the first phase, the model is pre-trained on OGameData via text-to-video and video continuation, endowing GameGen-O with the capability for open-domain video game generation. In the second phase, the pre-trained model is frozen and fine-tuned using a trainable InstructNet, which enables the production of subsequent frames based on multimodal structural instructions. This whole training process gives the model the ability to generate and interactively control content.

In summary, GameGen-O represents a notable initial step forward in the realm of open-world video game generation via generative models. It underscores the potential of generative models to serve as an alternative to rendering techniques, efficiently combining creative generation with interactive capabilities.

AK

366,948 views • 1 year ago