We introduce W.A.L.T, a diffusion model for photorealistic video generation. Our model is a transformer trained on image and video generation in a shared latent space. 🧵👇
431,142 views • 2 years ago • via X (Twitter)
Comments: 10

2/ website: Our approach has two key design decisions. First, we use a causal encoder to compress images and videos in a shared latent space.
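
To make the "causal encoder" idea concrete, here is a minimal sketch, not the W.A.L.T encoder itself. It assumes a toy setting with one scalar feature per frame and a hypothetical function name `causal_temporal_conv`. The point it illustrates: when a temporal convolution is padded only on the past side, the output for the first frame never depends on later frames, so a single image (a one-frame "video") and the first frame of a video land on the same latent.

```python
def causal_temporal_conv(frames, kernel):
    """Causal 1D convolution over the time axis (toy, illustrative only).

    frames: list of floats, one scalar feature per frame (an assumption
            made for brevity; real encoders work on feature maps).
    kernel: list of weights; kernel[-1] multiplies the current frame,
            earlier entries multiply progressively older frames.
    """
    k = len(kernel)
    # Pad only on the past side so output[t] sees frames[0..t] and nothing later.
    padded = [0.0] * (k - 1) + list(frames)
    return [
        sum(w * x for w, x in zip(kernel, padded[t:t + k]))
        for t in range(len(frames))
    ]


kernel = [0.25, 0.25, 0.5]
image = [2.0]              # an image treated as a one-frame video
video = [2.0, 5.0, 7.0]    # a video whose first frame equals the image

out_image = causal_temporal_conv(image, kernel)
out_video = causal_temporal_conv(video, kernel)

# The first-frame output is identical in both cases.
print(out_image[0] == out_video[0])
```

Because the first output is unaffected by future frames, images and videos can share one encoder and one latent space in this toy setup.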

3/ Second, for memory and training efficiency, we use a window attention based transformer architecture for joint spatial and temporal generative modeling in latent space.
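
The efficiency argument for window attention can be sketched by counting pairwise attention scores. This is an illustration under assumptions, not the model's code: the latent grid size `(4, 8, 8)`, the window shapes, and the helper names `window_partition` and `attention_pairs` are all hypothetical. Tokens attend only to other tokens in the same non-overlapping window, so the number of scores falls from N² to roughly N times the window size.

```python
from itertools import product


def window_partition(shape, window):
    """Group token coordinates of a (T, H, W) latent grid into
    non-overlapping windows of size (wt, wh, ww)."""
    T, H, W = shape
    wt, wh, ww = window
    assert T % wt == 0 and H % wh == 0 and W % ww == 0, "window must tile the grid"
    groups = {}
    for t, h, w in product(range(T), range(H), range(W)):
        key = (t // wt, h // wh, w // ww)  # index of the window this token falls in
        groups.setdefault(key, []).append((t, h, w))
    return list(groups.values())


def attention_pairs(windows):
    """Number of query-key scores when attention is restricted to each window."""
    return sum(len(win) ** 2 for win in windows)


shape = (4, 8, 8)  # assumed latent grid: 4 frames of 8x8 tokens, 256 tokens total

full = attention_pairs(window_partition(shape, shape))                # one global window
spatial = attention_pairs(window_partition(shape, (1, 8, 8)))         # one window per frame
spatiotemporal = attention_pairs(window_partition(shape, (4, 4, 4)))  # windows spanning time

print(full, spatial, spatiotemporal)
```

In this toy configuration, both windowed variants compute a quarter of the scores that full attention would. The spatial windows mix information within a frame and the spatiotemporal windows mix it across frames.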

4/ Our model can generate photorealistic, temporally consistent motion from natural language prompts.

5/ We can also use our model to animate any image.

6/ Finally, our model can be used to generate videos with consistent 3D camera motion.

7/ This work was done at @StanfordAILab, @StanfordSVL, @GoogleAI, @Google with amazing collaborators @LijunYu0, @kihyuk_sohn, @laoreja001, @MeeraHahn, @drfeifei, @irrfaan, @roadjiang, @jlezama

Results look great - coherent and not much warping. Can inference run on consumer hardware? Will the code and weights be released?

Great job, should scale very nicely 👀

The end of Hollywood. It can't come fast enough.

Really incredible coherency. The scale to minutes of video and pairing with audio seems quite believable with leaps like this. Kudos to your team.
