Loading video...

Video Failed to Load

There was a problem loading this video. This could be due to a temporary network issue or the video might be unavailable.

We introduce W.A.L.T, a diffusion model for photorealistic video generation. Our model is a transformer trained on image and video generation in a shared latent space. 🧵👇

Agrim Gupta

5,545 subscribers

431,142 views • 2 years ago •via X (Twitter)

Arts Education Science & Technology

Anya Rossi• Live Now

Private livecam show

10 Comments

Agrim Gupta2 years ago

2/ website: Our approach has two key design decisions. First, we use a causal encoder to compress images and videos in a shared latent space.

Agrim Gupta2 years ago

3/ Second, for memory and training efficiency, we use a window attention based transformer architecture for joint spatial and temporal generative modeling in latent space.

Agrim Gupta2 years ago

4/ Our model can generate photorealistic, temporally consistent motion from natural language prompts.

Agrim Gupta2 years ago

5/ We can also use our model to animate any image.

Agrim Gupta2 years ago

6/ Finally, our model can be used to generate videos with consistent 3D camera motion.

Agrim Gupta2 years ago

7/ This work was done at @StanfordAILab, @StanfordSVL, @GoogleAI, @Google with amazing collaborators @LijunYu0, @kihyuk_sohn, @laoreja001, @MeeraHahn, @drfeifei, @irrfaan, @roadjiang, @jlezama

TomLikesRobots🤖2 years ago

Results look great - coherent and not much warping. Can inference run on consumer hardware? Will the code and weights be released?

Emad2 years ago

Great job, should scale very nicely 👀

Dave Lalande2 years ago

The end of Hollywood. It can't come fast enough.

Justin Halford2 years ago

Really incredible coherency. The scale to minutes of video and pairing with audio seems quite believable with leaps like this. Kudos to your team.

Related Videos

The video is a Llama v1 7B model implemented in MLX and running on an M2 Ultra. More here: * Train a Transformer LM or fine-tune with LoRA * Text generation with Mistral * Image generation with Stable Diffusion * Speech recognition with Whisper

The video is a Llama v1 7B model implemented in MLX and running on an M2 Ultra. More here: * Train a Transformer LM or fine-tune with LoRA * Text generation with Mistral * Image generation with Stable Diffusion * Speech recognition with Whisper

Awni Hannun

66,565 views • 2 years ago

MotionStreamer announced on Hugging Face Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space

MotionStreamer announced on Hugging Face Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space

AK

43,387 views • 1 year ago

🎉 Announcing v1.0 🥳 - 95 second music generation - Latent diffusion model (like stable diffusion) - Can do genre fusions - Trained by Harmonai

🎉 Announcing v1.0 🥳 - 95 second music generation - Latent diffusion model (like stable diffusion) - Can do genre fusions - Trained by Harmonai

dadabots

58,813 views • 2 years ago

Congrats to LemonSlice on their $10.5M seed! They built the world's first interactive talking AI video model—a face layer for voice agents. Trained on a custom, 20B-parameter video diffusion transformer, streaming at 20fps on a single GPU. Infinite-length video generation with no error accumulation.

Congrats to LemonSlice on their $10.5M seed! They built the world's first interactive talking AI video model—a face layer for voice agents. Trained on a custom, 20B-parameter video diffusion transformer, streaming at 20fps on a single GPU. Infinite-length video generation with no error accumulation.

Y Combinator

50,846 views • 6 months ago

Champ Controllable and Consistent Human Image Animation with 3D Parametric Guidance In this study, we introduce a methodology for human image animation by leveraging a 3D human parametric model within a latent diffusion framework to enhance shape alignment and motion

Champ Controllable and Consistent Human Image Animation with 3D Parametric Guidance In this study, we introduce a methodology for human image animation by leveraging a 3D human parametric model within a latent diffusion framework to enhance shape alignment and motion

AK

194,356 views • 2 years ago

A breakthrough in real-time video generation. As a research preview developed with NVIDIA and shared at NVIDIAGTC this week, we trained a new real-time video model running on Vera Rubin. HD videos generate instantly, with time-to-first-frame under 100ms. Unlocking an entirely new creative paradigm and bolstering the foundations of our General World Model, GWM-1. Real-time generation opens a fundamentally different design space for video models and world simulation. We're investing in co-designing our models alongside advances in hardware to keep pushing this frontier.

A breakthrough in real-time video generation. As a research preview developed with NVIDIA and shared at NVIDIAGTC this week, we trained a new real-time video model running on Vera Rubin. HD videos generate instantly, with time-to-first-frame under 100ms. Unlocking an entirely new creative paradigm and bolstering the foundations of our General World Model, GWM-1. Real-time generation opens a fundamentally different design space for video models and world simulation. We're investing in co-designing our models alongside advances in hardware to keep pushing this frontier.

Runway

1,161,699 views • 3 months ago

Meta announces Movie Gen A Cast of Media Foundation Models We present Movie Gen, a cast of foundation models that generates high-quality, 1080p HD videos with different aspect ratios and synchronized audio. We also show additional capabilities such as precise instruction-based video editing and generation of personalized videos based on a user’s image. Our models set a new state-of-the-art on multiple tasks: text-to-video synthesis, video personalization, video editing, video-to-audio generation, and text-to-audio generation. Our largest video generation model is a 30B parameter transformer trained with a maximum context length of 73K video tokens, corresponding to a generated video of 16 seconds at 16 frames-per-second. We show multiple technical innovations and simplifications on the architecture, latent spaces, training objectives and recipes, data curation, evaluation protocols, parallelization techniques, and inference optimizations that allow us to reap the benefits of scaling pre-training data, model size, and training compute for training large scale media generation models. We hope this paper helps the research community to accelerate progress and innovation in media generation models

Meta announces Movie Gen A Cast of Media Foundation Models We present Movie Gen, a cast of foundation models that generates high-quality, 1080p HD videos with different aspect ratios and synchronized audio. We also show additional capabilities such as precise instruction-based video editing and generation of personalized videos based on a user’s image. Our models set a new state-of-the-art on multiple tasks: text-to-video synthesis, video personalization, video editing, video-to-audio generation, and text-to-audio generation. Our largest video generation model is a 30B parameter transformer trained with a maximum context length of 73K video tokens, corresponding to a generated video of 16 seconds at 16 frames-per-second. We show multiple technical innovations and simplifications on the architecture, latent spaces, training objectives and recipes, data curation, evaluation protocols, parallelization techniques, and inference optimizations that allow us to reap the benefits of scaling pre-training data, model size, and training compute for training large scale media generation models. We hope this paper helps the research community to accelerate progress and innovation in media generation models

AK

62,719 views • 1 year ago

What if a foundation model could align histology, spatial biology & clinical data to reveal latent biomedical insights? 🚀 Introducing Haiku — a tri-modal foundation model trained on 26.7M+ spatial proteomics patches with matched H&E and clinical text, aligned in one shared embedding space. 📄 🧵👇

What if a foundation model could align histology, spatial biology & clinical data to reveal latent biomedical insights? 🚀 Introducing Haiku — a tri-modal foundation model trained on 26.7M+ spatial proteomics patches with matched H&E and clinical text, aligned in one shared embedding space. 📄 🧵👇

Zhi Huang

13,236 views • 2 months ago

We're releasing Paris 2.0, which, to our knowledge, is the world's first decentralized trained video generation model. We benchmarked it against a monolithic model trained on the same data and compute budget, and Paris 2.0 outperformed the monolithic by ~2x on FVD benchmark.

We're releasing Paris 2.0, which, to our knowledge, is the world's first decentralized trained video generation model. We benchmarked it against a monolithic model trained on the same data and compute budget, and Paris 2.0 outperformed the monolithic by ~2x on FVD benchmark.

Bidhan Roy

453,897 views • 1 month ago

LeVERB is a VLA framework for humanoid whole-body control, combining a vision-language model and a low-level controller via a shared latent action space, trained entirely in sim, deployed zero shot.

LeVERB is a VLA framework for humanoid whole-body control, combining a vision-language model and a low-level controller via a shared latent action space, trained entirely in sim, deployed zero shot.

The Humanoid Hub

10,233 views • 1 year ago

Google presents Still-Moving Customized Video Generation without Customized Video Data Customizing text-to-image (T2I) models has seen tremendous progress recently, particularly in areas such as personalization, stylization, and conditional generation. However, expanding this progress to video generation is still in its infancy, primarily due to the lack of customized video data. In this work, we introduce Still-Moving, a novel generic framework for customizing a text-to-video (T2V) model, without requiring any customized video data. The framework applies to the prominent T2V design where the video model is built over a text-to-image (T2I) model (e.g., via inflation). We assume access to a customized version of the T2I model, trained only on still image data (e.g., using DreamBooth or StyleDrop). Naively plugging in the weights of the customized T2I model into the T2V model often leads to significant artifacts or insufficient adherence to the customization data. To overcome this issue, we train lightweight Spatial Adapters that adjust the features produced by the injected T2I layers. Importantly, our adapters are trained on "frozen videos" (i.e., repeated images), constructed from image samples generated by the customized T2I model. This training is facilitated by a novel Motion Adapter module, which allows us to train on such static videos while preserving the motion prior of the video model. At test time, we remove the Motion Adapter modules and leave in only the trained Spatial Adapters. This restores the motion prior of the T2V model while adhering to the spatial prior of the customized T2I model. We demonstrate the effectiveness of our approach on diverse tasks including personalized, stylized, and conditional generation. In all evaluated scenarios, our method seamlessly integrates the spatial prior of the customized T2I model with a motion prior supplied by the T2V model.

Google presents Still-Moving Customized Video Generation without Customized Video Data Customizing text-to-image (T2I) models has seen tremendous progress recently, particularly in areas such as personalization, stylization, and conditional generation. However, expanding this progress to video generation is still in its infancy, primarily due to the lack of customized video data. In this work, we introduce Still-Moving, a novel generic framework for customizing a text-to-video (T2V) model, without requiring any customized video data. The framework applies to the prominent T2V design where the video model is built over a text-to-image (T2I) model (e.g., via inflation). We assume access to a customized version of the T2I model, trained only on still image data (e.g., using DreamBooth or StyleDrop). Naively plugging in the weights of the customized T2I model into the T2V model often leads to significant artifacts or insufficient adherence to the customization data. To overcome this issue, we train lightweight Spatial Adapters that adjust the features produced by the injected T2I layers. Importantly, our adapters are trained on "frozen videos" (i.e., repeated images), constructed from image samples generated by the customized T2I model. This training is facilitated by a novel Motion Adapter module, which allows us to train on such static videos while preserving the motion prior of the video model. At test time, we remove the Motion Adapter modules and leave in only the trained Spatial Adapters. This restores the motion prior of the T2V model while adhering to the spatial prior of the customized T2I model. We demonstrate the effectiveness of our approach on diverse tasks including personalized, stylized, and conditional generation. In all evaluated scenarios, our method seamlessly integrates the spatial prior of the customized T2I model with a motion prior supplied by the T2V model.

AK

40,474 views • 2 years ago

(1/10) 🔥Thrilled to introduce OneDiffusion—our latest work in unified diffusion modeling! 🚀 This model bridges the gap between image synthesis and understanding, excelling in a wide range of tasks: T2I, conditional generation, image understanding, identity preservation, multiview generation, and even camera pose estimation. Learn more at: Project: arXiv: Code (on the way):

(1/10) 🔥Thrilled to introduce OneDiffusion—our latest work in unified diffusion modeling! 🚀 This model bridges the gap between image synthesis and understanding, excelling in a wide range of tasks: T2I, conditional generation, image understanding, identity preservation, multiview generation, and even camera pose estimation. Learn more at: Project: arXiv: Code (on the way):

Jiasen Lu

33,383 views • 1 year ago

3D controllable video generation works well now and pixel alignment is impressive! We took our new 3D sculpt model, generated a blender scene and then asked a video-to-video model to render. Image to 3D sculpt: Common Sense Machines Synthetic rendering: Blender 🔶 Video-to-Video: Runway

3D controllable video generation works well now and pixel alignment is impressive! We took our new 3D sculpt model, generated a blender scene and then asked a video-to-video model to render. Image to 3D sculpt: Common Sense Machines Synthetic rendering: Blender 🔶 Video-to-Video: Runway

Common Sense Machines

18,714 views • 1 year ago

V3D Video Diffusion Models are Effective 3D Generators Automatic 3D generation has recently attracted widespread attention. Recent methods have greatly accelerated the generation speed, but usually produce less-detailed objects due to limited model capacity or 3D data. Motivated by recent advancements in video diffusion models, we introduce V3D, which leverages the world simulation capacity of pre-trained video diffusion models to facilitate 3D generation. To fully unleash the potential of video diffusion to perceive the 3D world, we further introduce geometrical consistency prior and extend the video diffusion model to a multi-view consistent 3D generator. Benefiting from this, the state-of-the-art video diffusion model could be fine-tuned to generate 360degree orbit frames surrounding an object given a single image. With our tailored reconstruction pipelines, we can generate high-quality meshes or 3D Gaussians within 3 minutes. Furthermore, our method can be extended to scene-level novel view synthesis, achieving precise control over the camera path with sparse input views. Extensive experiments demonstrate the superior performance of the proposed approach, especially in terms of generation quality and multi-view consistency

V3D Video Diffusion Models are Effective 3D Generators Automatic 3D generation has recently attracted widespread attention. Recent methods have greatly accelerated the generation speed, but usually produce less-detailed objects due to limited model capacity or 3D data. Motivated by recent advancements in video diffusion models, we introduce V3D, which leverages the world simulation capacity of pre-trained video diffusion models to facilitate 3D generation. To fully unleash the potential of video diffusion to perceive the 3D world, we further introduce geometrical consistency prior and extend the video diffusion model to a multi-view consistent 3D generator. Benefiting from this, the state-of-the-art video diffusion model could be fine-tuned to generate 360degree orbit frames surrounding an object given a single image. With our tailored reconstruction pipelines, we can generate high-quality meshes or 3D Gaussians within 3 minutes. Furthermore, our method can be extended to scene-level novel view synthesis, achieving precise control over the camera path with sparse input views. Extensive experiments demonstrate the superior performance of the proposed approach, especially in terms of generation quality and multi-view consistency

AK

31,997 views • 2 years ago

We're super excited to introduce Zeno-1, our advanced image generation model, with a bunch of control tools, all available on Shakker AI!

We're super excited to introduce Zeno-1, our advanced image generation model, with a bunch of control tools, all available on Shakker AI!

ShakkerAI

27,296 views • 1 year ago

One image + text + camera trajectory = controllable worlds. All on a single GPU. Our research team just released SANA-WM, a 2.6B open source world model natively trained for 60-second video generation with precise camera control.

One image + text + camera trajectory = controllable worlds. All on a single GPU. Our research team just released SANA-WM, a 2.6B open source world model natively trained for 60-second video generation with precise camera control.

NVIDIA AI

92,602 views • 1 month ago

Can we simplify video generation by decomposing it into interleaved text-video co-generation? Would explicit, repeated thinking in language improve generation in pixels? We introduce TV2TV: a unified model that jointly learns - language modeling (next-token prediction) - video flow matching (next-frame prediction) At inference, TV2TV dynamically alternates between textual thinking and video generation. Model generations below: interleaved text plans and video slices (~1–2s) are co-generated over time, conditioned on a single frame per sport. 📖

Can we simplify video generation by decomposing it into interleaved text-video co-generation? Would explicit, repeated thinking in language improve generation in pixels? We introduce TV2TV: a unified model that jointly learns - language modeling (next-token prediction) - video flow matching (next-frame prediction) At inference, TV2TV dynamically alternates between textual thinking and video generation. Model generations below: interleaved text plans and video slices (~1–2s) are co-generated over time, conditioned on a single frame per sport. 📖

Xiaochuang Han

31,095 views • 6 months ago

Make Pixels Dance: High-Dynamic Video Generation paper page: Creating high-dynamic videos such as motion-rich actions and sophisticated visual effects poses a significant challenge in the field of artificial intelligence. Unfortunately, current state-of-the-art video generation methods, primarily focusing on text-to-video generation, tend to produce video clips with minimal motions despite maintaining high fidelity. We argue that relying solely on text instructions is insufficient and suboptimal for video generation. In this paper, we introduce PixelDance, a novel approach based on diffusion models that incorporates image instructions for both the first and last frames in conjunction with text instructions for video generation. Comprehensive experimental results demonstrate that PixelDance trained with public data exhibits significantly better proficiency in synthesizing videos with complex scenes and intricate motions, setting a new standard for video generation.

Make Pixels Dance: High-Dynamic Video Generation paper page: Creating high-dynamic videos such as motion-rich actions and sophisticated visual effects poses a significant challenge in the field of artificial intelligence. Unfortunately, current state-of-the-art video generation methods, primarily focusing on text-to-video generation, tend to produce video clips with minimal motions despite maintaining high fidelity. We argue that relying solely on text instructions is insufficient and suboptimal for video generation. In this paper, we introduce PixelDance, a novel approach based on diffusion models that incorporates image instructions for both the first and last frames in conjunction with text instructions for video generation. Comprehensive experimental results demonstrate that PixelDance trained with public data exhibits significantly better proficiency in synthesizing videos with complex scenes and intricate motions, setting a new standard for video generation.

AK

101,655 views • 2 years ago

Tencent presents GameGen-O Open-world Video Game Generation We introduce GameGen-O, the first diffusion transformer model tailored for the generation of open-world video games. This model facilitates high-quality, open-domain generation by simulating a wide array of game engine features, such as innovative characters, dynamic environments, complex actions, and diverse events. Additionally, it provides interactive controllability, thus allowing for the gameplay simulation. The development of GameGen-O involves a comprehensive data collection and processing effort from scratch. We collect and build the first Open-World Video Game Dataset (OGameData), amassed extensive data from over a hundred of next-generation open-world games, employing a proprietary data pipeline for efficient sorting, scoring, filtering, and decoupled captioning. This robust and extensive OGameData forms the foundation of our model's training process. GameGen-O undergoes a two-stage training process, consisting of foundation model pretraining and instruction tuning. In the first phase, the model is pre-trained on the OGameData via the text-to-video and video continuation, endowing GameGen-O with the capability for open-domain video game generation. In the second phase, the pre-trained model is frozen, and we fine-tuned using a trainable InstructNet, which enables the production of subsequent frames based on multimodal structural instructions. This whole training process imparts the model with the ability to generate and interactively control content. In summary, GameGen-O represents a notable initial step forward in the realm of open-world video game generation via generative models. It underscores the potential of generative models to serve as an alternative to rendering techniques, which can efficiently combine creative generation with interactive capabilities.

Tencent presents GameGen-O Open-world Video Game Generation We introduce GameGen-O, the first diffusion transformer model tailored for the generation of open-world video games. This model facilitates high-quality, open-domain generation by simulating a wide array of game engine features, such as innovative characters, dynamic environments, complex actions, and diverse events. Additionally, it provides interactive controllability, thus allowing for the gameplay simulation. The development of GameGen-O involves a comprehensive data collection and processing effort from scratch. We collect and build the first Open-World Video Game Dataset (OGameData), amassed extensive data from over a hundred of next-generation open-world games, employing a proprietary data pipeline for efficient sorting, scoring, filtering, and decoupled captioning. This robust and extensive OGameData forms the foundation of our model's training process. GameGen-O undergoes a two-stage training process, consisting of foundation model pretraining and instruction tuning. In the first phase, the model is pre-trained on the OGameData via the text-to-video and video continuation, endowing GameGen-O with the capability for open-domain video game generation. In the second phase, the pre-trained model is frozen, and we fine-tuned using a trainable InstructNet, which enables the production of subsequent frames based on multimodal structural instructions. This whole training process imparts the model with the ability to generate and interactively control content. In summary, GameGen-O represents a notable initial step forward in the realm of open-world video game generation via generative models. It underscores the potential of generative models to serve as an alternative to rendering techniques, which can efficiently combine creative generation with interactive capabilities.

AK

367,000 views • 1 year ago

Video diffusion models generate high-quality videos but are too slow for interactive applications. We MIT CSAIL Adobe Research introduce CausVid, a fast autoregressive video diffusion model that starts playing the moment you hit "Generate"! A thread 🧵

Video diffusion models generate high-quality videos but are too slow for interactive applications. We MIT CSAIL Adobe Research introduce CausVid, a fast autoregressive video diffusion model that starts playing the moment you hit "Generate"! A thread 🧵

Tianwei Yin

83,851 views • 1 year ago