Diffusion models generate high-quality images but require hundreds of forward passes. MIT CSAIL and Adobe Research introduce Distribution Matching Distillation (DMD), a distillation approach that converts costly multi-step diffusion models into fast one-step generators. A thread 🧵

MIT CSAIL

344,248 subscribers

34,347 views • 2 years ago • via X (Twitter)

Science & Technology

9 Comments

MIT CSAIL • 2 years ago

DMD trains a one-step generator that maps random noise to realistic images. Training has two key components. First up: a regression loss anchors the mapping, giving the image space a coarse organization and stabilizing training.
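
A toy sketch of what a regression loss of this kind can look like, not the authors' implementation. Everything here is a hypothetical stand-in: `teacher` plays the role of the multi-step diffusion sampler, the student is a linear one-step map `w * z + b`, and plain squared error is used because the thread does not say which distance the loss uses.

```python
import random

def teacher(z, steps=50):
    # Hypothetical stand-in for a multi-step sampler: starts at the noise z
    # and moves toward 2*z + 1 over `steps` small deterministic updates.
    x = z
    target = 2.0 * z + 1.0
    for _ in range(steps):
        x += 0.2 * (target - x)
    return x

def regression_loss(w, b, pairs):
    # Mean squared error between one-step student outputs and teacher outputs
    # (squared error is an assumption; the thread does not name the distance).
    return sum((w * z + b - y) ** 2 for z, y in pairs) / len(pairs)

def train_student(pairs, lr=0.1, iters=500):
    # Plain gradient descent on the regression loss for the student w*z + b.
    w, b = 0.0, 0.0
    n = len(pairs)
    for _ in range(iters):
        grad_w = sum(2.0 * (w * z + b - y) * z for z, y in pairs) / n
        grad_b = sum(2.0 * (w * z + b - y) for z, y in pairs) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(256)]
pairs = [(z, teacher(z)) for z in noise]  # (noise, teacher output) pairs, computed once
w, b = train_student(pairs)
```

The pairs are computed once up front, so the expensive multi-step teacher never runs inside the training loop; the student only has to reproduce each teacher output from the same noise in a single step.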

MIT CSAIL • 2 years ago

Additionally, it employs a distribution matching loss to guarantee that the likelihood of generating a specific image w/the student model aligns w/its actual frequency of occurrence in the real world.

MIT CSAIL • 2 years ago

The gradient of this loss is formulated as the difference between two diffusion models’ output, trained on real and fake samples respectively.
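
A minimal 1-D sketch of that gradient, under loud assumptions: the "real" distribution is N(3, 1), the generator is the hypothetical map `z + theta`, and both score functions are written analytically instead of being produced by trained diffusion models. In the actual method the two scores would come from networks trained on real and fake samples.

```python
import random

MU_REAL = 3.0  # mean of the toy "real" distribution N(MU_REAL, 1); an assumption

def score_real(x):
    # Gradient of log-density of N(MU_REAL, 1); stands in for the diffusion
    # model trained on real samples.
    return -(x - MU_REAL)

def score_fake(x, theta):
    # Gradient of log-density of the generator's output distribution
    # N(theta, 1); stands in for the diffusion model trained on fake samples.
    return -(x - theta)

def generator(z, theta):
    # Hypothetical one-step generator: shifts unit Gaussian noise by theta.
    return z + theta

def dm_gradient(theta, zs):
    # Monte Carlo estimate of d/dtheta KL(p_fake || p_real):
    # mean of (score_fake - score_real) at generated samples, times
    # dG/dtheta, which equals 1 for this generator.
    total = 0.0
    for z in zs:
        x = generator(z, theta)
        total += (score_fake(x, theta) - score_real(x)) * 1.0
    return total / len(zs)

rng = random.Random(0)
theta = -2.0
for _ in range(200):
    zs = [rng.gauss(0.0, 1.0) for _ in range(64)]
    theta -= 0.1 * dm_gradient(theta, zs)  # gradient descent on the KL
```

For these two Gaussians the score difference reduces to `theta - MU_REAL`, which matches the derivative of the closed-form KL, (theta - MU_REAL)^2 / 2, so the update pulls the generator's output distribution onto the real one.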

MIT CSAIL • 2 years ago

DMD achieves a strong 11.49 FID on zero-shot COCO-30K, comparable to Stable Diffusion v1.5 while being 30X faster. Compared to previous approaches, it notably balances image quality with sample diversity.

MIT CSAIL • 2 years ago

DMD paves the way for real-time visual generation. This same approach could improve diffusion-based generative models across various fields, from design, to scientific discovery and beyond, by significantly enhancing speed and effectiveness.

MIT CSAIL • 2 years ago

Paper: Authors: @TianweiY, @m_gharbi, @rzhang88, @elishechtman, @fredodurand, Bill Freeman, and Taesung Park. Project page: MIT News:

menguzat • 2 years ago

@AdobeResearch will you release the code / model for this?

Prashant • 2 years ago

@AdobeResearch Could this approach of distribution matching loss be applied to other generative AI tasks besides image generation? For example, text generation or music composition?

𝗦𝗼𝘂𝗹𝘀𝗳𝗲𝗻𝗴 𝗡𝗲𝘄 𝗬𝗼𝗿𝗸 • 2 years ago

@rzhang88 @AdobeResearch Good for you, my friend. We are trying to use your colorization model (from 4 years ago) for sneakers now, lol. Thank you for everything.

Related Videos

Video diffusion models generate high-quality videos but are too slow for interactive applications. We (MIT CSAIL and Adobe Research) introduce CausVid, a fast autoregressive video diffusion model that starts playing the moment you hit "Generate"! A thread 🧵

Tianwei Yin

83,714 views • 1 year ago

Presto! Distilling Steps and Layers for Accelerating Music Generation. Despite advances in diffusion-based text-to-music (TTM) methods, efficient, high-quality generation remains a challenge. We introduce Presto!, an approach to inference acceleration for score-based diffusion transformers via reducing both sampling steps and cost per step. To reduce steps, we develop a new score-based distribution matching distillation (DMD) method for the EDM-family of diffusion models, the first GAN-based distillation method for TTM. To reduce the cost per step, we develop a simple, but powerful improvement to a recent layer distillation method that improves learning via better preserving hidden state variance. Finally, we combine our step and layer distillation methods together for a dual-faceted approach. We evaluate our step and layer distillation methods independently and show each yield best-in-class performance. Our combined distillation method can generate high-quality outputs with improved diversity, accelerating our base model by 10-18x (230/435ms latency for 32 second mono/stereo 44.1kHz, 15x faster than comparable SOTA) -- the fastest high-quality TTM to our knowledge.

AK

30,430 views • 1 year ago

Nvidia presents Articulated Kinematics Distillation from Video Diffusion Models

AK

39,189 views • 1 year ago

📢 Lyra: Generative 3D Scene Reconstruction via Video Diffusion Model Self-Distillation Got only one or a few images and wondering if recovering the 3D environment is a reconstruction or generation problem? Why not do it with a generative reconstruction model! We show that a camera-conditioned video diffusion model can be transformed into a generative reconstruction model that directly outputs a high-quality 3D Gaussian Splatting representation through self-distillation, without requiring real-world training data. Check out our results in the video (wait for dynamic scenes in the second half!) : Project Page: Code and Models: Paper:

Sherwin Bahmani

66,417 views • 8 months ago

We're moving beyond autoregressive LLMs! Autoregressive LLMs generate text word-by-word, which can be slow and affect quality, while diffusion models refine noise step-by-step, allowing for faster iterations and error correction. Here's Gemini Diffusion running at 857 tokens/s:

Akshay 🚀

34,524 views • 1 year ago

AccVideo just dropped on Hugging Face. Accelerating Video Diffusion Model with Synthetic Dataset: presents an efficient distillation method to accelerate video diffusion models with a synthetic dataset. The method is 8.5x faster than HunyuanVideo.

AK

20,633 views • 1 year ago

MVDream: Multi-view Diffusion for 3D Generation paper page: propose MVDream, a multi-view diffusion model that is able to generate geometrically consistent multi-view images from a given text prompt. By leveraging image diffusion models pre-trained on large-scale web datasets and a multi-view dataset rendered from 3D assets, the resulting multi-view diffusion model can achieve both the generalizability of 2D diffusion and the consistency of 3D data. Such a model can thus be applied as a multi-view prior for 3D generation via Score Distillation Sampling, where it greatly improves the stability of existing 2D-lifting methods by solving the 3D consistency problem. Finally, we show that the multi-view diffusion model can also be fine-tuned under a few shot setting for personalized 3D generation, i.e. DreamBooth3D application, where the consistency can be maintained after learning the subject identity.

AK

294,442 views • 2 years ago

Diffusion models make great images. But can they drive robots? Usually that gets complicated really fast. We figured out how to get a Stable Diffusion model (based on Instruct pix2pix) to drive robotic instruction following. Simple recipe, works on a wide range of tasks. Thread👇

Sergey Levine

126,523 views • 2 years ago

This is amazing! You can now create high-quality 3D Scenes from a single image using Multi-Instance Diffusion Models (MIDI) 🔥

Gradio

41,770 views • 1 year ago

Claude distillation has been a big topic this week while I am (coincidentally) writing Chapter 8 on model distillation. In that context, I shared some utilities to generate distillation data from all sorts of open-weight models via OpenRouter and Ollama:

Sebastian Raschka

62,458 views • 3 months ago

V3D: Video Diffusion Models are Effective 3D Generators. Automatic 3D generation has recently attracted widespread attention. Recent methods have greatly accelerated the generation speed, but usually produce less-detailed objects due to limited model capacity or 3D data. Motivated by recent advancements in video diffusion models, we introduce V3D, which leverages the world simulation capacity of pre-trained video diffusion models to facilitate 3D generation. To fully unleash the potential of video diffusion to perceive the 3D world, we further introduce geometrical consistency prior and extend the video diffusion model to a multi-view consistent 3D generator. Benefiting from this, the state-of-the-art video diffusion model could be fine-tuned to generate 360-degree orbit frames surrounding an object given a single image. With our tailored reconstruction pipelines, we can generate high-quality meshes or 3D Gaussians within 3 minutes. Furthermore, our method can be extended to scene-level novel view synthesis, achieving precise control over the camera path with sparse input views. Extensive experiments demonstrate the superior performance of the proposed approach, especially in terms of generation quality and multi-view consistency.

AK

31,997 views • 2 years ago

Oh wow! "[SIGGRAPH '24] DreamMat: High-quality PBR Material Generation with Geometry- and Light-aware Diffusion Models" Paper: Project: Code (MIT) :

MrNeRF

59,480 views • 2 years ago

Introducing ConceptAttention, an approach to interpreting diffusion transformer models! Write a prompt, choose some concepts, generate an image, and get high-quality heatmaps of text concepts. Our method outperforms existing methods like cross attention. Link to demo 👇

Alec Helbling

36,631 views • 1 year ago

9 Key AI Concepts Explained in 7 minutes - Tokenization - Text Decoding - Prompt Engineering - Multi Step AI Agents - RAGs - RLHF - VAE - Diffusion Models - LoRA

Bytebytego

98,387 views • 4 months ago

I'm thrilled to announce the launch of ⚡️Flash Diffusion from Jasper! Earlier this year, with our acquisition of Clipdrop, we launched the Jasper AI Research Lab in Paris. Today, we are excited to release our first piece of groundbreaking research: the open-source distillation method, "Flash Diffusion". Flash Diffusion accelerates inference by 500%, reduces computing costs, and produces higher-quality image outputs. Dive into the details and discover how Flash Diffusion is set to revolutionize the field of AI and image synthesis. Read all about it here: Try a demo on Hugging Face:

Timothy Young

10,062 views • 2 years ago

LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models paper page: github: Recent advancements in text-to-image generation with diffusion models have yielded remarkable results synthesizing highly realistic and diverse images. However, these models still encounter difficulties when generating images from prompts that demand spatial or common sense reasoning. We propose to equip diffusion models with enhanced reasoning capabilities by using off-the-shelf pretrained large language models (LLMs) in a novel two-stage generation process. First, we adapt an LLM to be a text-guided layout generator through in-context learning. When provided with an image prompt, an LLM outputs a scene layout in the form of bounding boxes along with corresponding individual descriptions. Second, we steer a diffusion model with a novel controller to generate images conditioned on the layout. Both stages utilize frozen pretrained models without any LLM or diffusion model parameter optimization. We validate the superiority of our design by demonstrating its ability to outperform the base diffusion model in accurately generating images according to prompts that necessitate both language and spatial reasoning. Additionally, our method naturally allows dialog-based scene specification and is able to handle prompts in a language that is not well-supported by the underlying diffusion model.

AK

83,657 views • 2 years ago

Adobe announced DRAGON on Hugging Face Distributional Rewards Optimize Diffusion Generative Models

AK

24,258 views • 1 year ago

Announcing Diffusion Forcing Transformer (DFoT), our new video diffusion algorithm that generates ultra-long videos of 800+ frames. DFoT enables History Guidance, a simple add-on to any existing video diffusion models for a quality boost. Website: (1/7)

Boyuan Chen

175,996 views • 1 year ago

🌟 Create anything in 3D! 🌟 Introducing CAT3D: a new method that generates high-fidelity 3D scenes from any number of real or generated images in one minute, powered by multi-view diffusion models. w/ lovely coauthors Aleksander Holynski, Ben Poole and an amazing team!

Ruiqi Gao

152,867 views • 2 years ago