Diffusion models are sensitive to small changes in the input noise. We introduce Alias-Free Latent Diffusion Models (𝗔𝗙-𝗟𝗗𝗠) at #CVPR2025. It achieves shift-equivariance and generates consistent outputs. Project: arXiv:

Xingang Pan

3,257 subscribers

42,538 views • 1 year ago •via X (Twitter)

Education Science & Technology #CVPR2025
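
The post does not describe how AF-LDM is built, so the NumPy sketch below only illustrates the property it targets. Circular convolution commutes with shifts. Naive stride-2 subsampling, which is common in latent encoders and U-Nets, stays equivariant only for shifts that land on the coarse grid. A 1-pixel shift has no integer counterpart after downsampling, which is one source of aliasing. The 1D signal, the circular boundary, and the smoothing kernel `k` are illustrative assumptions, not details of the paper.

```python
import numpy as np

def circ_conv(x, k):
    """Circular 1D convolution via FFT; shift-equivariant by construction."""
    kp = np.zeros(len(x))
    kp[: len(k)] = k
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(kp)))

def subsample(x, stride=2):
    """Naive strided downsampling, as in a stride-2 layer without anti-aliasing."""
    return x[::stride]

def best_coarse_match_error(x, shift, stride=2):
    """Smallest max-error between subsample(shifted x) and any integer roll of subsample(x)."""
    target = subsample(np.roll(x, shift), stride)
    base = subsample(x, stride)
    return min(np.max(np.abs(target - np.roll(base, s))) for s in range(len(base)))

rng = np.random.default_rng(0)
x = rng.standard_normal(16)
k = np.array([0.25, 0.5, 0.25])  # hypothetical smoothing kernel

# Convolution commutes with shifting: conv(roll(x)) == roll(conv(x)).
conv_err = np.max(np.abs(circ_conv(np.roll(x, 1), k) - np.roll(circ_conv(x, k), 1)))

# Stride-2 subsampling: an even shift maps to an integer shift on the coarse grid,
# but a 1-sample shift does not, so equivariance breaks.
even_err = best_coarse_match_error(x, shift=2)
odd_err = best_coarse_match_error(x, shift=1)
print(f"conv: {conv_err:.1e}  even shift: {even_err:.1e}  odd shift: {odd_err:.2f}")
```

Anti-aliased designs typically low-pass filter before downsampling so that sub-grid shifts change the output smoothly instead of arbitrarily.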

Related Videos

The latent space of earlier generative models like GANs can linearly encode concepts of the data. What if the data were model weights? We present weights2weights, a subspace in diffusion weights that behaves as an interpretable latent space over customized diffusion models.

Amil Dravid

94,226 views • 2 years ago

Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models "Block diffusion sequentially generates blocks of tokens by performing diffusion within each block and conditioning on previous blocks. By combining strength from autoregressive and diffusion models, block diffusion overcomes the limitations of both approaches by supporting variable-length, higher-quality generation and improving inference efficiency with KV caching and parallel sampling."

Tanishq Mathew Abraham, Ph.D.

21,813 views • 1 year ago
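
As a rough illustration of the generation order described in the quote, the sketch below fills one block at a time. Each block starts fully masked and is unmasked over a few steps, and each finished block is appended to the context for the next one. `toy_denoiser`, `MASK`, and the even unmasking schedule are hypothetical stand-ins, not the paper's model or schedule. The KV caching and parallel sampling mentioned in the quote are not modeled.

```python
import numpy as np

MASK = -1  # hypothetical mask token id

def toy_denoiser(context, block, rng, vocab_size):
    """Stand-in for a learned model: proposes a token for every position.
    A real model would condition on `context` (previous blocks) and on the
    partially unmasked `block`; this toy ignores both and samples uniformly."""
    return rng.integers(0, vocab_size, size=len(block))

def generate_block_diffusion(num_blocks, block_size, steps, vocab_size=100, seed=0):
    rng = np.random.default_rng(seed)
    context = np.array([], dtype=int)
    for _ in range(num_blocks):
        block = np.full(block_size, MASK)  # each block starts fully masked
        for step in range(steps):
            masked = np.flatnonzero(block == MASK)
            if masked.size == 0:
                break
            proposal = toy_denoiser(context, block, rng, vocab_size)
            # Reveal an equal share of the remaining masked positions; on the
            # last step this reveals everything that is still masked.
            n_reveal = int(np.ceil(masked.size / (steps - step)))
            reveal = rng.choice(masked, size=n_reveal, replace=False)
            block[reveal] = proposal[reveal]
        # Finished blocks become fixed context for the next block (autoregressive part).
        context = np.concatenate([context, block])
    return context
```

For example, `generate_block_diffusion(num_blocks=3, block_size=4, steps=2)` returns 12 token ids with no masks left. The output length is not fixed by one denoising pass, because more blocks can be appended.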

NVIDIA's PiD is a new pixel diffusion decoder for high-res image models. It skips the decode-then-upscale step, producing sharper outputs faster.
> Directly generates 4K images
> Up to 5.9x faster than SeedVR2
> Free & open weights

⚡AI Search⚡

12,372 views • 11 days ago

ByteDance open-sources SOTA lip-sync model LatentSync: Audio Conditioned Latent Diffusion Models for Lip Sync

青龍聖者

136,491 views • 1 year ago

🚨New paper! Generative models are often “miscalibrated”. We calibrate diffusion models, LLMs, and more to meet desired distributional properties. E.g. we finetune protein models to better match the diversity of natural proteins.

Brian L Trippe

20,409 views • 7 months ago

Diffusion models have amazing image creation abilities. But how well does their generative knowledge transfer to discriminative tasks? We present Diffusion Classifier: strong classification results with pretrained conditional diffusion models, *with no additional training*! 1/9

Alex Li

95,304 views • 2 years ago
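
The classification idea can be sketched as follows, assuming the common formulation: pick the class whose conditional noise prediction best reconstructs the added noise, averaged over random timesteps and noise draws. To keep the example self-contained, the learned diffusion model is replaced by the exact noise predictor for toy Gaussian class-conditional data, x0 ~ N(mu_c, s2·I). The uniform draw of `alpha_bar` is a hypothetical schedule. The same (alpha_bar, noise) samples are reused for every class so that the comparison is paired.

```python
import numpy as np

def eps_pred(x_t, alpha_bar, mu_c, s2):
    """Exact E[eps | x_t, c] for toy data x0 ~ N(mu_c, s2*I), where
    x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    var_t = alpha_bar * s2 + (1.0 - alpha_bar)  # marginal variance of x_t
    return np.sqrt(1.0 - alpha_bar) / var_t * (x_t - np.sqrt(alpha_bar) * mu_c)

def diffusion_classify(x0, class_means, s2=0.1, n_samples=64, seed=0):
    """Return (argmin class, per-class mean noise-prediction error)."""
    rng = np.random.default_rng(seed)
    alpha_bars = rng.uniform(0.05, 0.95, size=n_samples)  # hypothetical timestep draw
    noises = rng.standard_normal((n_samples, x0.shape[0]))
    errors = []
    for mu_c in class_means:
        err = 0.0
        # Reuse the same (alpha_bar, eps) pairs for every class (paired comparison).
        for ab, eps in zip(alpha_bars, noises):
            x_t = np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps
            err += np.sum((eps - eps_pred(x_t, ab, mu_c, s2)) ** 2)
        errors.append(err / n_samples)
    return int(np.argmin(errors)), errors
```

With a real model, `eps_pred` would be a forward pass of the pretrained conditional network. The cost then grows with the number of classes times the number of samples, which is the main practical drawback of this approach.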

(1/2) Check out "𝐏𝐨𝐥𝐲𝐃𝐢𝐟𝐟: Generating 3D Polygonal Meshes with Diffusion Models"! Our model operates directly on the polygons of 3D meshes and generates novel shapes as output through an iterative diffusion process.

Matthias Niessner

57,912 views • 2 years ago

Wonderland: Navigating 3D Scenes from a Single Image

Contributions:
• First, we introduce a representation for controllable 3D generation by leveraging the generative priors from camera-guided video diffusion models. Unlike image models, video diffusion models are trained on extensive video datasets. This enables them to capture comprehensive spatial relationships within scenes across multiple views and embed a form of "3D awareness" in their latent space, which allows us to maintain 3D consistency in novel view synthesis.
• Second, to achieve controllable novel view generation, we empower video models with precise control over specified camera motions. We introduce a novel dual-branch conditioning mechanism that effectively incorporates desired diverse camera trajectories into the video diffusion model. This enables expansion of a single image into a multi-view consistent capture of a 3D scene with precise pose control.
• Third, to achieve efficient 3D reconstruction, we directly transform video latents into 3DGS. We propose a novel latent-based large reconstruction model (LaLRM) that lifts video latents to 3D in a feed-forward manner. With this design, during inference, our model directly predicts 3DGS from a single input image, effectively aligning the generation and reconstruction tasks—and bridging image space and 3D space—through the video latent space. Compared with reconstructing scenes from images, the video latent space offers a 256× spatial-temporal reduction while retaining essential and consistent 3D structural details. Such a high degree of compression is crucial, as it allows the LaLRM to handle a wider range of 3D scenes within the reconstruction framework, with the same memory constraints.

MrNeRF

52,801 views • 1 year ago

Introducing 📦𝗔𝗿𝘁𝗶𝗟𝗮𝘁𝗲𝗻𝘁🔧 (SIGGRAPH Asia 2025) — a high-quality 3D diffusion model that explicitly models object articulation, paving the way for richer, more realistic assets in embodied AI and simulation:
– Generates fully articulated 3D objects
– Physically plausible joints & motion
– High-fidelity 3D Gaussian appearance
– Supports generation from a single real image
arXiv:
Project:
Code (coming soon):

Xingang Pan

11,473 views • 6 months ago

We introduce TurboEdit -- simple text-based image editing in 1/2 sec! We leverage few-step diffusion models, mapping real images into noise with an encoder. Please see our #ECCV2024 paper: Work w/ Zongze Wu, Nick Kolkin, Jon Brandt, Eli Shechtman 1/

Richard Zhang

39,306 views • 1 year ago

This symmetric diffusion paper at ICLR is nice (simple idea in retrospect): SymmCD: Symmetry-Preserving Crystal Generation with Diffusion Models We'd actually implemented this idea internally at Orbital, and it works nicely even for very large crystal structures:

Mark Neumann

18,188 views • 1 year ago

1/ Happy to share VADER: Video Diffusion Alignment via Reward Gradients. We adapt foundational video diffusion models using pre-trained reward models to generate high-quality, aligned videos for various end-applications. Below is a short movie we generated using VADER 😀; we used ChatGPT to write the script and an off-the-shelf AI music generator to generate the sound. Our code & weights are open-sourced:

Mihir Prabhudesai

13,330 views • 1 year ago

Google is making progress on their diffusion models... It's now as good as Gemini 2.0 Flash Lite. The writing is on the wall: a majority of language AI use in the future will be diffusion models.

Carlos E. Perez

193,952 views • 5 months ago

I made a tool called Diffusion Explorer that lets you train and visualize simple 2D diffusion and flow models live in the browser. You can draw your own distributions and observe how the generated samples converge during training. Try it live 👇

Alec Helbling

73,076 views • 1 year ago

Stochastic and deterministic sampling strategies for diffusion models produce strikingly different trajectories, but both ultimately achieve the same aim. I had a great time presenting our work, Diffusion Explorer, this week at IEEE VIS in Vienna.

Alec Helbling

278,997 views • 6 months ago
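
The claim that different samplers reach the same distribution can be checked on a toy problem where everything is closed-form. The sketch below assumes 1D Gaussian data N(0, s2) and a variance-exploding noise process x_sigma = x0 + sigma·z, an assumption that is not necessarily what Diffusion Explorer uses. The deterministic sampler follows the exact probability-flow map between noise levels (x scales by sqrt of the variance ratio). The stochastic sampler draws from the exact Gaussian reverse transition. Their trajectories differ, but both end with variance close to s2.

```python
import numpy as np

def sample(n, s2=0.25, sigma_max=10.0, steps=50, stochastic=True, seed=0):
    """Return an array of shape (steps + 1, n) holding every intermediate state."""
    rng = np.random.default_rng(seed)
    sigmas = np.linspace(sigma_max, 0.0, steps + 1)
    x = rng.standard_normal(n) * np.sqrt(s2 + sigma_max**2)  # start at the noisy marginal
    traj = [x.copy()]
    for sig_hi, sig_lo in zip(sigmas[:-1], sigmas[1:]):
        b = s2 + sig_hi**2  # marginal variance at the current noise level
        a = s2 + sig_lo**2  # marginal variance at the next (lower) noise level
        if stochastic:
            # Exact reverse transition p(x_lo | x_hi) for jointly Gaussian x_lo, x_hi:
            # Cov(x_lo, x_hi) = a, so mean = (a / b) * x and variance = a - a^2 / b.
            x = (a / b) * x + np.sqrt(a * (b - a) / b) * rng.standard_normal(n)
        else:
            # Probability-flow ODE dx/dsigma = sigma * x / (s2 + sigma^2),
            # whose solution gives x proportional to sqrt(s2 + sigma^2).
            x = np.sqrt(a / b) * x
        traj.append(x.copy())
    return np.array(traj)
```

Running both with the same seed and comparing `traj[-1]` shows different endpoints per sample but matching final variance. The deterministic map also preserves the ordering of samples, while the stochastic one reshuffles them along the way.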

MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation

AK

66,105 views • 6 months ago

1/ Happy to share UniDisc - Unified Multimodal Discrete Diffusion – We train a 1.5 billion parameter transformer model from scratch on 250 million image/caption pairs using a **discrete diffusion objective**. Our model has all the benefits of diffusion models but now in multimodal space! - flexible compute-quality tradeoff, zero-shot inpainting and editing, better control via classifier-free guidance and lower latency! We open source everything - our code, weights and the training dataset.

Mihir Prabhudesai

104,862 views • 1 year ago
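
The post does not spell out the objective, so the sketch below assumes the common absorbing-state ("masking") form of discrete diffusion. The forward process replaces each token with a `[MASK]` id with probability t, and the model is trained with cross-entropy on the masked positions only. `MASK_ID`, the vocabulary size, and the unweighted loss are illustrative assumptions. Practical implementations often add a time-dependent loss weight, and in a multimodal setting image and text tokens would share one sequence.

```python
import numpy as np

MASK_ID = 0  # hypothetical id reserved for [MASK]

def corrupt(tokens, t, rng):
    """Absorbing-state forward process: each token becomes MASK_ID with probability t."""
    is_masked = rng.random(tokens.shape) < t
    return np.where(is_masked, MASK_ID, tokens), is_masked

def masked_ce(logits, targets, is_masked):
    """Cross-entropy of the clean tokens, averaged over masked positions only."""
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerically stable log-softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    nll = -np.take_along_axis(log_probs, targets[..., None], axis=-1)[..., 0]
    return float(nll[is_masked].mean()) if is_masked.any() else 0.0
```

Because only masked positions contribute to the loss, the same trained model can fill in any subset of tokens at inference time. This is what makes zero-shot inpainting and editing possible in this family of models.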

Snap presents MoA: Mixture-of-Attention for Subject-Context Disentanglement in Personalized Image Generation. We introduce a new architecture for personalization of text-to-image diffusion models, coined Mixture-of-Attention (MoA). Inspired by the Mixture-of-Experts

AK

47,384 views • 2 years ago

This smalldiffusion project is really fun. It is a tiny library for training and sampling from diffusion models. I love efforts like this; they are much easier to play around with than big libraries like diffusers. Link:

Alec Helbling

18,821 views • 1 year ago

Chop the gradients ✂️! We found that truncating decoder gradients in latent video diffusion to a fixed window allows us to finetune on videos with pixel-wise perceptual losses without running out of memory. Pixel losses have been essential for image generation and reconstruction, but until now, they haven't scaled to long-duration, high-resolution video diffusion due to recursive activation accumulation in causal decoders, leading to OOM during training 💥📉. Project: Video diffusion models can do a lot more 🚀 when you can backprop the decoder! Post-process neural rendered scenes, super-resolve videos, harmonize lighting in controlled synthetic driving scenes, and inpaint videos — all in a single step ⚡ with a quick finetune from a standard diffusion model.

Felix Heide

28,282 views • 1 month ago