Загрузка видео...

Не удалось загрузить видео

Возникла проблема при загрузке этого видео. Это может быть связано с временными проблемами сети или видео может быть недоступно.

На главную

Chop the gradients ✂️! We found that truncating decoder gradients in latent video diffusion to a fixed window allows us to finetune on videos with pixel-wise perceptual losses without running out of memory. Pixel losses have been essential for image generation and reconstruction, but until now, they haven't scaled... to long-duration, high-resolution video diffusion due to recursive activation accumulation in causal decoders, leading to OOM during training 💥📉. Project: Video diffusion models can do a lot more 🚀 when you can backprop the decoder! Post-process neural rendered scenes, super-resolve videos, harmonize lighting in controlled synthetic driving scenes, and inpaint videos — all in a single step ⚡ with a quick finetune from a standard diffusion model.show more

Felix Heide

1,951 subscribers

28,323 просмотров • 2 месяцев назад •via X (Twitter)

Anya Rossi• Live Now

Private livecam show

Комментарии: 0

Нет доступных комментариев

Здесь появятся комментарии из оригинального поста

Похожие видео

NVIDIA just released a very impressive text-to-video paper. Video Latent Diffusion Models (Video LDMs) use a diffusion model in a compressed latent space to generate high-resolution videos. Here's a brief overview of how it works: 1. Pre-train image LDM on a dataset of images. 2. Turn the image LDM into a Video LDM by adding temporal layers to model video frames. 3. Fine-tune the Video LDM on encoded video sequences to create a video generator. 4. Temporally align diffusion model upsamplers to generate high-resolution videos. 5. Validate Video LDM on real driving videos of 512x1024 resolution, achieving state-of-the-art performance. 6. Apply the approach in creative content creation with text-to-video modeling. Paper: Project:

NVIDIA just released a very impressive text-to-video paper. Video Latent Diffusion Models (Video LDMs) use a diffusion model in a compressed latent space to generate high-resolution videos. Here's a brief overview of how it works: 1. Pre-train image LDM on a dataset of images. 2. Turn the image LDM into a Video LDM by adding temporal layers to model video frames. 3. Fine-tune the Video LDM on encoded video sequences to create a video generator. 4. Temporally align diffusion model upsamplers to generate high-resolution videos. 5. Validate Video LDM on real driving videos of 512x1024 resolution, achieving state-of-the-art performance. 6. Apply the approach in creative content creation with text-to-video modeling. Paper: Project:

Lior Alexander

158,545 просмотров • 3 лет назад

Selected as a best paper finalist at #CVPR2026: PixelDiT from NVIDIA Research In most image generation models, a pretrained autoencoder compresses the image before any diffusion happens, causing quality loss that accumulates across the entire pipeline. PixelDiT, or Pixel Diffusion Transformers, removes this step entirely. It's a single-stage model that learns the diffusion process directly in pixel space, end-to-end.

Selected as a best paper finalist at #CVPR2026: PixelDiT from NVIDIA Research In most image generation models, a pretrained autoencoder compresses the image before any diffusion happens, causing quality loss that accumulates across the entire pipeline. PixelDiT, or Pixel Diffusion Transformers, removes this step entirely. It's a single-stage model that learns the diffusion process directly in pixel space, end-to-end.

NVIDIA AI

27,469 просмотров • 21 дней назад

Wonderland: Navigating 3D Scenes from a Single Image Contributions: • First, we introduce a representation for controllable 3D generation by leveraging the generative priors from camera-guided video diffusion models. Unlike image models, video diffusion models are trained on extensive video datasets. This enables them to capture comprehensive spatial relationships within scenes across multiple views and embed a form of "3D awareness" in their latent space, which allows us to maintain 3D consistency in novel view synthesis. • Second, to achieve controllable novel view generation, we empower video models with precise control over specified camera motions. We introduce a novel dual-branch conditioning mechanism that effectively incorporates desired diverse camera trajectories into the video diffusion model. This enables expansion of a single image into a multi-view consistent capture of a 3D scene with precise pose control. • Third, to achieve efficient 3D reconstruction, we directly transform video latents into 3DGS. We propose a novel latent-based large reconstruction model (LaLRM) that lifts video latents to 3D in a feed-forward manner. With this design, during inference, our model directly predicts 3DGS from a single input image, effectively aligning the generation and reconstruction tasks—and bridging image space and 3D space—through the video latent space. Compared with reconstructing scenes from images, the video latent space offers a 256× spatial-temporal reduction while retaining essential and consistent 3D structural details. Such a high degree of compression is crucial, as it allows the LaLRM to handle a wider range of 3D scenes within the reconstruction framework, with the same memory constraints.

Wonderland: Navigating 3D Scenes from a Single Image Contributions: • First, we introduce a representation for controllable 3D generation by leveraging the generative priors from camera-guided video diffusion models. Unlike image models, video diffusion models are trained on extensive video datasets. This enables them to capture comprehensive spatial relationships within scenes across multiple views and embed a form of "3D awareness" in their latent space, which allows us to maintain 3D consistency in novel view synthesis. • Second, to achieve controllable novel view generation, we empower video models with precise control over specified camera motions. We introduce a novel dual-branch conditioning mechanism that effectively incorporates desired diverse camera trajectories into the video diffusion model. This enables expansion of a single image into a multi-view consistent capture of a 3D scene with precise pose control. • Third, to achieve efficient 3D reconstruction, we directly transform video latents into 3DGS. We propose a novel latent-based large reconstruction model (LaLRM) that lifts video latents to 3D in a feed-forward manner. With this design, during inference, our model directly predicts 3DGS from a single input image, effectively aligning the generation and reconstruction tasks—and bridging image space and 3D space—through the video latent space. Compared with reconstructing scenes from images, the video latent space offers a 256× spatial-temporal reduction while retaining essential and consistent 3D structural details. Such a high degree of compression is crucial, as it allows the LaLRM to handle a wider range of 3D scenes within the reconstruction framework, with the same memory constraints.

MrNeRF

52,801 просмотров • 1 год назад

DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion TL;DR: Create 3/4DGS from Video Diffusion Note: Some first inference code released (not all yet). Contributions (cited): • We present DimensionX, a novel framework for generating photorealistic 3D and 4D scenes from only a single image using controllable video diffusion. • We propose ST-Director, which decouples the spatial and temporal priors in video diffusion models by learning (spatial and temporal) dimension-aware modules with our curated datasets. We further enhance the hybriddimension control with a training-free composition approach according to the essence of video diffusion denoising process. • To bridge the gap between video diffusion and real-world scenes, we design a trajectory-aware mechanism for 3D generation and an identity-preserving denoising approach for 4D generation, enabling more realistic and controllable scene synthesis. • Extensive experiments manifest that our DimensionX delivers superior performance in video, 3D, and 4D generation compared with baseline methods.

DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion TL;DR: Create 3/4DGS from Video Diffusion Note: Some first inference code released (not all yet). Contributions (cited): • We present DimensionX, a novel framework for generating photorealistic 3D and 4D scenes from only a single image using controllable video diffusion. • We propose ST-Director, which decouples the spatial and temporal priors in video diffusion models by learning (spatial and temporal) dimension-aware modules with our curated datasets. We further enhance the hybriddimension control with a training-free composition approach according to the essence of video diffusion denoising process. • To bridge the gap between video diffusion and real-world scenes, we design a trajectory-aware mechanism for 3D generation and an identity-preserving denoising approach for 4D generation, enabling more realistic and controllable scene synthesis. • Extensive experiments manifest that our DimensionX delivers superior performance in video, 3D, and 4D generation compared with baseline methods.

MrNeRF

17,037 просмотров • 1 год назад

High-resolution image and video generation is hitting a wall because attention in DiTs scales quadratically with token count. But does every pixel need to be in full resolution? Introducing Foveated Diffusion: a new approach for efficient diffusion-based generation that allocates compute where it matters most. 1/7🧵

High-resolution image and video generation is hitting a wall because attention in DiTs scales quadratically with token count. But does every pixel need to be in full resolution? Introducing Foveated Diffusion: a new approach for efficient diffusion-based generation that allocates compute where it matters most. 1/7🧵

Gordon Wetzstein

163,340 просмотров • 3 месяцев назад

1/ Happy to share VADER: Video Diffusion Alignment via Reward Gradients. We adapt foundational video diffusion models using pre-trained reward models to generate high-quality, aligned videos for various end-applications. Below we generated a short movie using VADER 😀, we used ChatGPT to write a script and an off-the-shelf AI music generator to generate the sound. Our code & weights are open-sourced:

1/ Happy to share VADER: Video Diffusion Alignment via Reward Gradients. We adapt foundational video diffusion models using pre-trained reward models to generate high-quality, aligned videos for various end-applications. Below we generated a short movie using VADER 😀, we used ChatGPT to write a script and an off-the-shelf AI music generator to generate the sound. Our code & weights are open-sourced:

Mihir Prabhudesai

13,368 просмотров • 1 год назад

Photorealistic Object Insertion with Diffusion-Guided Inverse Rendering discuss: The correct insertion of virtual objects in images of real-world scenes requires a deep understanding of the scene's lighting, geometry and materials, as well as the image formation process. While recent large-scale diffusion models have shown strong generative and inpainting capabilities, we find that current models do not sufficiently "understand" the scene shown in a single picture to generate consistent lighting effects (shadows, bright reflections, etc.) while preserving the identity and details of the composited object. We propose using a personalized large diffusion model as guidance to a physically based inverse rendering process. Our method recovers scene lighting and tone-mapping parameters, allowing the photorealistic composition of arbitrary virtual objects in single frames or videos of indoor or outdoor scenes. Our physically based pipeline further enables automatic materials and tone-mapping refinement.

Photorealistic Object Insertion with Diffusion-Guided Inverse Rendering discuss: The correct insertion of virtual objects in images of real-world scenes requires a deep understanding of the scene's lighting, geometry and materials, as well as the image formation process. While recent large-scale diffusion models have shown strong generative and inpainting capabilities, we find that current models do not sufficiently "understand" the scene shown in a single picture to generate consistent lighting effects (shadows, bright reflections, etc.) while preserving the identity and details of the composited object. We propose using a personalized large diffusion model as guidance to a physically based inverse rendering process. Our method recovers scene lighting and tone-mapping parameters, allowing the photorealistic composition of arbitrary virtual objects in single frames or videos of indoor or outdoor scenes. Our physically based pipeline further enables automatic materials and tone-mapping refinement.

AK

19,101 просмотров • 1 год назад

Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation paper page: Large text-to-image diffusion models have exhibited impressive proficiency in generating high-quality images. However, when applying these models to video domain, ensuring temporal consistency across video frames remains a formidable challenge. This paper proposes a novel zero-shot text-guided video-to-video translation framework to adapt image models to videos. The framework includes two parts: key frame translation and full video translation. The first part uses an adapted diffusion model to generate key frames, with hierarchical cross-frame constraints applied to enforce coherence in shapes, textures and colors. The second part propagates the key frames to other frames with temporal-aware patch matching and frame blending. Our framework achieves global style and local texture temporal consistency at a low cost (without re-training or optimization). The adaptation is compatible with existing image diffusion techniques, allowing our framework to take advantage of them, such as customizing a specific subject with LoRA, and introducing extra spatial guidance with ControlNet. Extensive experimental results demonstrate the effectiveness of our proposed framework over existing methods in rendering high-quality and temporally-coherent videos.

Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation paper page: Large text-to-image diffusion models have exhibited impressive proficiency in generating high-quality images. However, when applying these models to video domain, ensuring temporal consistency across video frames remains a formidable challenge. This paper proposes a novel zero-shot text-guided video-to-video translation framework to adapt image models to videos. The framework includes two parts: key frame translation and full video translation. The first part uses an adapted diffusion model to generate key frames, with hierarchical cross-frame constraints applied to enforce coherence in shapes, textures and colors. The second part propagates the key frames to other frames with temporal-aware patch matching and frame blending. Our framework achieves global style and local texture temporal consistency at a low cost (without re-training or optimization). The adaptation is compatible with existing image diffusion techniques, allowing our framework to take advantage of them, such as customizing a specific subject with LoRA, and introducing extra spatial guidance with ControlNet. Extensive experimental results demonstrate the effectiveness of our proposed framework over existing methods in rendering high-quality and temporally-coherent videos.

AK

375,090 просмотров • 3 лет назад

🚀New paper out - We present Video-MSG (Multimodal Sketch Guidance), a novel planning-based training-free guidance method for T2V models, improving control of spatial layout and object trajectories. 🔧 Key idea: • Generate a Video Sketch — a spatio-temporal plan with background, foreground, and motion in the pixel space. • Encode this structure directly into the latent space of the diffusion model during generation, which does not require fine-tuning or additional memory during inference. 🧵

🚀New paper out - We present Video-MSG (Multimodal Sketch Guidance), a novel planning-based training-free guidance method for T2V models, improving control of spatial layout and object trajectories. 🔧 Key idea: • Generate a Video Sketch — a spatio-temporal plan with background, foreground, and motion in the pixel space. • Encode this structure directly into the latent space of the diffusion model during generation, which does not require fine-tuning or additional memory during inference. 🧵

Jialu Li

35,060 просмотров • 1 год назад

Diffusions are excellent in creating fantastic images and videos 🔎 We cooked a *diffusion* model to synthesize structured data #ICLR2025 🔥 Introducing TabDiff, a mixed-type diffusion model for generating synthetic tabular data, imputing missing values, and beyond! 🧵 1/n

Diffusions are excellent in creating fantastic images and videos 🔎 We cooked a diffusion model to synthesize structured data #ICLR2025 🔥 Introducing TabDiff, a mixed-type diffusion model for generating synthetic tabular data, imputing missing values, and beyond! 🧵 1/n

Minkai Xu

50,434 просмотров • 1 год назад

Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models Contributions: • We introduce Diffuman4D, a novel diffusion model that generates spatio-temporally consistent and high-resolution (1024p) human videos from sparse-view video inputs. • We propose a sliding iterative denoising mechanism that enhances both the spatial and temporal consistency of generated long-term videos while maintaining efficient inference. • We design a human pose conditioning scheme to enhance the appearance quality and motion accuracy of generated human videos. • We plan to release our processed version of the DNA-Rendering dataset, which we believe will benefit future research in this area.

Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models Contributions: • We introduce Diffuman4D, a novel diffusion model that generates spatio-temporally consistent and high-resolution (1024p) human videos from sparse-view video inputs. • We propose a sliding iterative denoising mechanism that enhances both the spatial and temporal consistency of generated long-term videos while maintaining efficient inference. • We design a human pose conditioning scheme to enhance the appearance quality and motion accuracy of generated human videos. • We plan to release our processed version of the DNA-Rendering dataset, which we believe will benefit future research in this area.

MrNeRF

24,729 просмотров • 11 месяцев назад

LLaDA (the first Large Language Diffusion Model) is *just* out 💥 and I've built a demo, try out now 👨‍💻 It's mesmerizing to watch the diffusion process 🌀, and it being a diffusion model gives you superpowers like "the 4th word has to be pineapple" 🦸 Demo and weights 👇

LLaDA (the first Large Language Diffusion Model) is just out 💥 and I've built a demo, try out now 👨‍💻 It's mesmerizing to watch the diffusion process 🌀, and it being a diffusion model gives you superpowers like "the 4th word has to be pineapple" 🦸 Demo and weights 👇

apolinario (poli)

82,599 просмотров • 1 год назад

The latent space of earlier generative models like GANS can linearly encode concepts of the data. What if the data was model weights? We present weights2weights, a subspace in diffusion weights that behaves as an interpretable latent space over customized diffusion models.

The latent space of earlier generative models like GANS can linearly encode concepts of the data. What if the data was model weights? We present weights2weights, a subspace in diffusion weights that behaves as an interpretable latent space over customized diffusion models.

Amil Dravid

94,276 просмотров • 2 лет назад

1/ Happy to share UniDisc - Unified Multimodal Discrete Diffusion – We train a 1.5 billion parameter transformer model from scratch on 250 million image/caption pairs using a **discrete diffusion objective**. Our model has all the benefits of diffusion models but now in multimodal space! - flexible compute-quality tradeoff, zero-shot inpainting and editing, better control via classifier-free guidance and lower latency! We open source everything - our code, weights and the training dataset.

1/ Happy to share UniDisc - Unified Multimodal Discrete Diffusion – We train a 1.5 billion parameter transformer model from scratch on 250 million image/caption pairs using a discrete diffusion objective. Our model has all the benefits of diffusion models but now in multimodal space! - flexible compute-quality tradeoff, zero-shot inpainting and editing, better control via classifier-free guidance and lower latency! We open source everything - our code, weights and the training dataset.

Mihir Prabhudesai

104,862 просмотров • 1 год назад

How can a visuomotor policy learn from internet videos? We introduce Dreamitate, where a robot uses a fine-tuned video diffusion model to dream the future (top) and imitate the dream to accomplish a task (bottom). website: paper:

How can a visuomotor policy learn from internet videos? We introduce Dreamitate, where a robot uses a fine-tuned video diffusion model to dream the future (top) and imitate the dream to accomplish a task (bottom). website: paper:

Ruoshi Liu

50,797 просмотров • 2 лет назад

Can you make a jigsaw puzzle with two different solutions? Or an image that changes appearance when flipped? We can do that, and a lot more, by using diffusion models to generate optical illusions! Continue reading for more illusions and method details 🧵

Can you make a jigsaw puzzle with two different solutions? Or an image that changes appearance when flipped? We can do that, and a lot more, by using diffusion models to generate optical illusions! Continue reading for more illusions and method details 🧵

Daniel Geng

125,796 просмотров • 2 лет назад

I made a tool called Diffusion Explorer that lets you to train and visualize simple 2D diffusion and flow models live in the browser. You can draw your own distributions and observe how the generated samples converge during training. Try it live 👇

I made a tool called Diffusion Explorer that lets you to train and visualize simple 2D diffusion and flow models live in the browser. You can draw your own distributions and observe how the generated samples converge during training. Try it live 👇

Alec Helbling

73,103 просмотров • 1 год назад

Depth Any Video with Scalable Synthetic Data AI physicists and chemists continue to make strides in depth estimation from video. Check out this new paper featuring some impressive examples. See the thread for more details (unfortunately no code yet). Abstract: Video depth estimation has long been hindered by the scarcity of consistent and scalable ground truth data, leading to inconsistent and unreliable results. In this paper, we introduce Depth Any Video, a model that tackles the challenge through two key innovations. First, we develop a scalable synthetic data pipeline, capturing real-time video depth data from diverse game environments, yielding 40,000 video clips of 5-second duration, each with precise depth annotations. Second, we leverage the powerful priors of generative video diffusion models to handle real-world videos effectively, integrating advanced techniques such as rotary position encoding and flow matching to further enhance flexibility and efficiency. Unlike previous models, which are limited to fixed-length video sequences, our approach introduces a novel mixed-duration training strategy that handles videos of varying lengths and performs robustly across different frame rates 0 - even on single frames. At inference, we propose a depth interpolation method that enables our model to infer high-resolution video depth across sequences of up to 150 frames. Our model outperforms all previous generative depth models in terms of spatial accuracy and temporal consistency.

Depth Any Video with Scalable Synthetic Data AI physicists and chemists continue to make strides in depth estimation from video. Check out this new paper featuring some impressive examples. See the thread for more details (unfortunately no code yet). Abstract: Video depth estimation has long been hindered by the scarcity of consistent and scalable ground truth data, leading to inconsistent and unreliable results. In this paper, we introduce Depth Any Video, a model that tackles the challenge through two key innovations. First, we develop a scalable synthetic data pipeline, capturing real-time video depth data from diverse game environments, yielding 40,000 video clips of 5-second duration, each with precise depth annotations. Second, we leverage the powerful priors of generative video diffusion models to handle real-world videos effectively, integrating advanced techniques such as rotary position encoding and flow matching to further enhance flexibility and efficiency. Unlike previous models, which are limited to fixed-length video sequences, our approach introduces a novel mixed-duration training strategy that handles videos of varying lengths and performs robustly across different frame rates 0 - even on single frames. At inference, we propose a depth interpolation method that enables our model to infer high-resolution video depth across sequences of up to 150 frames. Our model outperforms all previous generative depth models in terms of spatial accuracy and temporal consistency.

MrNeRF

27,428 просмотров • 1 год назад

You can't 3D reconstruct glass from images... ...WRONG! Thanks for video diffusion, now just about anything is possible! Introducing...Diffusion Knows Transparency (DKT) Transparent and reflective objects usually break robot vision and photogrammetry pipelines because they don't follow the "solid object" rules standard cameras expect. DKT is a new AI model that repurposes the "internal physics engine" found in video generation models to solve this problem. Researchers took a massive video diffusion model (WAN) and fine-tuned it using a custom-built synthetic dataset to turn it into a high-precision depth sensor. To train the AI, they built the first massive synthetic video library of transparent objects, 1.32 million frames of perfectly labeled glass and metal objects in motion. Without ever seeing a "real" labeled video of glass during training, the model (DKT) outperformed all previous specialized systems on real-world benchmarks (ClearPose, DREDS). They created a "lightweight" 1.3B parameter version that runs fast enough (0.17s per frame) to be used on actual robot hardware. Two reasons I find this project important: 1. It further proves that synthetic data will be essential for training the next generation vision models. 2. In real-world robotic tests, using DKT's depth maps nearly doubled the success rate of robot arms trying to pick up objects on tricky reflective or translucent surfaces. At home robots will need to interact with these types of objects on a daily basis. Check out the project page here: Code is LIVE! #Computervision #Robotics #AI

You can't 3D reconstruct glass from images... ...WRONG! Thanks for video diffusion, now just about anything is possible! Introducing...Diffusion Knows Transparency (DKT) Transparent and reflective objects usually break robot vision and photogrammetry pipelines because they don't follow the "solid object" rules standard cameras expect. DKT is a new AI model that repurposes the "internal physics engine" found in video generation models to solve this problem. Researchers took a massive video diffusion model (WAN) and fine-tuned it using a custom-built synthetic dataset to turn it into a high-precision depth sensor. To train the AI, they built the first massive synthetic video library of transparent objects, 1.32 million frames of perfectly labeled glass and metal objects in motion. Without ever seeing a "real" labeled video of glass during training, the model (DKT) outperformed all previous specialized systems on real-world benchmarks (ClearPose, DREDS). They created a "lightweight" 1.3B parameter version that runs fast enough (0.17s per frame) to be used on actual robot hardware. Two reasons I find this project important: 1. It further proves that synthetic data will be essential for training the next generation vision models. 2. In real-world robotic tests, using DKT's depth maps nearly doubled the success rate of robot arms trying to pick up objects on tricky reflective or translucent surfaces. At home robots will need to interact with these types of objects on a daily basis. Check out the project page here: Code is LIVE! #Computervision #Robotics #AI

Jonathan Stephens

17,712 просмотров • 5 месяцев назад

Diffusion models are sensitive to small changes in the input noise. We introduce Alias-Free Latent Diffusion Models (𝗔𝗙-𝗟𝗗𝗠) at #CVPR2025. It achieves shift-equivariance and generates consistent outputs. Project: arXiv:

Diffusion models are sensitive to small changes in the input noise. We introduce Alias-Free Latent Diffusion Models (𝗔𝗙-𝗟𝗗𝗠) at #CVPR2025. It achieves shift-equivariance and generates consistent outputs. Project: arXiv:

Xingang Pan

42,538 просмотров • 1 год назад