High-resolution image and video generation is hitting a wall because attention in DiTs scales quadratically with token count. But does every pixel need to be in full resolution? Introducing Foveated Diffusion: a new approach for efficient diffusion-based generation that allocates compute where it matters most. 1/7🧵

Gordon Wetzstein

5,206 subscribers

163,139 views • 2 months ago • via X (Twitter)
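To make the quadratic-scaling claim in the post concrete, here is a minimal sketch of the arithmetic. It assumes one token per 16×16 patch and a single full self-attention layer; the patch size and the function name `attention_cost` are illustrative choices, not details from the post or the paper.

```python
# Illustrative arithmetic only: the 16-pixel patch size is an assumption,
# not a value taken from the post.
def attention_cost(height: int, width: int, patch: int = 16) -> tuple[int, int]:
    """Return (token count, query-key pairs) for one full self-attention layer."""
    tokens = (height // patch) * (width // patch)
    # Every token attends to every token, so the pair count is tokens squared.
    return tokens, tokens * tokens


if __name__ == "__main__":
    for side in (512, 1024, 2048):
        tokens, pairs = attention_cost(side, side)
        print(f"{side}x{side}: {tokens} tokens, {pairs} query-key pairs")
```

Under these assumptions, doubling the image side multiplies the token count by 4 and the number of query-key pairs by 16, which is the wall the post refers to.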

Related videos

Selected as a best paper finalist at #CVPR2026: PixelDiT from NVIDIA Research In most image generation models, a pretrained autoencoder compresses the image before any diffusion happens, causing quality loss that accumulates across the entire pipeline. PixelDiT, or Pixel Diffusion Transformers, removes this step entirely. It's a single-stage model that learns the diffusion process directly in pixel space, end-to-end.

NVIDIA AI

26,730 views • 12 days ago

Chop the gradients ✂️! We found that truncating decoder gradients in latent video diffusion to a fixed window allows us to finetune on videos with pixel-wise perceptual losses without running out of memory. Pixel losses have been essential for image generation and reconstruction, but until now, they haven't scaled to long-duration, high-resolution video diffusion due to recursive activation accumulation in causal decoders, leading to OOM during training 💥📉. Project: Video diffusion models can do a lot more 🚀 when you can backprop the decoder! Post-process neural rendered scenes, super-resolve videos, harmonize lighting in controlled synthetic driving scenes, and inpaint videos — all in a single step ⚡ with a quick finetune from a standard diffusion model.

Felix Heide

28,323 views • 2 months ago

NVIDIA just released a very impressive text-to-video paper. Video Latent Diffusion Models (Video LDMs) use a diffusion model in a compressed latent space to generate high-resolution videos. Here's a brief overview of how it works: 1. Pre-train image LDM on a dataset of images. 2. Turn the image LDM into a Video LDM by adding temporal layers to model video frames. 3. Fine-tune the Video LDM on encoded video sequences to create a video generator. 4. Temporally align diffusion model upsamplers to generate high-resolution videos. 5. Validate Video LDM on real driving videos of 512x1024 resolution, achieving state-of-the-art performance. 6. Apply the approach in creative content creation with text-to-video modeling. Paper: Project:

Lior Alexander

158,539 views • 3 years ago

DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion TL;DR: Create 3/4DGS from Video Diffusion Note: Some first inference code released (not all yet). Contributions (cited): • We present DimensionX, a novel framework for generating photorealistic 3D and 4D scenes from only a single image using controllable video diffusion. • We propose ST-Director, which decouples the spatial and temporal priors in video diffusion models by learning (spatial and temporal) dimension-aware modules with our curated datasets. We further enhance the hybrid-dimension control with a training-free composition approach according to the essence of video diffusion denoising process. • To bridge the gap between video diffusion and real-world scenes, we design a trajectory-aware mechanism for 3D generation and an identity-preserving denoising approach for 4D generation, enabling more realistic and controllable scene synthesis. • Extensive experiments manifest that our DimensionX delivers superior performance in video, 3D, and 4D generation compared with baseline methods.

MrNeRF

17,028 views • 1 year ago

Snap presents MoA Mixture-of-Attention for Subject-Context Disentanglement in Personalized Image Generation We introduce a new architecture for personalization of text-to-image diffusion models, coined Mixture-of-Attention (MoA). Inspired by the Mixture-of-Experts

AK

47,384 views • 2 years ago

🆕 Image super-resolution model just dropped! Superior results even with a single sampling step. 🔥InvSR: Arbitrary-steps Image Super-resolution via Diffusion Inversion.

Gradio

53,331 views • 1 year ago

The era of ultra-high-resolution imaging has arrived. Modern image sensors exceeding 200 MP resolution are common in smartphones, with over 400 MP sensors under development. However, the large number of pixels poses significant challenges for acquisition and processing, especially on edge devices. Which pixels should be acquired, and when, for bandwidth-efficient imaging and perception? We introduce Policy-based Foveated Imaging and Perception, an on-device, real-time, predictive, and task-aware framework that dynamically allocates sensor resolution to prioritize important regions under specific perception objectives. This paper will be presented at #SIGGRAPH2026! [1/6]

Gordon Wetzstein

17,978 views • 15 days ago

🚨Announcing our #ICLR2025 Oral! 🔥Diffusion LMs are on the rise for parallel text generation! But unlike autoregressive LMs, they struggle with quality, fixed-length constraints & lack of KV caching. 🚀Introducing Block Diffusion—combining autoregressive and diffusion models for the best of both worlds! 👇1/7

Marianne Arriola

148,599 views • 1 year ago

🚀 Introducing GenLit – Reformulating Single-Image Relighting as Video Generation! We leverage video diffusion models to perform realistic near-field relighting from just a single image—No explicit 3D reconstruction or ray tracing required! No intermediate graphics buffers, directly in the pixel space! 📄 Dive into the paper: 🎥 Project page & demos: 🛠 Code coming soon! #GenerativeAI #ComputerVision #Relighting #DiffusionModels #Graphics 🧵 1/5

Haven Feng @ CVPR

22,427 views • 1 year ago

Switti -- a new scale-wise transformer for text-to-image generation 🦾 🔥 Improved generation of fine-grained details. Outperforms existing T2I AR models and competes with state-of-the-art T2I diffusion models while being up to 7x faster.

Gradio

29,267 views • 1 year ago

Introducing ⇆ Marigold-DC — our training-free zero-shot approach to monocular Depth Completion with guided diffusion! If you have ever wondered how else a long denoising diffusion schedule can be useful, we have an answer for you! Details 🧵

Anton Obukhov

39,295 views • 1 year ago

🎥 Video generation is hitting the memory wall. As videos get longer, the KV cache quietly explodes — and long-horizon consistency starts to break. We built Quant VideoGen: a training-free KV cache compression method for auto-regressive video diffusion. Instead of storing every KV in high precision, QVG exploits video’s spatiotemporal redundancy with semantic-aware smoothing + progressive residual quantization. 🚀 Up to 7× KV memory reduction ⚡ <4% overhead ✅ Strong long-video quality 🕹️ Deploy HYWorldPlay on your own RTX 5090 locally KV compression is becoming a core scaling primitive — not just for LLMs, but for video generation too. Paper: Code: (1/5)

Haocheng Xi

64,106 views • 1 month ago

Excited to share our new work "AvatarPopUp"! "Instant 3D Human Avatar Generation using Image Diffusion Models" 💻 📄

Nikos Kolotouros

14,084 views • 2 years ago

Most recent diffusion language model research (that I’ve seen) seems to be using masking as the noising process. It looks like, however, most closed-source models (Google Gemini Diffusion and possibly Inception Labs’ Mercury) use a different noising process, where instead of masking tokens, they replace them with different tokens (either with a random token or a semantically similar token). I wondered how they were getting such high throughput with the latter noising process, since I believed that optimizing inference with KVCache approximation would be more difficult (for various reasons). I visualized this noising process with tiny-diffusion and compared it to normal unmasking, and was very surprised to see how fast the generation “settles” into a reasonable output, and then only slightly refines afterwards, requiring much fewer steps in total. Unmasking (where tokens are never remasked, the typical implementation) is inherently limited in generation speed by the fact that an increase in tokens decoded per step leads to more errors due to the mismatch between individual and marginal token probability distributions we sample from. The token replacement noising process seems to have a much different set of characteristics. Because we sample each token per step, every token makes “progress” towards the final output each iteration (in addition to *potentially* giving other tokens more information in future steps). Generally, masking has outperformed other noising processes, which is probably why most research focused on it (using smaller models). But the paper referred to in the retweet shows that random replacement as a noising process may scale better as model size increases. Big labs might have noticed these results much earlier (due to having drastically more training resources and being able to test larger models), which may explain the discrepancy in the choice of noising process. 
I’m gonna test this with larger models, since tiny-diffusion only has 10M parameters.

Nathan Barry

40,331 views • 5 months ago
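The two noising processes contrasted in the post above can be sketched in a few lines. This is a toy illustration, not the tiny-diffusion code or any lab's implementation: the `[MASK]` symbol, the uniform choice of replacement token, and the function names are all assumptions made for the example.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol, chosen for this sketch


def mask_noise(tokens, rate, rng):
    """Masking: each token is hidden behind MASK with probability `rate`."""
    return [MASK if rng.random() < rate else tok for tok in tokens]


def replace_noise(tokens, rate, vocab, rng):
    """Replacement: with probability `rate`, swap a token for a uniformly
    random vocabulary token (which may happen to equal the original)."""
    return [rng.choice(vocab) if rng.random() < rate else tok for tok in tokens]


if __name__ == "__main__":
    rng = random.Random(0)
    sentence = "the cat sat on the mat".split()
    vocab = sorted(set(sentence)) + ["dog", "ran"]
    print(mask_noise(sentence, 0.5, rng))
    print(replace_noise(sentence, 0.5, vocab, rng))
```

With masking, the model can see exactly which positions are corrupted. With replacement, every position looks like an ordinary token, so the model has to re-evaluate all of them at each step, which matches the post's observation that every token is resampled per iteration.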

NVidia just released PiD: super resolution in pixel space directly from model latents 🔎 4X resolution for any generated image, FAST! 🏎️💨 FLUX.1, 2 and Z-Image (Qwen Image coming) of course, i built a demo: generate 4K images with Z-Image

apolinario (poli)

29,948 views • 23 days ago

My first PhD paper!🎉We learn *diffusion* models for code generation that learn to directly *edit* syntax trees of programs. The result is a system that can incrementally write code, see the execution output, and debug it. 🧵1/n

Shreyas Kapur

742,394 views • 2 years ago

🚀New paper out - We present Video-MSG (Multimodal Sketch Guidance), a novel planning-based training-free guidance method for T2V models, improving control of spatial layout and object trajectories. 🔧 Key idea: • Generate a Video Sketch — a spatio-temporal plan with background, foreground, and motion in the pixel space. • Encode this structure directly into the latent space of the diffusion model during generation, which does not require fine-tuning or additional memory during inference. 🧵

Jialu Li

35,060 views • 1 year ago

Diffusions are excellent in creating fantastic images and videos 🔎 We cooked a *diffusion* model to synthesize structured data #ICLR2025 🔥 Introducing TabDiff, a mixed-type diffusion model for generating synthetic tabular data, imputing missing values, and beyond! 🧵 1/n

Minkai Xu

50,434 views • 1 year ago

Google dropped a new AI paper called LUMIERE. It's remarkably flexible, supporting video inpainting, image-to-video, AND stylized video generation tasks. Say hello to “space-time diffusion” for video generation! Now what the heck does that mean exactly?! 🌐⏳ → TL;DR it utilizes a “Space-Time UNet” architecture that generates the full duration of the video in one pass, rather than generating distant keyframes and interpolating between them like prior works. Because the computation is done in this “compressed space-time representation” to generate the full clip at once, it's far more temporally consistent. → Another benefit of generating the full video at once is that you can “direct” the video generation, making it easier to hand off to other models/tasks without having to stitch together partial solutions. You can condition generations on additional inputs, meaning you get the full stack of AI video capabilities – from video inpainting to image-to-video and beyond. → New SOTA for AI video generation? User study results in the paper suggest human evaluators preferred Lumiere over Runway Gen-2, Pika Labs, and Stable Video Diffusion in terms of quality, text alignment AND motion. But as always, we need to get hands-on with this tech when Google *actually* decides to ship it. → Could this end up inside YouTube? Y’all know i’m obsessed with blending reality and imagination – so it’s the video inpainting tech I'm most excited about. I really hope this model finds its way into YouTube's Generative AI efforts, and based on their prior announcements and the list of acknowledgments in the paper I think it might! 🤞🏽 Links: 🔗Paper: 🔗Project:

Bilawal Sidhu

44,816 views • 2 years ago

Diffusion models have amazing image creation abilities. But how well does their generative knowledge transfer to discriminative tasks? We present Diffusion Classifier: strong classification results with pretrained conditional diffusion models, *with no additional training*! 1/9

Alex Li

95,304 views • 2 years ago