Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models
Contributions:
• We introduce Diffuman4D, a novel diffusion model that generates spatio-temporally consistent and high-resolution (1024p) human videos from sparse-view video inputs.
• We propose a sliding iterative denoising mechanism that enhances both the spatial and...

MrNeRF

16,728 subscribers

24,729 views • 10 months ago • via X (Twitter)

Science & Technology, Education

Similar Videos

DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion
TL;DR: Create 3/4DGS from Video Diffusion
Note: Some first inference code released (not all yet).
Contributions (cited):
• We present DimensionX, a novel framework for generating photorealistic 3D and 4D scenes from only a single image using controllable video diffusion.
• We propose ST-Director, which decouples the spatial and temporal priors in video diffusion models by learning (spatial and temporal) dimension-aware modules with our curated datasets. We further enhance the hybrid-dimension control with a training-free composition approach according to the essence of video diffusion denoising process.
• To bridge the gap between video diffusion and real-world scenes, we design a trajectory-aware mechanism for 3D generation and an identity-preserving denoising approach for 4D generation, enabling more realistic and controllable scene synthesis.
• Extensive experiments manifest that our DimensionX delivers superior performance in video, 3D, and 4D generation compared with baseline methods.

MrNeRF

17,017 views • 1 year ago

Wonderland: Navigating 3D Scenes from a Single Image
Contributions:
• First, we introduce a representation for controllable 3D generation by leveraging the generative priors from camera-guided video diffusion models. Unlike image models, video diffusion models are trained on extensive video datasets. This enables them to capture comprehensive spatial relationships within scenes across multiple views and embed a form of "3D awareness" in their latent space, which allows us to maintain 3D consistency in novel view synthesis.
• Second, to achieve controllable novel view generation, we empower video models with precise control over specified camera motions. We introduce a novel dual-branch conditioning mechanism that effectively incorporates desired diverse camera trajectories into the video diffusion model. This enables expansion of a single image into a multi-view consistent capture of a 3D scene with precise pose control.
• Third, to achieve efficient 3D reconstruction, we directly transform video latents into 3DGS. We propose a novel latent-based large reconstruction model (LaLRM) that lifts video latents to 3D in a feed-forward manner. With this design, during inference, our model directly predicts 3DGS from a single input image, effectively aligning the generation and reconstruction tasks, and bridging image space and 3D space, through the video latent space. Compared with reconstructing scenes from images, the video latent space offers a 256× spatial-temporal reduction while retaining essential and consistent 3D structural details. Such a high degree of compression is crucial, as it allows the LaLRM to handle a wider range of 3D scenes within the reconstruction framework, with the same memory constraints.

MrNeRF

52,801 views • 1 year ago

How do we create realistic models of dressed humans directly from visual data? We introduce PhysAvatar, a framework that estimates the shape, appearance, and physical parameters of dressed human avatars from multi-view videos. Page: (1/6)

Qingqing Zhao

66,926 views • 2 years ago

SplatVoxel: History-Aware Novel View Streaming without Temporal Training
Contributions:
• We propose a hybrid Splat-Voxel feed-forward reconstruction framework that leverages historical information to enable novel view streaming, without relying on multi-view video datasets for training.
• We develop an efficient sparse voxel transformer with a coarse-to-fine voxel representation, outperforming existing feed-forward Gaussian splatting methods.
• Experiment results demonstrate that our proposed framework enhances novel view synthesis for streaming scene reconstruction, providing better visual quality and reduced temporal artifacts through history-aware modeling.

MrNeRF

10,823 views • 1 year ago

We have released the code and weights for our #CVPR2023 paper "Avatars Grow Legs: Generating Smooth Human Motion from Sparse Tracking Inputs with Diffusion Model"! code: abs: project: The demo is below:

Artsiom Sanakoyeu

35,710 views • 3 years ago

WeatherEdit: Controllable Weather Editing with 4D Gaussian Field
Contributions:
1. Based on our analysis of weather editing characteristics, we introduce WeatherEdit, a comprehensive and efficient framework for realistic and controllable weather generation. Compared with existing methods that focus on either background editing or static weather effects, a progressive 2D-to-4D transformation process in WeatherEdit enhances adaptability across a wider range of scenarios.
2. We introduce an all-in-one adapter to enable a diffusion model for multi-weather (snowy, rainy, and fog) synthesis, along with a Temporal-View attention to ensure consistent editing across multi-frame and multi-view.
3. We design a 4D Gaussian field for weather particle modeling, enabling plausible simulation of raindrops, snowflakes, and fog with controllable severity.
4. We demonstrate WeatherEdit’s effectiveness in generating realistic, consistent, and controllable weather effects in 3D driving scenes, showcasing its applicability to real-world scenarios.

MrNeRF

10,607 views • 11 months ago

🚀New paper out - We present Video-MSG (Multimodal Sketch Guidance), a novel planning-based training-free guidance method for T2V models, improving control of spatial layout and object trajectories.
🔧 Key idea:
• Generate a Video Sketch: a spatio-temporal plan with background, foreground, and motion in the pixel space.
• Encode this structure directly into the latent space of the diffusion model during generation, which does not require fine-tuning or additional memory during inference. 🧵

Jialu Li

35,060 views • 1 year ago

Depth Any Video with Scalable Synthetic Data
AI physicists and chemists continue to make strides in depth estimation from video. Check out this new paper featuring some impressive examples. See the thread for more details (unfortunately no code yet).
Abstract: Video depth estimation has long been hindered by the scarcity of consistent and scalable ground truth data, leading to inconsistent and unreliable results. In this paper, we introduce Depth Any Video, a model that tackles the challenge through two key innovations. First, we develop a scalable synthetic data pipeline, capturing real-time video depth data from diverse game environments, yielding 40,000 video clips of 5-second duration, each with precise depth annotations. Second, we leverage the powerful priors of generative video diffusion models to handle real-world videos effectively, integrating advanced techniques such as rotary position encoding and flow matching to further enhance flexibility and efficiency. Unlike previous models, which are limited to fixed-length video sequences, our approach introduces a novel mixed-duration training strategy that handles videos of varying lengths and performs robustly across different frame rates, even on single frames. At inference, we propose a depth interpolation method that enables our model to infer high-resolution video depth across sequences of up to 150 frames. Our model outperforms all previous generative depth models in terms of spatial accuracy and temporal consistency.

MrNeRF

27,428 views • 1 year ago

Adaptive and Temporally Consistent Gaussian Surfels for Multi-view Dynamic Reconstruction
Contributions:
• A method for efficiently reconstructing dynamic surfaces from multi-view videos using Gaussian surfels.
• A unified and gradient-aware densification strategy for optimizing dynamic 3D Gaussians with fine details.
• A temporal consistency approach that ensures stable and coherent surface reconstructions across frames by enforcing consistency on curvature maps.
• Extensive experiments that demonstrate our method’s advantages including fast training, high-fidelity novel view synthesis, and accurate surface geometry.

MrNeRF

31,798 views • 1 year ago

3D Gaussian Splatting for Real-Time Radiance Field Rendering
paper page:
Radiance Field methods have recently revolutionized novel-view synthesis of scenes captured with multiple photos or videos. However, achieving high visual quality still requires neural networks that are costly to train and render, while recent faster methods inevitably trade off speed for quality. For unbounded and complete scenes (rather than isolated objects) and 1080p resolution rendering, no current method can achieve real-time display rates. We introduce three key elements that allow us to achieve state-of-the-art visual quality while maintaining competitive training times and importantly allow high-quality real-time (>= 30 fps) novel-view synthesis at 1080p resolution. First, starting from sparse points produced during camera calibration, we represent the scene with 3D Gaussians that preserve desirable properties of continuous volumetric radiance fields for scene optimization while avoiding unnecessary computation in empty space; Second, we perform interleaved optimization/density control of the 3D Gaussians, notably optimizing anisotropic covariance to achieve an accurate representation of the scene; Third, we develop a fast visibility-aware rendering algorithm that supports anisotropic splatting and both accelerates training and allows realtime rendering. We demonstrate state-of-the-art visual quality and real-time rendering on several established datasets.

AK

633,252 views • 2 years ago

We are pleased to announce the availability of Stable Video 4D, our very first video-to-video generation model that allows users to upload a single video and receive dynamic novel-view videos of eight new angles, delivering a new level of versatility and creativity. In conjunction with this announcement, we are releasing a comprehensive technical report detailing the methodologies, challenges, and breakthroughs achieved during the development of this model. Learn more about this release and access the report here:

Stability AI

131,114 views • 1 year ago

We discovered that imposing a spatio-temporal weight space via LoRAs on DIT-based video models unlocks powerful customization! It captures dynamic concepts with precision and even enables composition of multiple videos together!🎥✨

Kfir Aberman ✈️ CVPR

59,495 views • 1 year ago

Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation
paper page:
Large text-to-image diffusion models have exhibited impressive proficiency in generating high-quality images. However, when applying these models to video domain, ensuring temporal consistency across video frames remains a formidable challenge. This paper proposes a novel zero-shot text-guided video-to-video translation framework to adapt image models to videos. The framework includes two parts: key frame translation and full video translation. The first part uses an adapted diffusion model to generate key frames, with hierarchical cross-frame constraints applied to enforce coherence in shapes, textures and colors. The second part propagates the key frames to other frames with temporal-aware patch matching and frame blending. Our framework achieves global style and local texture temporal consistency at a low cost (without re-training or optimization). The adaptation is compatible with existing image diffusion techniques, allowing our framework to take advantage of them, such as customizing a specific subject with LoRA, and introducing extra spatial guidance with ControlNet. Extensive experimental results demonstrate the effectiveness of our proposed framework over existing methods in rendering high-quality and temporally-coherent videos.

AK

375,080 views • 3 years ago

NVIDIA just released a very impressive text-to-video paper. Video Latent Diffusion Models (Video LDMs) use a diffusion model in a compressed latent space to generate high-resolution videos. Here's a brief overview of how it works:
1. Pre-train image LDM on a dataset of images.
2. Turn the image LDM into a Video LDM by adding temporal layers to model video frames.
3. Fine-tune the Video LDM on encoded video sequences to create a video generator.
4. Temporally align diffusion model upsamplers to generate high-resolution videos.
5. Validate Video LDM on real driving videos of 512x1024 resolution, achieving state-of-the-art performance.
6. Apply the approach in creative content creation with text-to-video modeling.
Paper: Project:

Lior Alexander

158,539 views • 3 years ago

1/ Happy to share VADER: Video Diffusion Alignment via Reward Gradients. We adapt foundational video diffusion models using pre-trained reward models to generate high-quality, aligned videos for various end-applications. Below we generated a short movie using VADER 😀, we used ChatGPT to write a script and an off-the-shelf AI music generator to generate the sound. Our code & weights are open-sourced:

Mihir Prabhudesai

13,368 views • 1 year ago

(1/2) MonoNPHM will be presented as a #CVPR2024 Highlight! Our Neural Parametric Head Model parametrizes both geometry and appearance. With the learned model, we can then 3D reconstruct and track human heads from images or videos.

Matthias Niessner

17,209 views • 2 years ago

The context size of video world models is only a few frames. Like a human with severe memory loss! We design a long-term memory for world models based on explicit 3D representations inspired by the human mind. This enables long-term consistency. 1/3

Gordon Wetzstein

34,796 views • 1 year ago

we open sourced the code to transform human videos into robot trajectories, so you can train robots with your hands 👐🏻 we used it in our recent paper R+X: Retrieval and Execution from Everyday Human Videos (ICRA 2025 🇺🇸) link and details in thread 🧵

Norman Di Palo

10,578 views • 1 year ago

How can a visuomotor policy learn from internet videos? We introduce Dreamitate, where a robot uses a fine-tuned video diffusion model to dream the future (top) and imitate the dream to accomplish a task (bottom). website: paper:

Ruoshi Liu

50,787 views • 1 year ago