Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models Contributions: • We introduce Diffuman4D, a novel diffusion model that generates spatio-temporally consistent and high-resolution (1024p) human videos from sparse-view video inputs. • We propose a sliding iterative denoising mechanism that enhances both the spatial and...

MrNeRF

16,728 subscribers

24,729 views • 1 year ago • via X (Twitter)

Science & Technology Education

Related Videos

DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion TL;DR: Create 3/4DGS from Video Diffusion Note: Some first inference code released (not all yet). Contributions (cited): • We present DimensionX, a novel framework for generating photorealistic 3D and 4D scenes from only a single image using controllable video diffusion. • We propose ST-Director, which decouples the spatial and temporal priors in video diffusion models by learning (spatial and temporal) dimension-aware modules with our curated datasets. We further enhance the hybrid-dimension control with a training-free composition approach according to the essence of video diffusion denoising process. • To bridge the gap between video diffusion and real-world scenes, we design a trajectory-aware mechanism for 3D generation and an identity-preserving denoising approach for 4D generation, enabling more realistic and controllable scene synthesis. • Extensive experiments manifest that our DimensionX delivers superior performance in video, 3D, and 4D generation compared with baseline methods.

MrNeRF

17,047 views • 1 year ago

Wonderland: Navigating 3D Scenes from a Single Image Contributions: • First, we introduce a representation for controllable 3D generation by leveraging the generative priors from camera-guided video diffusion models. Unlike image models, video diffusion models are trained on extensive video datasets. This enables them to capture comprehensive spatial relationships within scenes across multiple views and embed a form of "3D awareness" in their latent space, which allows us to maintain 3D consistency in novel view synthesis. • Second, to achieve controllable novel view generation, we empower video models with precise control over specified camera motions. We introduce a novel dual-branch conditioning mechanism that effectively incorporates desired diverse camera trajectories into the video diffusion model. This enables expansion of a single image into a multi-view consistent capture of a 3D scene with precise pose control. • Third, to achieve efficient 3D reconstruction, we directly transform video latents into 3DGS. We propose a novel latent-based large reconstruction model (LaLRM) that lifts video latents to 3D in a feed-forward manner. With this design, during inference, our model directly predicts 3DGS from a single input image, effectively aligning the generation and reconstruction tasks—and bridging image space and 3D space—through the video latent space. Compared with reconstructing scenes from images, the video latent space offers a 256× spatial-temporal reduction while retaining essential and consistent 3D structural details. Such a high degree of compression is crucial, as it allows the LaLRM to handle a wider range of 3D scenes within the reconstruction framework, with the same memory constraints.

MrNeRF

52,801 views • 1 year ago

SplatVoxel: History-Aware Novel View Streaming without Temporal Training Contributions: • We propose a hybrid Splat-Voxel feed-forward reconstruction framework that leverages historical information to enable novel view streaming, without relying on multi-view video datasets for training. • We develop an efficient sparse voxel transformer with a coarse-to-fine voxel representation, outperforming existing feed-forward Gaussian splatting methods. • Experiment results demonstrate that our proposed framework enhances novel view synthesis for streaming scene reconstruction, providing better visual quality and reduced temporal artifacts through history-aware modeling.

MrNeRF

10,823 views • 1 year ago

WeatherEdit: Controllable Weather Editing with 4D Gaussian Field Contributions: 1. Based on our analysis of weather editing characteristics, we introduce WeatherEdit, a comprehensive and efficient framework for realistic and controllable weather generation. Compared with existing methods that focus on either background editing or static weather effects, a progressive 2D-to-4D transformation process in WeatherEdit enhances adaptability across a wider range of scenarios. 2. We introduce an all-in-one adapter to enable a diffusion model for multi-weather (snowy, rainy, and fog) synthesis, along with a Temporal-View attention to ensure consistent editing across multi-frame and multi-view. 3. We design a 4D Gaussian field for weather particle modeling, enabling plausible simulation of raindrops, snowflakes, and fog with controllable severity. 4. We demonstrate WeatherEdit’s effectiveness in generating realistic, consistent, and controllable weather effects in 3D driving scenes, showcasing its applicability to real-world scenarios.

MrNeRF

10,691 views • 1 year ago

🚀New paper out - We present Video-MSG (Multimodal Sketch Guidance), a novel planning-based training-free guidance method for T2V models, improving control of spatial layout and object trajectories. 🔧 Key idea: • Generate a Video Sketch — a spatio-temporal plan with background, foreground, and motion in the pixel space. • Encode this structure directly into the latent space of the diffusion model during generation, which does not require fine-tuning or additional memory during inference. 🧵

Jialu Li

35,060 views • 1 year ago

Depth Any Video with Scalable Synthetic Data AI physicists and chemists continue to make strides in depth estimation from video. Check out this new paper featuring some impressive examples. See the thread for more details (unfortunately no code yet). Abstract: Video depth estimation has long been hindered by the scarcity of consistent and scalable ground truth data, leading to inconsistent and unreliable results. In this paper, we introduce Depth Any Video, a model that tackles the challenge through two key innovations. First, we develop a scalable synthetic data pipeline, capturing real-time video depth data from diverse game environments, yielding 40,000 video clips of 5-second duration, each with precise depth annotations. Second, we leverage the powerful priors of generative video diffusion models to handle real-world videos effectively, integrating advanced techniques such as rotary position encoding and flow matching to further enhance flexibility and efficiency. Unlike previous models, which are limited to fixed-length video sequences, our approach introduces a novel mixed-duration training strategy that handles videos of varying lengths and performs robustly across different frame rates, even on single frames. At inference, we propose a depth interpolation method that enables our model to infer high-resolution video depth across sequences of up to 150 frames. Our model outperforms all previous generative depth models in terms of spatial accuracy and temporal consistency.

MrNeRF

27,428 views • 1 year ago

3D Gaussian Splatting for Real-Time Radiance Field Rendering paper page: Radiance Field methods have recently revolutionized novel-view synthesis of scenes captured with multiple photos or videos. However, achieving high visual quality still requires neural networks that are costly to train and render, while recent faster methods inevitably trade off speed for quality. For unbounded and complete scenes (rather than isolated objects) and 1080p resolution rendering, no current method can achieve real-time display rates. We introduce three key elements that allow us to achieve state-of-the-art visual quality while maintaining competitive training times and importantly allow high-quality real-time (>= 30 fps) novel-view synthesis at 1080p resolution. First, starting from sparse points produced during camera calibration, we represent the scene with 3D Gaussians that preserve desirable properties of continuous volumetric radiance fields for scene optimization while avoiding unnecessary computation in empty space; Second, we perform interleaved optimization/density control of the 3D Gaussians, notably optimizing anisotropic covariance to achieve an accurate representation of the scene; Third, we develop a fast visibility-aware rendering algorithm that supports anisotropic splatting and both accelerates training and allows realtime rendering. We demonstrate state-of-the-art visual quality and real-time rendering on several established datasets.

AK

633,532 views • 2 years ago

We are pleased to announce the availability of Stable Video 4D, our very first video-to-video generation model that allows users to upload a single video and receive dynamic novel-view videos of eight new angles, delivering a new level of versatility and creativity. In conjunction with this announcement, we are releasing a comprehensive technical report detailing the methodologies, challenges, and breakthroughs achieved during the development of this model. Learn more about this release and access the report here:

Stability AI

131,114 views • 2 years ago

Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation paper page: Large text-to-image diffusion models have exhibited impressive proficiency in generating high-quality images. However, when applying these models to video domain, ensuring temporal consistency across video frames remains a formidable challenge. This paper proposes a novel zero-shot text-guided video-to-video translation framework to adapt image models to videos. The framework includes two parts: key frame translation and full video translation. The first part uses an adapted diffusion model to generate key frames, with hierarchical cross-frame constraints applied to enforce coherence in shapes, textures and colors. The second part propagates the key frames to other frames with temporal-aware patch matching and frame blending. Our framework achieves global style and local texture temporal consistency at a low cost (without re-training or optimization). The adaptation is compatible with existing image diffusion techniques, allowing our framework to take advantage of them, such as customizing a specific subject with LoRA, and introducing extra spatial guidance with ControlNet. Extensive experimental results demonstrate the effectiveness of our proposed framework over existing methods in rendering high-quality and temporally-coherent videos.

AK

375,123 views • 3 years ago

Gaussian Head Avatar: Ultra High-fidelity Head Avatar via Dynamic Gaussians paper page: Creating high-fidelity 3D head avatars has always been a research hotspot, but there remains a great challenge under lightweight sparse view setups. In this paper, we propose Gaussian Head Avatar represented by controllable 3D Gaussians for high-fidelity head avatar modeling. We optimize the neutral 3D Gaussians and a fully learned MLP-based deformation field to capture complex expressions. The two parts benefit each other, thereby our method can model fine-grained dynamic details while ensuring expression accuracy. Furthermore, we devise a well-designed geometry-guided initialization strategy based on implicit SDF and Deep Marching Tetrahedra for the stability and convergence of the training procedure. Experiments show our approach outperforms other state-of-the-art sparse-view methods, achieving ultra high-fidelity rendering quality at 2K resolution even under exaggerated expressions.

AK

65,847 views • 2 years ago

Photorealistic Object Insertion with Diffusion-Guided Inverse Rendering discuss: The correct insertion of virtual objects in images of real-world scenes requires a deep understanding of the scene's lighting, geometry and materials, as well as the image formation process. While recent large-scale diffusion models have shown strong generative and inpainting capabilities, we find that current models do not sufficiently "understand" the scene shown in a single picture to generate consistent lighting effects (shadows, bright reflections, etc.) while preserving the identity and details of the composited object. We propose using a personalized large diffusion model as guidance to a physically based inverse rendering process. Our method recovers scene lighting and tone-mapping parameters, allowing the photorealistic composition of arbitrary virtual objects in single frames or videos of indoor or outdoor scenes. Our physically based pipeline further enables automatic materials and tone-mapping refinement.

AK

19,101 views • 1 year ago

Break-A-Scene: Extracting Multiple Concepts from a Single Image. We introduce the task of textual scene decomposition: given a single image of a scene that may contain several concepts, we aim to extract a distinct text token for each concept, enabling fine-grained control over the generated scenes. To this end, we propose augmenting the input image with masks that indicate the presence of target concepts. These masks can be provided by the user or generated automatically by a pre-trained segmentation model. We then present a novel two-phase customization process that optimizes a set of dedicated textual embeddings (handles), as well as the model weights, striking a delicate balance between accurately capturing the concepts and avoiding overfitting. We employ a masked diffusion loss to enable handles to generate their assigned concepts, complemented by a novel loss on cross-attention maps to prevent entanglement. We also introduce union-sampling, a training strategy aimed to improve the ability of combining multiple concepts in generated images. We use several automatic metrics to quantitatively compare our method against several baselines, and further affirm the results using a user study. Finally, we showcase several applications of our method. paper page:

AK

154,511 views • 3 years ago

Chop the gradients ✂️! We found that truncating decoder gradients in latent video diffusion to a fixed window allows us to finetune on videos with pixel-wise perceptual losses without running out of memory. Pixel losses have been essential for image generation and reconstruction, but until now, they haven't scaled to long-duration, high-resolution video diffusion due to recursive activation accumulation in causal decoders, leading to OOM during training 💥📉. Project: Video diffusion models can do a lot more 🚀 when you can backprop the decoder! Post-process neural rendered scenes, super-resolve videos, harmonize lighting in controlled synthetic driving scenes, and inpaint videos — all in a single step ⚡ with a quick finetune from a standard diffusion model.

Felix Heide

28,323 views • 3 months ago

Self-Calibrating Gaussian Splatting for Large Field of View Reconstruction Note: Check below for full video. Abstract (cited): "In this paper, we present a self-calibrating framework that jointly optimizes camera parameters, lens distortion, and 3D Gaussian representations, enabling accurate and efficient scene reconstruction. Our technique is particularly effective for high-quality scene reconstruction from large field-of-view (FOV) imagery taken with wide-angle lenses, allowing the scene to be modeled from a smaller number of images. We introduce a novel method for modeling complex lens distortions using a hybrid network that combines invertible residual networks with explicit grids. This design effectively regularizes the optimization process, achieving greater accuracy than conventional camera models. Additionally, we propose a cubemap-based resampling strategy to support large FOV images without sacrificing resolution or introducing distortion artifacts. Our method is compatible with the fast rasterization of Gaussian Splatting, adaptable to a wide variety of camera lens distortions, and demonstrates state-of-the-art performance on both synthetic and real-world datasets."

MrNeRF

17,206 views • 1 year ago

Human Hair Reconstruction with Strand-Aligned 3D Gaussians Contributions (cited): – We propose a new 3D line lifting scheme that uses a modified 3DGS reconstruction technique to lift 2D orientation maps into a 3D field while also providing refinement of the camera parameters; – We introduce a dual representation of hair strand polylines and 3D Gaussians to achieve differentiable rasterization of hair strands and leverage photometric constraints for strand-based hair reconstruction; – Based on these components, we propose a coarse-to-fine optimization method for prior-guided hair reconstruction that leverages both latent and explicit representations of the hairstyle.

MrNeRF

106,525 views • 1 year ago

PackUV: Packed Gaussian UV Maps for 4D Volumetric Video - PackUV — A new volumetric video representation that packs 3D Gaussian attributes into a sequence of UV atlases for efficient streaming and storage, making it readily compatible with existing video coding infrastructure. - PackUV-GS — An efficient method to fit PackUV directly from multiview videos using optical-flow-based keyframing and Gaussian labeling to handle large motions, disocclusions, and temporal consistency. - PackUV-2B — The largest multi-view 4D dataset with 2B frames, large motions, and disocclusions. It provides 360° coverage from 50+ synchronized cameras.

MrNeRF

17,035 views • 4 months ago

Is Google taking initial steps to enhance Street View? For some reason, Street View seems stuck in technology that feels outdated. I wonder if we'll see such improvements on the product side. Also, note how much better it performs in all aspects compared to Zip-NeRF in their presented material. It offers more details and fewer artifacts. Great work! "LODGE: Level-of-Detail Large-Scale Gaussian Splatting with Efficient Rendering" Contributions: • We propose a novel LOD representation for 3DGS which, unlike previous methods [27, 28, 17], does not recompute the list of used Gaussians at each frame. This allows for acceleration and compaction, enabling the rendering of large-scale scenes even on mobile devices. • We design a strategy to automatically select optimal hyperparameters for splitting LODs, whereas most other methods require manual tuning of hyperparameters for each 3D scene. • To further accelerate rendering, we split the scene into chunks and pre-compute sets of active Gaussians per chunk. • Finally, we introduce a novel opacity interpolation scheme to produce visually pleasing rendering and eliminate artifacts when transitioning between chunks.

MrNeRF

62,564 views • 1 year ago

FastMap: Revisiting Dense and Scalable Structure from Motion "FASTMAP, a redesigned SfM framework, achieves fast, high-accuracy dense structure from motion. On large scenes with thousands of images, FASTMAP is up to one to two orders of magnitude faster than GLOMAP and COLMAP. ... Importantly, FASTMAP achieves efficiency improvements while keeping comparable performance. Extensive experiments on eight datasets demonstrate pose estimation accuracy and novel view synthesis quality close to GLOMAP and COLMAP." Contributions: 1. For all the iterative nonlinear optimization problems involved, we design algorithms such that the computational complexity of each iteration is only linear in the number of image pairs, not keypoint pairs or 3D points. This includes replacing the traditional bundle adjustment [50] present in previous SfM frameworks with a novel re-weighting epipolar adjustment algorithm, which is much more efficient. 2. Throughout the entire framework, we formulate as many steps as possible as GPU-friendly dense tensor operations. This allows us to implement the entire method in PyTorch [39], which provides seamless GPU acceleration.

MrNeRF

15,233 views • 1 year ago

Generative 3D environments just became a thing with the announcement of OpenAI's new video model, Sora. Michael Rublof took one of those videos and turned it into a NeRF using COLMAP and Nerfstudio. While people are laughing at the topology of generated models, the world is changing around us, and that's exciting and a little scary, but we'll find a way to turn Gen AI into creative superpowers. I believe in human creativity, in our ability to surprise, to move, to change and to challenge. Here's to the future! #ai #art

Martin Nebelong

288,953 views • 2 years ago

DroneSplat: 3D Gaussian Splatting for Robust 3D Reconstruction from In-the-Wild Drone Imagery Abstract: Drones have become essential tools for reconstructing wild scenes due to their outstanding maneuverability. Recent advances in radiance field methods have achieved remarkable rendering quality, providing a new avenue for 3D reconstruction from drone imagery. However, dynamic distractors in wild environments challenge the static scene assumption in radiance fields, while limited view constraints hinder the accurate capture of underlying scene geometry. To address these challenges, we introduce DroneSplat, a novel framework designed for robust 3D reconstruction from in-the-wild drone imagery. Our method adaptively adjusts masking thresholds by integrating local-global segmentation heuristics with statistical approaches, enabling precise identification and elimination of dynamic distractors in static scenes. We enhance 3D Gaussian Splatting with multi-view stereo predictions and a voxel-guided optimization strategy, supporting high-quality rendering under limited view constraints. For comprehensive evaluation, we provide a drone-captured 3D reconstruction dataset encompassing both dynamic and static scenes. Extensive experiments demonstrate that DroneSplat outperforms both 3DGS and NeRF baselines in handling in-the-wild drone imagery.

MrNeRF

21,346 views • 1 year ago