4DGT: Learning a 4D Gaussian Transformer Using Real-World Monocular Videos

Abstract: We propose 4DGT, a 4D Gaussian-based Transformer model for dynamic scene reconstruction, trained entirely on real-world monocular posed videos. Using 4D Gaussian as an inductive bias, 4DGT unifies static and dynamic components, enabling the modeling of complex, time-varying environments with varying object lifespans. We introduced a novel density control strategy in training, which allows our 4DGT to handle longer space-time input while maintaining efficient rendering at runtime. Our model processes 64 consecutive posed frames in a rolling-window fashion, predicting consistent 4D Gaussians in the scene. Unlike optimization-based methods, 4DGT performs purely feed-forward inference, reducing reconstruction time from hours to seconds and scaling effectively to long video sequences. Trained only on large-scale monocular posed video datasets, 4DGT can significantly outperform prior Gaussian-based networks in real-world videos and achieve on-par accuracy with optimization-based methods on cross-domain videos.
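To make the "64 consecutive posed frames in a rolling-window fashion" idea concrete, here is a minimal sketch of how a long video could be split into windows of frame indices. The function name `rolling_windows` and the `stride` parameter are assumptions made for illustration; the abstract does not say whether windows overlap or how the model handles the last partial window.

```python
from typing import Iterator, Tuple


def rolling_windows(
    num_frames: int, window: int = 64, stride: int = 64
) -> Iterator[Tuple[int, int]]:
    """Yield half-open [start, end) frame ranges that cover the whole video.

    window: frames per forward pass (64 in the abstract).
    stride: hypothetical step between windows; stride < window gives overlap.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    start = 0
    while start < num_frames:
        end = min(start + window, num_frames)
        yield (start, end)
        if end == num_frames:
            # The last (possibly shorter) window reached the end of the video.
            break
        start += stride


# Example: a 150-frame clip split into non-overlapping windows.
windows = list(rolling_windows(150))
overlapping = list(rolling_windows(100, window=64, stride=32))
```

Each range would be passed to the network together with the matching camera poses. Choosing an overlapping stride is one plausible way to keep predictions consistent across window boundaries, but that is a guess, not something the post states.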

MrNeRF

15,896 subscribers

34,782 views • 1 year ago • via X (Twitter)

11 comments

MrNeRF • 1 year ago

Paper: not yet Project: "4DGT takes a series of monocular frames with poses as input. During training, we subsample the temporal frames at different granularity and use all images for supervision. In stage one, we train 4DGT to predict pixel-aligned Gaussians at coarse resolution. In stage two, we prune a majority of non-activated Gaussians based on the histograms of per-patch activation channels and densify the Gaussian prediction by increasing the input token samples in both space and time. At inference time, we run the 4DGT network trained after stage two, which supports dense video frames input at high resolution."
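The stage-two pruning step can be pictured with a small sketch. It assumes each patch carries one scalar activation in [0, 1] and that a fixed fraction of patches is kept; `prune_by_histogram`, `keep_fraction`, and `bins` are hypothetical names and values, and the actual criterion used by 4DGT is not described in this post.

```python
from typing import List, Sequence


def prune_by_histogram(
    activations: Sequence[float], keep_fraction: float = 0.2, bins: int = 10
) -> List[bool]:
    """Return a keep-mask chosen from a histogram of per-patch activations.

    Bins are walked from the most activated downward until roughly
    keep_fraction of the patches are covered; everything below is pruned.
    """
    if not activations:
        return []

    def bin_index(a: float) -> int:
        # Clamp so values outside [0, 1] fall into the first or last bin.
        return min(max(int(a * bins), 0), bins - 1)

    counts = [0] * bins
    for a in activations:
        counts[bin_index(a)] += 1

    target = keep_fraction * len(activations)
    kept = 0
    threshold_bin = bins - 1
    for b in range(bins - 1, -1, -1):
        kept += counts[b]
        threshold_bin = b
        if kept >= target:
            break

    # Keep a patch if it falls in the threshold bin or any higher bin.
    return [bin_index(a) >= threshold_bin for a in activations]


# Example: mostly inactive patches with two clearly activated ones.
scores = [0.05, 0.02, 0.95, 0.15, 0.85, 0.01, 0.03, 0.55, 0.04, 0.06]
mask = prune_by_histogram(scores, keep_fraction=0.2)
```

In this toy example only the two strongly activated patches survive. The freed budget is what would allow denser input tokens in space and time during stage two, as the quoted description says.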

MrNeRF • 1 year ago

Paper:

Pablo Vela • 1 year ago

Wow this looks really sick

MrNeRF • 1 year ago

Yeah, and the clip is super long.

Micky Abir • 1 year ago

people don’t realize how huge this is

MrNeRF • 1 year ago

long long videos, yeah!

TessyVFXR • 1 year ago

The fact that I can't get my head off this for the past few days... For me, it is that much needed tool that unlocks a lot.

James | 🤖 • 1 year ago

Awesome. Looking forward to trying this out!

MrNeRF • 1 year ago

I'm crafting an email newsletter that turns my daily updates into a captivating weekly digest, complete with exclusive content. Although it's not live yet, you can sign up now! If you're curious, visit my website and join the subscriber list today!

Mars (parody) • 1 year ago

the future is beaming into reality gaaah this is so exciting

MrNeRF • 1 year ago

Pretty good for monocular footage. The videos are also very long!

Related videos

EnvGS: Modeling View-Dependent Appearance with Environment Gaussian Contributions: • We propose a novel scene representation for accurately modeling complex near-field and high-frequency reflections in real-world environments. • We developed a real-time ray-tracing renderer for 2DGS, enabling joint optimization of our representation for accurate scene reconstruction while achieving real-time rendering speeds. • Extensive experiments show that EnvGS significantly outperforms previous methods. To the best of our knowledge, EnvGS is the first method to achieve real-time photorealistic specular reflections synthesis in real-world scenes.

MrNeRF

44,650 views • 1 year ago

Gaussian Garments: Reconstructing Simulation-Ready Clothing with Photorealistic Appearance from Multi-View Video Contribution quote from the paper: In summary, our main contributions are • a comprehensive pipeline for reconstructing the shape, appearance, and behavior of real-world garments using Gaussian splatting, • an algorithm for registering garment meshes to multi-view videos with an optimization procedure based on Gaussian splatting, and • a Gaussian Garment representation that combines triangle meshes with Gaussian textures to capture photorealistic appearance and can be used as a fully controllable 3D asset.

MrNeRF

27,277 views • 1 year ago

RT-Splatting: Joint Reflection-Transmission Modeling with Gaussian Splatting Contributions: • We introduce a unified surface-volume Gaussian scene representation for jointly modeling sharp specular reflections and clear transmission in real-world scenes containing thin semi-transparent surfaces. • We propose Specular-Aware Gradient Gating to suppress misleading gradients from complex specular regions, substantially reducing floaters in the transmission branch. • Extensive experiments demonstrate that RT-Splatting significantly outperforms prior methods while maintaining real-time rendering and enabling flexible scene editing.

MrNeRF

28,278 views • 2 months ago

SceNeRFlow: Time-Consistent Reconstruction of General Dynamic Scenes abs: paper page: Existing methods for the 4D reconstruction of general, non-rigidly deforming objects focus on novel-view synthesis and neglect correspondences. However, time consistency enables advanced downstream tasks like 3D editing, motion analysis, or virtual-asset creation. We propose SceNeRFlow to reconstruct a general, non-rigid scene in a time-consistent manner. Our dynamic-NeRF method takes multi-view RGB videos and background images from static cameras with known camera parameters as input. It then reconstructs the deformations of an estimated canonical model of the geometry and appearance in an online fashion. Since this canonical model is time-invariant, we obtain correspondences even for long-term, long-range motions. We employ neural scene representations to parametrize the components of our method. Like prior dynamic-NeRF methods, we use a backwards deformation model. We find non-trivial adaptations of this model necessary to handle larger motions: We decompose the deformations into a strongly regularized coarse component and a weakly regularized fine component, where the coarse component also extends the deformation field into the space surrounding the object, which enables tracking over time. We show experimentally that, unlike prior work that only handles small motion, our method enables the reconstruction of studio-scale motions.

AK

76,380 views • 2 years ago

Fast View Synthesis of Casual Videos paper page: Novel view synthesis from an in-the-wild video is difficult due to challenges like scene dynamics and lack of parallax. While existing methods have shown promising results with implicit neural radiance fields, they are slow to train and render. This paper revisits explicit video representations to synthesize high-quality novel views from a monocular video efficiently. We treat static and dynamic video content separately. Specifically, we build a global static scene model using an extended plane-based scene representation to synthesize temporally coherent novel video. Our plane-based scene representation is augmented with spherical harmonics and displacement maps to capture view-dependent effects and model non-planar complex surface geometry. We opt to represent the dynamic content as per-frame point clouds for efficiency. While such representations are inconsistency-prone, minor temporal inconsistencies are perceptually masked due to motion. We develop a method to quickly estimate such a hybrid video representation and render novel views in real time. Our experiments show that our method can render high-quality novel views from an in-the-wild video with comparable quality to state-of-the-art methods while being 100x faster in training and enabling real-time rendering.

AK

20,668 views • 2 years ago

This seemingly obvious prediction didn't take long to become reality. MASt3R-SLAM: Real-Time Dense SLAM with 3D Reconstruction Priors Contributions: • The first real-time SLAM system using the two-view 3D reconstruction prior MASt3R [20] as a foundation. • Efficient techniques for pointmap matching, tracking and local fusion, graph construction and loop closure, and second-order global optimization. • A state-of-the-art dense SLAM system capable of handling generic, time-varying camera models. Abstract: We present a real-time monocular dense SLAM system, designed from the ground up using MASt3R, a two-view 3D reconstruction and matching prior. Equipped with this strong prior, our system remains robust on in-the-wild video sequences, making no assumptions on a fixed or parametric camera model beyond a unique camera center. Key features include: - Efficient methods for pointmap matching, camera tracking, and local fusion - Graph construction and loop closure - Second-order global optimization With known calibration, a simple modification achieves state-of-the-art performance across various benchmarks. Altogether, we propose a plug-and-play monocular SLAM system capable of producing globally-consistent poses and dense geometry while operating at 15 FPS.

MrNeRF

29,961 views • 1 year ago

[SIGGRAPH 2025] Photoreal Scene Reconstruction from an Egocentric Device Contributions: 1. We address the importance of employing visual-inertial bundle adjustment (VIBA) that accounts for the rolling-shutter behavior of the RGB camera. This provides a continuous camera trajectory to model pixel movement in neural reconstruction. Our experiments demonstrate that using VIBA consistently improves the novel view quality in Gaussian Splatting by +1 dB in PSNR. 2. We introduce a rasterization-based image formulation pipeline that addresses common artifacts in physical image formation, including rolling shutter, lens shading, exposure, and gain compensation. Our approach is distinct in that we represent image poses as posed pixel arrays sampled from a continuous trajectory, rather than assigning a single camera pose per image, and preserve the merit of Gaussian rasterization. Unlike existing methods that require ray-tracing Gaussians, e.g., [Moenne-Loccoz et al. 2024], our formulation is applicable to general-purpose rasterization-based Gaussian splatting. When applied to 3D Gaussian Splatting (3DGS) [Kerbl et al. 2023], our approach can further enhance reconstruction quality by +1 dB. We outperform existing baselines and demonstrate a substantial quality improvement in handling complex scenes observed by egocentric devices. 3. To reduce the effect of blur from rapid head motion in darker indoor scenes, we propose a strategy of deliberately underexposing input videos during capture, inspired by HDR+ [Hasinoff et al. 2016]. We demonstrate that we can reconstruct high-quality, noise-free scene radiance from noisy, dim input videos, and further render sharp, blur-free videos at a higher dynamic range.

MrNeRF

15,244 views • 1 year ago

[SIGGRAPH Asia '24 (TOG)] Representing Long Volumetric Video with Temporal Gaussian Hierarchy Contributions: • We introduce a novel, efficient, and expressive Temporal Gaussian Hierarchy representation for long volumetric video. To our knowledge, our method is the first approach capable of handling minutes of volumetric video data. • We propose a Compact Appearance Model and a new rasterization implementation to facilitate real-time, high-quality dynamic view synthesis while maintaining a compact size. • We propose a system to efficiently model long volumetric videos for the first time and demonstrate state-of-the-art dynamic view synthesis quality on the Neural3DV [Li et al. 2022], ENeRF-Outdoor [Lin et al. 2022], and MobileStage [Xu et al. 2024b] datasets, while also achieving the best rendering speed with reduced training cost and memory usage.

MrNeRF

79,379 views • 1 year ago

[SIGGRAPH '25] Monocular Online Reconstruction with Enhanced Detail Preservation Abstract (excerpt): Our approach addresses two key challenges in monocular online reconstruction: 1. Distributing Gaussians without relying on depth maps. 2. Ensuring both local and global consistency in the reconstructed maps. To achieve this, we introduce two key modules: - Hierarchical Gaussian Management Module: For effective Gaussian distribution. - Global Consistency Optimization Module: For maintaining alignment and coherence at all scales. In addition, we present the Multi-level Occupancy Hash Voxels (MOHV), a structure that regularizes Gaussians to capture details across multiple levels of granularity. MOHV ensures accurate reconstruction of both fine and coarse geometries and textures, preserving intricate details while maintaining overall structural integrity. Compared to state-of-the-art RGB-only and even RGB-D methods, our framework achieves superior reconstruction quality with high computational efficiency.

MrNeRF

23,638 views • 1 year ago

[NeurIPS '24] DreamMesh4D: Video-to-4D Generation with Sparse-Controlled Gaussian-Mesh Hybrid Representation Abstract (excerpt) We introduce DreamMesh4D, a novel framework that combines mesh representation with sparse-controlled deformation technique to generate high-quality 4D object from a monocular video. To overcome the limitation of classical texture representation, we bind Gaussian splats to the surface of the triangular mesh for differentiable optimization of both the texture and mesh vertices. In particular, DreamMesh4D begins with a coarse mesh provided by a single image based 3D generation method. Sparse points are then uniformly sampled across the surface of the mesh, and are used to build a deformation graph to drive the motion of the 3D object for the sake of computational efficiency and providing additional constraint. For each step, transformations of sparse control points are predicted using a deformation network, and the mesh vertices as well as the bound surface Gaussians are deformed via a geometric skinning algorithm. The skinning algorithm is a hybrid approach combining LBS (linear blending skinning) and DQS (dual-quaternion skinning), mitigating drawbacks associated with both approaches. The static surface Gaussians and mesh vertices as well as the dynamic deformation network are learned via reference view photometric loss, score distillation loss as well as other regularization losses in a two-stage manner. Extensive experiments demonstrate that our method outperforms prior video-to-4D generation methods in terms of rendering quality and spatial-temporal consistency.

MrNeRF

12,323 views • 1 year ago

GPS-Gaussian+: Generalizable Pixel-wise 3D Gaussian Splatting for Real-Time Human-Scene Rendering from Sparse Views TL;DR: Are we witnessing the first steps towards 3DGS live streaming? Contributions: • We introduce a generalizable 3D Gaussian Splatting methodology that employs pixel-wise Gaussian parameter maps defined on 2D source image planes to formulate 3D Gaussians in a feed-forward manner. • We propose a fully differentiable framework composed of an iterative depth estimation module and a Gaussian parameter regression module. The intermediate depth prediction bridges the two components and allows them to benefit from joint training. • We introduce a regularization term and an epipolar attention mechanism to preserve geometry consistency between the two source views when using only rendering loss. Our method generalizes well to unseen characters even in complicated scenes. • We develop a real-time FVV system that achieves high-resolution rendering of characters in the scene without any geometry supervision.

MrNeRF

25,862 views • 1 year ago

[SIGGRAPH '26] Anchored Temporal Gaussian Splatting for Long Volumetric Video Representation TL;DR: We present ATGS, a novel framework for volumetric video reconstruction that effectively handles long sequences and complex motions. By utilizing time-conditioned anchors and a temporal windowing strategy, ATGS enhances temporal coherence and scalability. Abstract (excerpt): Key insight is that explicitly tracking long term complex motion with individual Gaussian primitives is inherently unstable. Instead, we organize Gaussians around time conditioned anchors that localize their spatial and temporal support, thereby reducing long range motion complexity. We further introduce a temporal windowing strategy to activate only anchors relevant to the queried time, which improves scalability and temporal coherence. In addition, to ensure spatial and temporal stability, we design a compact set of multi level anchor features that encode global features, local spatial features, and local temporal features, jointly constraining Gaussian generation. Extensive experiments demonstrate that ATGS consistently outperforms prior methods on long sequence volumetric videos with complex motions.

MrNeRF

26,905 views • 3 months ago

GS^3: Efficient Relighting with Triple Gaussian Splatting Abstract: We present a spatial and angular Gaussian based representation and a triple splatting process, for real-time, high-quality novel lighting-and-view synthesis from multi-view point-lit input images. To describe complex appearance, we employ a Lambertian plus a mixture of angular Gaussians as an effective reflectance function for each spatial Gaussian. To generate self-shadow, we splat all spatial Gaussians towards the light source to obtain shadow values, which are further refined by a small multi-layer perceptron. To compensate for other effects like global illumination, another network is trained to compute and add a per-spatial-Gaussian RGB tuple. The effectiveness of our representation is demonstrated on 30 samples with a wide variation in geometry (from solid to fluffy) and appearance (from translucent to anisotropic), as well as using different forms of input data, including rendered images of synthetic/reconstructed objects, photographs captured with a handheld camera and a flash, or from a professional lightstage. We achieve a training time of 40-70 minutes and a rendering speed of 90 fps on a single commodity GPU. Our results compare favorably with state-of-the-art techniques in terms of quality/performance.

MrNeRF

17,786 views • 1 year ago

Spatio-Temporal Reconstruction Model for Large-Scale Outdoor Scenes Contributions: • We propose STORM, the first feed-forward, self-supervised method for fast and accurate reconstruction of dynamic 3D scenes from sparse, multi-timestep, posed camera images. • Our bottom-up framework aggregates and transforms per-frame 3D Gaussian Splats into a cohesive scene representation, enabling self-supervised motion estimation. Furthermore, we introduce motion tokens that capture common motion primitives and regularize motion predictions, facilitating dynamic motion group segmentation without explicit motion or correspondence supervision. • We present several enhancements for in-the-wild scenarios, including sky modeling, camera exposure inconsistency handling, large novel-view extrapolation, and fine-grained human motions reconstruction, making STORM well-suited for real-world applications.

MrNeRF

53,292 views • 1 year ago

China open-sourced a model that reconstructs any scene in 3D from a regular video, in real-time. one camera. no LiDAR. 10,000+ frames without falling apart. just walk around with your camera and watch the entire world get rebuilt in 3D at 20 fps. → runs at ~20 FPS on a single GPU → Stable over 10,000+ frames → Beats optimization-based methods on benchmarks → Works on drone footage, driving videos, indoor walkthroughs 100% open source.

Yasir Ai

251,249 views • 6 days ago

China open-sourced a model that reconstructs any scene in 3D from a regular video, in real-time. one camera. no LiDAR. 10,000+ frames without falling apart. just walk around with your camera and watch the entire world get rebuilt in 3D at 20 fps. → runs at ~20 FPS on a single GPU → Stable over 10,000+ frames → Beats optimization-based methods on benchmarks → Works on drone footage, driving videos, indoor walkthroughs 100% open source.

Superman

1,428,746 views • 8 days ago

A breakthrough in real-time video generation. As a research preview developed with NVIDIA and shared at NVIDIAGTC this week, we trained a new real-time video model running on Vera Rubin. HD videos generate instantly, with time-to-first-frame under 100ms. Unlocking an entirely new creative paradigm and bolstering the foundations of our General World Model, GWM-1. Real-time generation opens a fundamentally different design space for video models and world simulation. We're investing in co-designing our models alongside advances in hardware to keep pushing this frontier.

Runway

1,162,775 views • 4 months ago

SqueezeMe: Efficient Gaussian Avatars for VR TL;DR: Three of these Gaussian Splatting avatars can be run at 72 frames per second. It runs locally on a Meta Quest 3 VR headset. Abstract (excerpt): While previous methods require a desktop GPU for real-time inference of a single avatar, we aim to squeeze multiple Gaussian avatars onto a portable virtual reality headset with real-time drivable inference. We begin by training a previous work, Animatable Gaussians, on a high-quality dataset captured with 512 cameras. The Gaussians are animated by controlling a base set of Gaussians with linear blend skinning (LBS) motion, and then further adjusting them with a neural network decoder to correct their appearance. When deploying the model on a Meta Quest 3 VR headset, we find two major computational bottlenecks: the decoder and the rendering. To accelerate the decoder, we train the Gaussians in UV-space instead of pixel-space and distill the decoder to a single neural network layer. Further, we discover that neighborhoods of Gaussians can share a single corrective from the decoder, providing an additional speedup. To accelerate the rendering, we develop a custom pipeline in Vulkan that runs on the mobile GPU. Putting it all together, we run 3 Gaussian avatars concurrently at 72 FPS on a VR headset.
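
The linear blend skinning (LBS) step mentioned in the abstract can be illustrated with a minimal sketch. This is the generic textbook formulation applied to a single point (for example, a Gaussian center), not the SqueezeMe implementation; the function name `blend_skinning` and the 3x4 row-major `[R|t]` transform layout are assumptions made for this example.

```python
def blend_skinning(point, transforms, weights):
    """Generic linear blend skinning for one 3D point (illustrative only).

    point      -- (x, y, z) position in the rest pose
    transforms -- list of 3x4 row-major bone transforms [R|t] (assumed layout)
    weights    -- per-bone skinning weights, expected to sum to 1
    """
    x, y, z = point
    out = [0.0, 0.0, 0.0]
    for T, w in zip(transforms, weights):
        for i in range(3):
            # Transform the point by this bone, then accumulate the weighted result.
            out[i] += w * (T[i][0] * x + T[i][1] * y + T[i][2] * z + T[i][3])
    return tuple(out)


# Two hypothetical bones: identity, and a translation by 2 along x.
identity = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
shift_x = [[1, 0, 0, 2], [0, 1, 0, 0], [0, 0, 1, 0]]

moved = blend_skinning((1.0, 1.0, 1.0), [identity, shift_x], [0.5, 0.5])
```

With equal weights the point lands halfway between the two bone-transformed positions. In the pipeline the abstract describes, a neural decoder then applies a further correction on top of this LBS motion.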

MrNeRF

27,104 views • 1 year ago

VideoRF: Rendering Dynamic Radiance Fields as 2D Feature Video Streams paper page: Neural Radiance Fields (NeRFs) excel in photorealistically rendering static scenes. However, rendering dynamic, long-duration radiance fields on ubiquitous devices remains challenging, due to data storage and computational constraints. In this paper, we introduce VideoRF, the first approach to enable real-time streaming and rendering of dynamic radiance fields on mobile platforms. At the core is a serialized 2D feature image stream representing the 4D radiance field all in one. We introduce a tailored training scheme directly applied to this 2D domain to impose the temporal and spatial redundancy of the feature image stream. By leveraging the redundancy, we show that the feature image stream can be efficiently compressed by 2D video codecs, which allows us to exploit video hardware accelerators to achieve real-time decoding. On the other hand, based on the feature image stream, we propose a novel rendering pipeline for VideoRF, which has specialized space mappings to query radiance properties efficiently. Paired with a deferred shading model, VideoRF has the capability of real-time rendering on mobile devices thanks to its efficiency. We have developed a real-time interactive player that enables online streaming and rendering of dynamic scenes, offering a seamless and immersive free-viewpoint experience across a range of devices, from desktops to mobile phones.

AK

38,686 views • 2 years ago