[SIGGRAPH '25] TeGA: Texture Space Gaussian Avatars for High-Resolution Dynamic Head Modeling

Note: On the left that's a 3DGS rendering!

Contributions:
1. We propose a simple approach for rigging 3D Gaussians within the continuous tangent space of 3DMM face models, allowing Gaussians to move freely across mesh triangles.
2. ...

MrNeRF

16,769 subscribers

29,010 views • 1 year ago • via X (Twitter)

Art, Science, Technology, Education

7 comments

MrNeRF • 1 year ago

Paper: Project:

Reji Modiyil • 1 year ago

@waitin4agi_ @waitin4agi_, impressive developments in dynamic head modeling. the future looks bright in avatar technology.

Tibo on Tech • 1 year ago

TeGA seems to be pushing the boundaries of what's possible in dynamic head modeling.

Sean Brynjólfsson • 1 year ago

“Note: On the left that’s a 3DGS rendering!” 🤯

Yash Chopda • 1 year ago

@waitin4agi_ This sounds like a fascinating development in head modeling.

Mohammed Lubbad, PhD • 1 year ago

@waitin4agi_ Incorporating realistic textures into head modeling could revolutionize virtual interaction. What other advancements might transform user experiences? 🌐 #Innovation

Stefan Larsen • 1 year ago

No speaking?

Related videos

Nvidia announces GAvatar: Animatable 3D Gaussian Avatars with Implicit Mesh Learning paper page: Gaussian splatting has emerged as a powerful 3D representation that harnesses the advantages of both explicit (mesh) and implicit (NeRF) 3D representations. In this paper, we seek to leverage Gaussian splatting to generate realistic animatable avatars from textual descriptions, addressing the limitations (e.g., flexibility and efficiency) imposed by mesh or NeRF-based representations. However, a naive application of Gaussian splatting cannot generate high-quality animatable avatars and suffers from learning instability; it also cannot capture fine avatar geometries and often leads to degenerate body parts. To tackle these problems, we first propose a primitive-based 3D Gaussian representation where Gaussians are defined inside pose-driven primitives to facilitate animation. Second, to stabilize and amortize the learning of millions of Gaussians, we propose to use neural implicit fields to predict the Gaussian attributes (e.g., colors). Finally, to capture fine avatar geometries and extract detailed meshes, we propose a novel SDF-based implicit mesh learning approach for 3D Gaussians that regularizes the underlying geometries and extracts highly detailed textured meshes. Our proposed method, GAvatar, enables the large-scale generation of diverse animatable avatars using only text prompts. GAvatar significantly surpasses existing methods in terms of both appearance and geometry quality, and achieves extremely fast rendering (100 fps) at 1K resolution.

AK

140,992 views • 2 years ago

ScaffoldAvatar: High-Fidelity Gaussian Avatars with Patch Expressions (#SIGGRAPH) We reconstruct ultra-high fidelity photorealistic 3D avatars capable of generating realistic and high-quality animations including freckles and other fine facial details. We operate on patch-based local expression features and increase the representation capacity by synthesizing 3D Gaussians dynamically by leveraging tiny scaffold MLPs conditioned on localized expressions. We further propose a color-based densification and progressive training scheme for improved quality and faster convergence. Project: Video: Great work by Shivangi, Sebastian Weiss, Irene Baeza, Prashanth Chandran, Gaspard Zoss, Derek Bradley

Matthias Niessner

17,232 views • 11 months ago

SuperGSeg: Open-Vocabulary 3D Segmentation with Structured Super-Gaussians Contributions: • We propose SuperGSeg: a 3D segmentation method with neural Gaussians, designed to learn hierarchical instance segmentation features from 2D foundation models. • We introduce the concept of Super-Gaussian, a novel representation that integrates hierarchical instance segmentation features, enabling the embedding of high-dimensional language features. This approach addresses previously unfeasible challenges in representing complex scenes with rich semantic details. • Extensive experiments on the LERF-OVS and ScanNet datasets demonstrate the effectiveness of the proposed method, achieving significant improvements in open-vocabulary 3D object-level and scene-level semantic segmentation. It shows particular strength in capturing fine-grained scene details and dense pixel semantic segmentation tasks for the first time.

MrNeRF

13,594 views • 1 year ago

[NeurIPS '24] DreamMesh4D: Video-to-4D Generation with Sparse-Controlled Gaussian-Mesh Hybrid Representation Abstract (excerpt) We introduce DreamMesh4D, a novel framework that combines mesh representation with sparse-controlled deformation technique to generate high-quality 4D object from a monocular video. To overcome the limitation of classical texture representation, we bind Gaussian splats to the surface of the triangular mesh for differentiable optimization of both the texture and mesh vertices. In particular, DreamMesh4D begins with a coarse mesh provided by a single image based 3D generation method. Sparse points are then uniformly sampled across the surface of the mesh, and are used to build a deformation graph to drive the motion of the 3D object for the sake of computational efficiency and providing additional constraint. For each step, transformations of sparse control points are predicted using a deformation network, and the mesh vertices as well as the bound surface Gaussians are deformed via a geometric skinning algorithm. The skinning algorithm is a hybrid approach combining LBS (linear blending skinning) and DQS (dual-quaternion skinning), mitigating drawbacks associated with both approaches. The static surface Gaussians and mesh vertices as well as the dynamic deformation network are learned via reference view photometric loss, score distillation loss as well as other regularization losses in a two-stage manner. Extensive experiments demonstrate that our method outperforms prior video-to-4D generation methods in terms of rendering quality and spatial-temporal consistency.

MrNeRF

12,323 views • 1 year ago
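
As a rough illustration of the linear blend skinning (LBS) step named in the DreamMesh4D abstract above, the sketch below blends the results of several control-point transforms applied to one point. It is a generic textbook LBS, not the paper's hybrid LBS/DQS scheme; the function names, the 3x4 matrix layout, and the weight normalization are assumptions made for this example.

```python
from typing import List, Sequence

# Hypothetical layout: each transform is a 3x4 affine matrix [R | t].
Affine = Sequence[Sequence[float]]


def apply_affine(m: Affine, p: Sequence[float]) -> List[float]:
    """Apply a 3x4 affine transform to a 3D point."""
    return [m[r][0] * p[0] + m[r][1] * p[1] + m[r][2] * p[2] + m[r][3]
            for r in range(3)]


def linear_blend_skinning(point: Sequence[float],
                          transforms: Sequence[Affine],
                          weights: Sequence[float]) -> List[float]:
    """Weighted average of the point as moved by each transform."""
    if len(transforms) != len(weights):
        raise ValueError("need one weight per transform")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    out = [0.0, 0.0, 0.0]
    for m, w in zip(transforms, weights):
        moved = apply_affine(m, point)
        for i in range(3):
            # Normalize so the weights act as a convex combination.
            out[i] += (w / total) * moved[i]
    return out


identity = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
shift_x = [[1, 0, 0, 2], [0, 1, 0, 0], [0, 0, 1, 0]]  # translate by +2 in x
blended = linear_blend_skinning([1.0, 1.0, 1.0], [identity, shift_x], [0.5, 0.5])
```

With equal weights on an identity transform and a +2 translation in x, the point moves halfway, by +1 in x. Averaging transformed points this way is what makes plain LBS prone to volume loss under large rotations, which is the usual motivation for mixing in dual-quaternion skinning.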

SqueezeMe: Efficient Gaussian Avatars for VR TL;DR: Three of these Gaussian Splatting avatars can be run at 72 frames per second. It runs locally on a Meta Quest 3 VR headset. Abstract (excerpt): While previous methods require a desktop GPU for real-time inference of a single avatar, we aim to squeeze multiple Gaussian avatars onto a portable virtual reality headset with real-time drivable inference. We begin by training a previous work, Animatable Gaussians, on a high-quality dataset captured with 512 cameras. The Gaussians are animated by controlling a base set of Gaussians with linear blend skinning (LBS) motion, and then further adjusting them with a neural network decoder to correct their appearance. When deploying the model on a Meta Quest 3 VR headset, we find two major computational bottlenecks: the decoder and the rendering. To accelerate the decoder, we train the Gaussians in UV-space instead of pixel-space and distill the decoder to a single neural network layer. Further, we discover that neighborhoods of Gaussians can share a single corrective from the decoder, providing an additional speedup. To accelerate the rendering, we develop a custom pipeline in Vulkan that runs on the mobile GPU. Putting it all together, we run 3 Gaussian avatars concurrently at 72 FPS on a VR headset.

MrNeRF

27,104 views • 1 year ago

Relightable Full-Body Gaussian Codec Avatars TL;DR: First drivable full-body avatar model that reconstructs perceptually realistic relightable appearance. Contributions: • We propose the first relightable full-body avatar model that jointly models the relightable appearance of the human body, face, and hands for high-fidelity relighting and animation. • To handle full-body articulations with global light transport, we propose learnable zonal harmonics to represent local diffuse radiance transfer in the local coordinate frames of each Gaussian. This results in a reduced number of parameters and improved rendering quality compared to the commonly used spherical harmonics representation. • We reformulate the learnable radiance transfer to explicitly decompose non-local shadowing and propose a dedicated shadow network to predict shadows caused by the articulation of the body. Additionally, we propose a physically based irradiance normalization scheme to ensure that the shadow network can generalize to novel illumination conditions, such as unseen environment maps. • We show that deferred shading can be used for our learned specular radiance transfer function, achieving high-fidelity specular reflections for relightable human avatar modeling without excessively increasing the number of Gaussians.

MrNeRF

10,966 views • 1 year ago

Extract accurate depth from your 3D Gaussians! Fast, accurate, and easy to use. Code of RaDe-GS is released! project page: code: paper: No need to change your 3DGS representations: we propose a plug-and-play rasterizer to extract accurate depth and normals from 3D Gaussians, enabling high-quality geometry reconstruction. It is efficient and fast, achieving 0.69 mm error on the DTU dataset in only 5 minutes. The rasterizer is easy to adopt in any of your 3DGS projects.

Xiao-Xiao Long

31,305 views • 2 years ago

[SIGGRAPH 2025] Photoreal Scene Reconstruction from an Egocentric Device Contributions: 1. We address the importance of employing visual-inertial bundle adjustment (VIBA) that accounts for the rolling-shutter behavior of the RGB camera. This provides a continuous camera trajectory to model pixel movement in neural reconstruction. Our experiments demonstrate that using VIBA consistently improves the novel view quality in Gaussian Splatting by +1 dB in PSNR. 2. We introduce a rasterization-based image formulation pipeline that addresses common artifacts in physical image formation, including rolling shutter, lens shading, exposure, and gain compensation. Our approach is distinct in that we represent image poses as posed pixel arrays sampled from a continuous trajectory, rather than assigning a single camera pose per image, and preserve the merit of Gaussian rasterization. Unlike existing methods that require ray-tracing Gaussians, e.g., [Moenne-Loccoz et al. 2024], our formulation is applicable to general-purpose rasterization-based Gaussian splatting. When applied to 3D Gaussian Splatting (3DGS) [Kerbl et al. 2023], our approach can further enhance reconstruction quality by +1 dB. We outperform existing baselines and demonstrate a substantial quality improvement in handling complex scenes observed by egocentric devices. 3. To reduce the effect of blur from rapid head motion in darker indoor scenes, we propose a strategy of deliberately underexposing input videos during capture, inspired by HDR+ [Hasinoff et al. 2016]. We demonstrate that we can reconstruct high-quality, noise-free scene radiance from noisy, dim input videos, and further render sharp, blur-free videos at a higher dynamic range.

MrNeRF

15,244 views • 1 year ago
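
To put the "+1 dB in PSNR" figure from the egocentric reconstruction post above in perspective, the short sketch below converts a PSNR gain into the corresponding change in mean squared error, using the standard definition PSNR = 10·log10(peak² / MSE). The helper names and the example MSE value are assumptions for illustration, not numbers from the paper.

```python
import math


def psnr(mse: float, peak: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(peak ** 2 / mse)


def mse_ratio_for_psnr_gain(gain_db: float) -> float:
    """Factor by which MSE shrinks when PSNR rises by gain_db."""
    return 10.0 ** (-gain_db / 10.0)


ratio = mse_ratio_for_psnr_gain(1.0)   # about 0.794
baseline_mse = 0.01                    # hypothetical baseline error (20 dB)
improved_mse = baseline_mse * ratio
gain = psnr(improved_mse) - psnr(baseline_mse)
```

Under this definition a 1 dB gain corresponds to roughly 20% lower mean squared error, independent of the baseline.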

[SIGGRAPH Asia '24 (TOG)] Representing Long Volumetric Video with Temporal Gaussian Hierarchy Contributions: • We introduce a novel, efficient, and expressive Temporal Gaussian Hierarchy representation for long volumetric video. To our knowledge, our method is the first approach capable of handling minutes of volumetric video data. • We propose a Compact Appearance Model and a new rasterization implementation to facilitate real-time, high-quality dynamic view synthesis while maintaining a compact size. • We propose a system to efficiently model long volumetric videos for the first time and demonstrate state-of-the-art dynamic view synthesis quality on the Neural3DV [Li et al. 2022], ENeRF-Outdoor [Lin et al. 2022], and MobileStage [Xu et al. 2024b] datasets, while also achieving the best rendering speed with reduced training cost and memory usage.

MrNeRF

79,379 views • 1 year ago

EnvGS: Modeling View-Dependent Appearance with Environment Gaussian Contributions: • We propose a novel scene representation for accurately modeling complex near-field and high-frequency reflections in real-world environments. • We developed a real-time ray-tracing renderer for 2DGS, enabling joint optimization of our representation for accurate scene reconstruction while achieving real-time rendering speeds. • Extensive experiments show that EnvGS significantly outperforms previous methods. To the best of our knowledge, EnvGS is the first method to achieve real-time photorealistic specular reflections synthesis in real-world scenes.

MrNeRF

44,650 views • 1 year ago

GPS-Gaussian+: Generalizable Pixel-wise 3D Gaussian Splatting for Real-Time Human-Scene Rendering from Sparse Views TL;DR: Are we witnessing the first steps towards 3DGS live streaming? Contributions: • We introduce a generalizable 3D Gaussian Splatting methodology that employs pixel-wise Gaussian parameter maps defined on 2D source image planes to formulate 3D Gaussians in a feed-forward manner. • We propose a fully differentiable framework composed of an iterative depth estimation module and a Gaussian parameter regression module. The intermediate depth prediction bridges the two components and allows them to benefit from joint training. • We introduce a regularization term and an epipolar attention mechanism to preserve geometry consistency between the two source views when using only rendering loss. Our method generalizes well to unseen characters even in complicated scenes. • We develop a real-time FVV system that achieves high-resolution rendering of characters in the scene without any geometry supervision.

MrNeRF

25,862 views • 1 year ago

📢📢 𝐀𝐯𝐚𝐭𝟑𝐫 📢📢 Avat3r creates high-quality 3D head avatars from just a few input images in a single forward pass with a new dynamic 3DGS reconstruction model. Video: Project: Our core idea is to make Gaussian Reconstruction Models animatable. We find that a simple cross-attention to an expression code sequence is already sufficient to model complex facial expressions. We then incorporate position maps from DUSt3R and feature maps from Sapiens to facilitate the prediction task. While DUSt3R's position maps act as a pixel-aligned initialization for the Gaussians' positions, the Sapiens feature maps help the cross-view transformer to match corresponding image tokens in the 4 input images. One major challenge in creating a 3D head avatar from smartphone images comes from inconsistent facial expressions when the subject could not remain perfectly static during the capture. We eliminate this static requirement by simply showing our model input images with different facial expressions during training. This technique makes our model robust to inconsistent input images later on. Finally, we show that despite the model has been trained with 4 input images, one can even create a 3D head avatar when only a single image is available. To achieve this, we employ a pre-trained 3D GAN to lift the single image to 3D and then render the 4 input images for our model. This allows us to create 3D head avatars from single images and even highly out-of-distribution examples like AI generated faces, paintings or statues. Great work by Tobias Kirschstein from his internship at Meta with Javier Romero, Artem Sevastopolsky, and Shunsuke Saito

Matthias Niessner

74,763 views • 1 year ago

[SIGGRAPH '25] EVA: Expressive Virtual Avatars from Multi-view Videos Contributions: 1. We introduce EVA, a novel method enabling full-body control with real-time, photo-realistic renderings, robustly handling loose clothing dynamics and various facial expressions. 2. We develop an expressive deformable template that generates a deformable human template mesh and employs a multi-stage tracking algorithm to faithfully capture facial expressions, body motions, and non-rigid deformations from multi-view videos. 3. We propose a disentangled 3D Gaussian appearance module that models the body and face independently, ensuring separated control and high-quality renderings.

MrNeRF

18,407 views • 1 year ago

Magic123: One Image to High-Quality 3D Object Generation Using Both 2D and 3D Diffusion Priors paper page: present Magic123, a two-stage coarse-to-fine approach for high-quality, textured 3D mesh generation from a single unposed image in the wild using both 2D and 3D priors. In the first stage, we optimize a neural radiance field to produce a coarse geometry. In the second stage, we adopt a memory-efficient differentiable mesh representation to yield a high-resolution mesh with a visually appealing texture. In both stages, the 3D content is learned through reference view supervision and novel views guided by a combination of 2D and 3D diffusion priors. We introduce a single trade-off parameter between the 2D and 3D priors to control exploration (more imaginative) and exploitation (more precise) of the generated geometry. Additionally, we employ textual inversion and monocular depth regularization to encourage consistent appearances across views and to prevent degenerate solutions, respectively. Magic123 demonstrates a significant improvement over previous image-to-3D techniques, as validated through extensive experiments on synthetic benchmarks and diverse real-world images.

AK

305,663 views • 3 years ago

MeshSplatting: Differentiable Rendering with Opaque Meshes Contributions: (i) An end-to-end optimization of mesh-based scene representations retains visual quality while training 2× faster than current state-of-the-art methods. (ii) Rather than a polygon soup, we generate a connected mesh by refining the vertex locations of a restricted Delaunay triangulation. (iii) Triangles are naturally connected to each other, and quantities stored within vertices are smoothly interpolated across each triangle. (iv) The optimization is aware that the triangles should be opaque, allowing direct high-quality rendering in standard game engines (see Fig. 1), opening the door for classical techniques like the use of depth buffers and occlusion culling [1, 22].

MrNeRF

15,322 views • 7 months ago

4DGT: Learning a 4D Gaussian Transformer Using Real-World Monocular Videos Abstract: We propose 4DGT, a 4D Gaussian-based Transformer model for dynamic scene reconstruction, trained entirely on real-world monocular posed videos. Using 4D Gaussian as an inductive bias, 4DGT unifies static and dynamic components, enabling the modeling of complex, time-varying environments with varying object lifespans. We introduced a novel density control strategy in training, which allows our 4DGT to handle longer space-time input while maintaining efficient rendering at runtime. Our model processes 64 consecutive posed frames in a rolling-window fashion, predicting consistent 4D Gaussians in the scene. Unlike optimization-based methods, 4DGT performs purely feed-forward inference, reducing reconstruction time from hours to seconds and scaling effectively to long video sequences. Trained only on large-scale monocular posed video datasets, 4DGT can significantly outperform prior Gaussian-based networks in real-world videos and achieve on-par accuracy with optimization-based methods on cross-domain videos.

MrNeRF

34,782 views • 1 year ago

[SIGGRAPH '25] Monocular Online Reconstruction with Enhanced Detail Preservation Abstract (excerpt): Our approach addresses two key challenges in monocular online reconstruction: 1. Distributing Gaussians without relying on depth maps. 2. Ensuring both local and global consistency in the reconstructed maps. To achieve this, we introduce two key modules: - Hierarchical Gaussian Management Module: For effective Gaussian distribution. - Global Consistency Optimization Module: For maintaining alignment and coherence at all scales. In addition, we present the Multi-level Occupancy Hash Voxels (MOHV), a structure that regularizes Gaussians to capture details across multiple levels of granularity. MOHV ensures accurate reconstruction of both fine and coarse geometries and textures, preserving intricate details while maintaining overall structural integrity. Compared to state-of-the-art RGB-only and even RGB-D methods, our framework achieves superior reconstruction quality with high computational efficiency.

MrNeRF

23,638 views • 1 year ago

📢 LinPrim: Linear Primitives for Differentiable Volumetric Rendering 📢 We use octahedra or tetrahedra as explicit volumetric building blocks for gradient-based novel view synthesis, as an alternative to 3D Gaussians with discrete, bounded geometry. We show how they can be used to reconstruct photorealistic scenes, and introduce a corresponding differentiable CUDA rasterizer that enables real-time rendering. On real-world scenes, LinPrim achieves comparable image quality with fewer primitives, adding a practical polyhedral option to the 3D scene representation toolbox and expanding the known design space. 🌍 🎥 Amazing work by Nicolas von Lützow!

Matthias Niessner

11,413 views • 1 year ago