URAvatar: Universal Relightable Gaussian Codec Avatars Contributions (cited): (1) We introduce a universal relightable avatar prior model learned from hundreds of dynamic performance captures with a multi-view and multi-light system. (2) We build a drivable head avatar from a phone scan that can be rendered and relit with global... show more

MrNeRF

16,801 subscribers

50,088 views • 1 year ago • via X (Twitter)

Science & Technology

7 Comments

MrNeRF • 1 year ago

Paper: Project:

(Alex) Compositing Academy • 1 year ago

Genuinely believe this is the most underrated tech right now, this will completely change communication & work.

MrNeRF • 1 year ago

I totally agree! Not many know about it outside this bubble here.

Non Believer • 1 year ago

No code released?

Reeva 🇺🇸 • 1 year ago

"Mind-blown by the advancements in avatar technology! The potential for immersive experiences just took a giant leap forward!"

Rigaku ryōhō • 1 year ago

Amazing.

BeastTitanHunter • 1 year ago

@OpenAI @midjourney @HeyGen_Official @hedra_labs

Related Videos

Relightable Full-Body Gaussian Codec Avatars TL;DR: First drivable full-body avatar model that reconstructs perceptually realistic relightable appearance. Contributions: • We propose the first relightable full-body avatar model that jointly models the relightable appearance of the human body, face, and hands for high-fidelity relighting and animation. • To handle full-body articulations with global light transport, we propose learnable zonal harmonics to represent local diffuse radiance transfer in the local coordinate frames of each Gaussian. This results in a reduced number of parameters and improved rendering quality compared to the commonly used spherical harmonics representation. • We reformulate the learnable radiance transfer to explicitly decompose non-local shadowing and propose a dedicated shadow network to predict shadows caused by the articulation of the body. Additionally, we propose a physically based irradiance normalization scheme to ensure that the shadow network can generalize to novel illumination conditions, such as unseen environment maps. • We show that deferred shading can be used for our learned specular radiance transfer function, achieving high-fidelity specular reflections for relightable human avatar modeling without excessively increasing the number of Gaussians.

MrNeRF

10,966 views • 1 year ago
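
The parameter saving mentioned in the zonal harmonics contribution above can be illustrated with a coefficient count. This is a minimal sketch of the general math only: a spherical harmonics expansion up to degree L has (L + 1)^2 coefficients, while a zonal expansion around one fixed axis keeps only the m = 0 term per degree, so L + 1. The function names are hypothetical, and the paper's learnable variant may parameterize things differently (for example, with additional learned axes).

```python
def sh_coefficient_count(max_degree: int) -> int:
    """Coefficients of a full spherical harmonics expansion up to max_degree."""
    return (max_degree + 1) ** 2


def zh_coefficient_count(max_degree: int) -> int:
    """Coefficients of a zonal expansion (m = 0 only) around one fixed axis."""
    return max_degree + 1


# Compare the two counts for a few expansion degrees (per colour channel).
counts = {
    degree: (sh_coefficient_count(degree), zh_coefficient_count(degree))
    for degree in range(1, 5)
}
for degree, (sh, zh) in counts.items():
    print(f"degree {degree}: SH={sh}, ZH={zh}")
```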

📢BecomingLit: Relightable Gaussian Avatars with Hybrid Neural Shading We propose a hybrid neural shading scheme for creating intrinsically decomposed 3DGS head avatars, that allow real-time relighting and animation. 🌍 📷

Matthias Niessner

23,027 views • 1 year ago

[SIGGRAPH '25] EVA: Expressive Virtual Avatars from Multi-view Videos Contributions: 1. We introduce EVA, a novel method enabling full-body control with real-time, photo-realistic renderings, robustly handling loose clothing dynamics and various facial expressions. 2. We develop an expressive deformable template that generates a deformable human template mesh and employs a multi-stage tracking algorithm to faithfully capture facial expressions, body motions, and non-rigid deformations from multi-view videos. 3. We propose a disentangled 3D Gaussian appearance module that models the body and face independently, ensuring separated control and high-quality renderings.

MrNeRF

18,407 views • 1 year ago

Want to create an avatar from a single image? FlexAvatar is a transformer model that creates a full 360°, high-quality, expressive 3D head avatar from just a single portrait image in minutes. Real-time Demo: FlexAvatar's lightweight architecture allows both animation and rendering in real time, enabling interactive user experiences. To create a new 3D head avatar, only one image is required, e.g., from a webcam. The final avatar is ready after 2 minutes. Architecture: Under the hood, FlexAvatar adopts a transformer-based encoder-decoder design. The encoder maps the input image onto a latent avatar space, while the decoder produces 3D Gaussian attribute maps by incorporating the animation signal via cross-attention. The model learns all facial animations directly from the data without relying on pre-built 3D face models. This equips the avatars with realistic facial expressions. The internal avatar latent space can be conveniently used to integrate additional observations of a person via fitting. This enables use cases where more than one image of a person is available, e.g., from a phone scan of the person. We train jointly on 2D monocular videos and multi-view data. However, in monocular videos, the animation signal leaks the target viewpoint, causing the model to produce incomplete 3D heads. We call this phenomenon entanglement of driving signal and target viewpoint. To prevent entanglement, we introduce bias sinks. These are learnable tokens that indicate whether a training sample stems from a monocular or a multi-view dataset. During training, the model learns to produce incomplete 3D heads only when the monocular token is present. During inference, FlexAvatar then always uses the multi-view token, for which the model has learned to produce complete 3D heads. This simple design makes it possible to combine the generalizability of monocular data with the quality of multi-view data.
FlexAvatar summary: - Input: Single-image, phone scan, or monocular video - Output: Full 360° head avatar - Expressive animations - Real-time rendering and animation - Generalization to any portrait - Create a new avatar in 2 minutes - Use bias sinks to combine 2D and 3D data 🏠 🌍 🎥 Great work by Tobias Kirschstein and Simon Giebenhain!

Matthias Niessner

95,991 views • 7 months ago
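
The bias-sink idea in the FlexAvatar post above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: the two tokens are random arrays standing in for learnable parameters, and prepending the token to the input sequence is an assumed way of feeding it to the model. All names (`bias_sink_tokens`, `add_bias_sink`, `training_input`, `inference_input`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8
MONOCULAR, MULTI_VIEW = 0, 1

# Stand-ins for two learnable tokens, one per data source (assumption:
# in a real model these would be trained parameters).
bias_sink_tokens = rng.normal(size=(2, DIM))


def add_bias_sink(tokens: np.ndarray, source: int) -> np.ndarray:
    """Prepend the token for the given data source to a (T, DIM) sequence."""
    return np.concatenate([bias_sink_tokens[source][None, :], tokens], axis=0)


def training_input(tokens: np.ndarray, from_monocular: bool) -> np.ndarray:
    """During training, the token reflects where the sample came from."""
    source = MONOCULAR if from_monocular else MULTI_VIEW
    return add_bias_sink(tokens, source)


def inference_input(tokens: np.ndarray) -> np.ndarray:
    """At inference, always use the multi-view token."""
    return add_bias_sink(tokens, MULTI_VIEW)


tokens = rng.normal(size=(4, DIM))
train_seq = training_input(tokens, from_monocular=True)
infer_seq = inference_input(tokens)
print(train_seq.shape, infer_seq.shape)
```

The point of the design is that the dataset-specific bias (incomplete heads from monocular data) gets attached to one token, which is then simply never used at inference.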

Gaussian Garments: Reconstructing Simulation-Ready Clothing with Photorealistic Appearance from Multi-View Video Contribution quote from the paper: In summary, our main contributions are • a comprehensive pipeline for reconstructing the shape, appearance, and behavior of real-world garments using Gaussian splatting, • an algorithm for registering garment meshes to multi-view videos with an optimization procedure based on Gaussian splatting, and • a Gaussian Garment representation that combines triangle meshes with Gaussian textures to capture photorealistic appearance and can be used as a fully controllable 3D asset.

MrNeRF

27,277 views • 1 year ago

GS^3: Efficient Relighting with Triple Gaussian Splatting Abstract: We present a spatial and angular Gaussian based representation and a triple splatting process, for real-time, high-quality novel lighting-and-view synthesis from multi-view point-lit input images. To describe complex appearance, we employ a Lambertian plus a mixture of angular Gaussians as an effective reflectance function for each spatial Gaussian. To generate self-shadow, we splat all spatial Gaussians towards the light source to obtain shadow values, which are further refined by a small multi-layer perceptron. To compensate for other effects like global illumination, another network is trained to compute and add a per-spatial-Gaussian RGB tuple. The effectiveness of our representation is demonstrated on 30 samples with a wide variation in geometry (from solid to fluffy) and appearance (from translucent to anisotropic), as well as using different forms of input data, including rendered images of synthetic/reconstructed objects, photographs captured with a handheld camera and a flash, or from a professional lightstage. We achieve a training time of 40-70 minutes and a rendering speed of 90 fps on a single commodity GPU. Our results compare favorably with state-of-the-art techniques in terms of quality/performance.

MrNeRF

17,786 views • 1 year ago

MVDream: Multi-view Diffusion for 3D Generation paper page: We propose MVDream, a multi-view diffusion model that is able to generate geometrically consistent multi-view images from a given text prompt. By leveraging image diffusion models pre-trained on large-scale web datasets and a multi-view dataset rendered from 3D assets, the resulting multi-view diffusion model can achieve both the generalizability of 2D diffusion and the consistency of 3D data. Such a model can thus be applied as a multi-view prior for 3D generation via Score Distillation Sampling, where it greatly improves the stability of existing 2D-lifting methods by solving the 3D consistency problem. Finally, we show that the multi-view diffusion model can also be fine-tuned under a few-shot setting for personalized 3D generation, i.e. DreamBooth3D application, where the consistency can be maintained after learning the subject identity.

AK

294,442 views • 2 years ago

SqueezeMe: Efficient Gaussian Avatars for VR TL;DR: Three of these Gaussian Splatting avatars can be run at 72 frames per second. It runs locally on a Meta Quest 3 VR headset. Abstract (excerpt): While previous methods require a desktop GPU for real-time inference of a single avatar, we aim to squeeze multiple Gaussian avatars onto a portable virtual reality headset with real-time drivable inference. We begin by training a previous work, Animatable Gaussians, on a high-quality dataset captured with 512 cameras. The Gaussians are animated by controlling a base set of Gaussians with linear blend skinning (LBS) motion, and then further adjusting them with a neural network decoder to correct their appearance. When deploying the model on a Meta Quest 3 VR headset, we find two major computational bottlenecks: the decoder and the rendering. To accelerate the decoder, we train the Gaussians in UV-space instead of pixel-space and distill the decoder to a single neural network layer. Further, we discover that neighborhoods of Gaussians can share a single corrective from the decoder, providing an additional speedup. To accelerate the rendering, we develop a custom pipeline in Vulkan that runs on the mobile GPU. Putting it all together, we run 3 Gaussian avatars concurrently at 72 FPS on a VR headset.

MrNeRF

27,104 views • 1 year ago
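
The linear blend skinning (LBS) step named in the SqueezeMe abstract above can be sketched with NumPy. This is a generic textbook formulation, assuming per-point skinning weights that sum to 1 and one 4x4 rigid transform per joint; it applies LBS to Gaussian centres only and leaves out the rotation of each Gaussian and the neural corrective decoder. The function name and the toy data are hypothetical.

```python
import numpy as np


def linear_blend_skinning(points, weights, transforms):
    """Move points by a weighted blend of per-joint transforms.

    points:     (N, 3) rest-pose positions, e.g. Gaussian centres
    weights:    (N, J) skinning weights, each row summing to 1
    transforms: (J, 4, 4) homogeneous transform for each joint
    """
    ones = np.ones((points.shape[0], 1))
    homogeneous = np.concatenate([points, ones], axis=1)          # (N, 4)
    # Blend the joint transforms separately for every point: (N, 4, 4).
    blended = np.einsum("nj,jab->nab", weights, transforms)
    moved = np.einsum("nab,nb->na", blended, homogeneous)         # (N, 4)
    return moved[:, :3]


# Toy example: joint 0 stays put, joint 1 translates by 2 units along y.
identity = np.eye(4)
shift = np.eye(4)
shift[:3, 3] = [0.0, 2.0, 0.0]
transforms = np.stack([identity, shift])

points = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
weights = np.array([[1.0, 0.0],    # fully bound to joint 0
                    [0.5, 0.5]])   # shared equally between both joints

skinned = linear_blend_skinning(points, weights, transforms)
print(skinned)
```

The second point is bound half to each joint, so it moves by half of joint 1's translation.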

MaterialFusion: Enhancing Inverse Rendering with Material Diffusion Priors discuss: Recent works in inverse rendering have shown promise in using multi-view images of an object to recover shape, albedo, and materials. However, the recovered components often fail to render accurately under new lighting conditions due to the intrinsic challenge of disentangling albedo and material properties from input images. To address this challenge, we introduce MaterialFusion, an enhanced conventional 3D inverse rendering pipeline that incorporates a 2D prior on texture and material properties. We present StableMaterial, a 2D diffusion model prior that refines multi-lit data to estimate the most likely albedo and material from given input appearances. This model is trained on albedo, material, and relit image data derived from a curated dataset of approximately 12K artist-designed synthetic Blender objects called BlenderVault. We incorporate this diffusion prior with an inverse rendering framework where we use score distillation sampling (SDS) to guide the optimization of the albedo and materials, improving relighting performance in comparison with previous work. We validate MaterialFusion's relighting performance on 4 datasets of synthetic and real objects under diverse illumination conditions, showing our diffusion-aided approach significantly improves the appearance of reconstructed objects under novel lighting conditions. We intend to publicly release our BlenderVault dataset to support further research in this field.

AK

22,959 views • 1 year ago

📢New research from our group “Personalized Video Relighting with an At-Home Light Stage” (1/3) We show how to leverage screen lighting as an 'at-home Light Stage' and develop a personalized relighting model. We can now replace your background and relight your faces to match it!

Roni Sengupta

25,221 views • 2 years ago

🇨🇳 Another great Chinese Model, OmniHuman-1.5 from ByteDance. It turns 1 image plus a voice track into expressive avatar video by pairing a System 1 and System 2 inspired planner with a Diffusion Transformer, and produces coherent motion for over 1 minute with moving camera and multi-character scenes. Most avatar models move to the beat of the audio but miss meaning, so gestures feel generic and emotions feel shallow. The fix here is a Multimodal LLM planner that listens to the speech and drafts a structured plan describing intent, emotions, beats, and high-level actions, which gives the motion engine clear semantic targets instead of only rhythm. The motion engine is a Multimodal Diffusion Transformer that fuses the plan with audio, the single reference image, and optional text prompts, then synthesizes continuous body, face, and head motion that matches both words and tone. A key trick is a Pseudo Last Frame, a synthetic target that summarizes the next expected state, which stabilizes fusion across modalities and keeps motion consistent over long spans. From just 1 image and speech, the system outputs speaking avatars with synchronized lips, context-aware gestures, and continuous camera movement, and it also supports multi-character interactions without manual choreography. Reported results show strong lip sync accuracy, high video quality, natural motion, and close match to text prompts, and the same setup works on nonhuman characters too.

Rohan Paul

63,859 views • 11 months ago

This seemingly obvious prediction didn't take long to become reality. MASt3R-SLAM: Real-Time Dense SLAM with 3D Reconstruction Priors Contributions: • The first real-time SLAM system using the two-view 3D reconstruction prior MASt3R [20] as a foundation. • Efficient techniques for pointmap matching, tracking and local fusion, graph construction and loop closure, and second-order global optimization. • A state-of-the-art dense SLAM system capable of handling generic, time-varying camera models. Abstract: We present a real-time monocular dense SLAM system, designed from the ground up using MASt3R, a two-view 3D reconstruction and matching prior. Equipped with this strong prior, our system remains robust on in-the-wild video sequences, making no assumptions on a fixed or parametric camera model beyond a unique camera center. Key features include: - Efficient methods for pointmap matching, camera tracking, and local fusion - Graph construction and loop closure - Second-order global optimization With known calibration, a simple modification achieves state-of-the-art performance across various benchmarks. Altogether, we propose a plug-and-play monocular SLAM system capable of producing globally-consistent poses and dense geometry while operating at 15 FPS.

MrNeRF

29,961 views • 1 year ago

[SIGGRAPH Asia '24 (TOG)] Representing Long Volumetric Video with Temporal Gaussian Hierarchy Contributions: • We introduce a novel, efficient, and expressive Temporal Gaussian Hierarchy representation for long volumetric video. To our knowledge, our method is the first approach capable of handling minutes of volumetric video data. • We propose a Compact Appearance Model and a new rasterization implementation to facilitate real-time, high-quality dynamic view synthesis while maintaining a compact size. • We propose a system to efficiently model long volumetric videos for the first time and demonstrate state-of-the-art dynamic view synthesis quality on the Neural3DV [Li et al. 2022], ENeRF-Outdoor [Lin et al. 2022], and MobileStage [Xu et al. 2024b] datasets, while also achieving the best rendering speed with reduced training cost and memory usage.

MrNeRF

79,379 views • 1 year ago

We Neo Sigma Ritvik Kapila are building the future of self-improving AI systems! By closing the feedback loop between production data and system improvements, we help teams capture failures, convert them into structured evaluation signals, and use them to drive continuous improvements in agent behavior. We show how our system works on Tau3 bench across retail, telecom, and airline domains. Agent performance on the validation set (with a fixed underlying model, GPT5.4) improves from 0.56 → 0.78 (~40% jump in accuracy).

Gauri Gupta

95,481 views • 4 months ago
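
The "~40% jump" quoted in the post above is a relative gain, not an absolute one. A quick check of the arithmetic, using only the two scores stated in the post (the helper name is hypothetical):

```python
def relative_improvement(before: float, after: float) -> float:
    """Fractional change of `after` with respect to `before`."""
    return (after - before) / before


absolute_gain = 0.78 - 0.56                        # difference in raw score
relative_gain = relative_improvement(0.56, 0.78)   # change relative to baseline

print(f"absolute: {absolute_gain:.2f}, relative: {relative_gain:.1%}")
```

The absolute gain is 0.22 (22 points), while the relative gain over the 0.56 baseline is about 39.3%, which matches the rounded figure in the post.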

Drivable 3D Gaussian Avatars paper page: We present Drivable 3D Gaussian Avatars (D3GA), the first 3D controllable model for human bodies rendered with Gaussian splats. Current photorealistic drivable avatars require either accurate 3D registrations during training, dense input images during testing, or both. The ones based on neural radiance fields also tend to be prohibitively slow for telepresence applications. This work uses the recently presented 3D Gaussian Splatting (3DGS) technique to render realistic humans at real-time framerates, using dense calibrated multi-view videos as input. To deform those primitives, we depart from the commonly used point deformation method of linear blend skinning (LBS) and use a classic volumetric deformation method: cage deformations. Given their smaller size, we drive these deformations with joint angles and keypoints, which are more suitable for communication applications. Our experiments on nine subjects with varied body shapes, clothes, and motions obtain higher-quality results than state-of-the-art methods when using the same training and test data.

AK

327,105 views • 2 years ago

GaussianSpeech: Audio-Driven Gaussian Avatars Contributions: • The first transformer-based sequence model for audio-driven head animation synthesis of a lightweight 3DGS based avatar. By animating our optimized 3DGS avatar directly with our transformer model, we achieve temporally coherent animation sequences while characterizing fine-scale face details and speaker-specific style. • A new high-quality audio-video dataset, comprising high-resolution 16-view dataset of 6 native English speakers (Standard American & British). The dataset has a total of 2500 sequences, with overall recordings of ∼3.5 hours.

MrNeRF

11,371 views • 1 year ago

Say hello 👋 to our Instant Avatar (Avatar 2.0) - the most advanced avatar technology in the market. Users can now create their own custom avatars in under 5 minutes ⏰ with just a phone or laptop. Oh, and did we mention it was FREE?! 🎉 Get started:

HeyGen

72,555 views • 2 years ago

Pensions minister says there's no prospect of a return to universal winter fuel payments Torsten Bell told MPs - 'It’s not a good idea that we have a system paying a few hundreds of pounds to millionaires, and so we’re not going to be continuing with that'

Peter Stefanovic

38,141 views • 1 year ago

MAGS-SLAM: Monocular Multi-Agent Gaussian Splatting SLAM for Geometrically and Photometrically Consistent Reconstruction TL;DR: The first RGB-only multi-agent 3D Gaussian Splatting SLAM for collaborative photorealistic scene reconstruction. Contributions: (1) We propose the first monocular RGB-only multi-agent 3D Gaussian Splatting SLAM system. It integrates Gaussian front-ends, compact submap summaries, inter-agent verification, Sim(3) submap pose graph, and occupancy-aware fusion into a unified framework, achieving accurate tracking and photorealistic reconstruction without depth sensors. (2) We propose a Pose-Graph Bundle Adjustment (PGBA)-consistent Sim(3) loop closure mechanism for multi-agent systems, which jointly resolves intra- and inter-agent scale drift through a submap-level Sim(3) pose graph coupling geometric and photometric residuals. Robustness is ensured by a spatial-extent gate that rejects degenerate loops and an adaptive edge invalidation scheme consistent with evolving PGBA corrections. (3) We propose an occupancy-aware fusion framework for coherent multi-agent Gaussian maps. It combines occupancy-grid deduplication, decoupled coordinator, and joint pose-Gaussian photometric refinement to eliminate duplicated Gaussians, residual misalignment, and photometric seams across agents. (4) We introduce ReplicaMultiagent Plus dataset. While existing multi-agent datasets are typically limited to 2-3 agents with short trajectories, our dataset scales to 4 agents with long-horizon trajectories. In addition, we provide ground-truth geometry and semantic annotations, supporting the evaluation of monocular, RGB-D, and semantic multi-agent SLAM for collaborative dense reconstruction.

MrNeRF

19,357 views • 2 months ago