VGGT: Visual Geometry Grounded Transformer TL;DR: Is DUSt3R facing a formidable new rival? Contributions: (1) We introduce VGGT, a large feed-forward transformer that can, given one, a few, or even hundreds of images of a scene, predict all its key 3D attributes - including camera intrinsics and extrinsics, point maps, depth maps, and 3D point tracks - in seconds. (2) We demonstrate that VGGT’s predictions are directly usable, being highly competitive and usually better than those of state-of-the-art methods that use slow post-processing optimization techniques. (3) We also show that when further combined with BA post-processing, VGGT achieves state-of-the-art results across the board, even when compared to methods that specialize in a subset of 3D tasks, often improving quality substantially.

MrNeRF

14,174 subscribers

29,461 views • 1 year ago • via X (Twitter)

Education, Science & Technology, Art
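The post lists camera intrinsics and extrinsics, depth maps, and point maps as separate outputs. As a minimal sketch of how these quantities relate geometrically (this is not VGGT's code), a depth map can be lifted to a world-space point map with a pinhole camera model. The function name `unproject_depth`, the parameter names, and the world-to-camera convention `x_cam = R @ x_world + t` are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def unproject_depth(
    depth: Sequence[Sequence[float]],
    fx: float, fy: float, cx: float, cy: float,
    R: Sequence[Sequence[float]],
    t: Sequence[float],
) -> List[List[Vec3]]:
    """Lift a depth map to a world-space point map (hypothetical helper).

    Assumed convention: x_cam = R @ x_world + t, so x_world = R^T @ (x_cam - t).
    Depth is the z coordinate along the camera's optical axis.
    """
    point_map: List[List[Vec3]] = []
    for v, row in enumerate(depth):
        out_row: List[Vec3] = []
        for u, d in enumerate(row):
            # Back-project pixel (u, v) into camera coordinates.
            x_cam = ((u - cx) * d / fx, (v - cy) * d / fy, d)
            # Undo the translation, then rotate back with R^T.
            diff = [x_cam[i] - t[i] for i in range(3)]
            x_world = tuple(
                sum(R[k][i] * diff[k] for k in range(3)) for i in range(3)
            )
            out_row.append(x_world)
        point_map.append(out_row)
    return point_map


# Usage: a 2x2 depth map seen by a camera at the world origin.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
points = unproject_depth([[2.0, 2.0], [2.0, 2.0]], 1.0, 1.0, 0.0, 0.0,
                         identity, (0.0, 0.0, 0.0))
```

Under this convention a point map is fully determined by depth plus camera parameters, which is one way to cross-check the separately predicted outputs against each other.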

Comments: 12

MrNeRF • 1 year ago

Paper (pdf): Code:

MrNeRF • 1 year ago

Thanks for bringing this paper to my attention!

MrNeRF • 1 year ago

I'm crafting an email newsletter that turns my daily updates into a captivating weekly digest, complete with exclusive content. Although it's not live yet, you can sign up now! If you're curious, visit my website and join the subscriber list today!

MrNeRF • 1 year ago

Original author's post:

Pablo Vela • 1 year ago

Gah looks so cool, still not MIT/Apache 😭😭

MrNeRF • 1 year ago

Yeah, but it is nice to see someone breaking into this monopoly which is good!

Abdullah Hamdi • 1 year ago

Our VGG group

Jianyuan Wang • 1 year ago

Thanks for sharing! We released it in a silent mode for a while but was quickly caught lol

MrNeRF • 1 year ago

The silence is over :D. Awesome paper, thank you!

Sir Mr Meow Meow • 1 year ago

interesting

MrNeRF • 1 year ago

Yes, quite impressive!

Related videos

Introducing VGGT (CVPR'25), a feedforward Transformer that directly infers all key 3D attributes from one, a few, or hundreds of images, in seconds! No expensive optimization needed, yet delivers SOTA results for: ✅ Camera Pose Estimation ✅ Multi-view Depth Estimation ✅ Dense Point Cloud Reconstruction ✅ Point Tracking Project Page: Code & Weights:

Jianyuan

203,226 views • 1 year ago

Want high-quality 3D meshes with sharp geometric details? Try our newly released MeshFormer! It only takes 8 GPUs for two days of training, outperforming state-of-the-art models that use over a hundred GPUs! With 3D-native input guidance, representations, supervision, and post-processing, we significantly improve the training efficiency and geometric quality of feed-forward reconstruction models! Project page: Chong Zeng

Minghua Liu

16,340 views • 1 year ago

Meet MapAnything – a transformer that directly regresses factored metric 3D scene geometry (from images, calibration, poses, or depth) in an end-to-end way. No pipelines, no extra stages. Just 3D geometry & cameras, straight from any type of input, delivering new state-of-the-art results 🚀 One universal model enables SoTA for: 🔥 Mono Depth Estimation 🔥 Multi-View SfM 🔥 Multi-View Stereo 🔥 Depth Completion 🔥 Registration … and many more possibilities! – plus everything is metric 🎯 We release code for data processing, training, benchmarking & ablations – everything Apache 2.0! Details & Links 👇

Nikhil Keetha

122,891 views • 10 months ago

3D-LLM: Injecting the 3D World into Large Language Models paper page: Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning. Powerful as these models can be, they are not grounded in the 3D physical world, which involves richer concepts such as spatial relationships, affordances, physics, layout, and so on. In this work, we propose to inject the 3D world into large language models and introduce a whole new family of 3D-LLMs. Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks, including captioning, dense captioning, 3D question answering, task decomposition, 3D grounding, 3D-assisted dialog, navigation, and so on. Using three types of prompting mechanisms that we design, we are able to collect over 300k 3D-language data covering these tasks. To efficiently train 3D-LLMs, we first utilize a 3D feature extractor that obtains 3D features from rendered multi-view images. Then, we use 2D VLMs as our backbones to train our 3D-LLMs. By introducing a 3D localization mechanism, 3D-LLMs can better capture 3D spatial information. Experiments on ScanQA show that our model outperforms state-of-the-art baselines by a large margin (e.g., the BLEU-1 score surpasses state-of-the-art score by 9%). Furthermore, experiments on our held-in datasets for 3D captioning, task composition, and 3D-assisted dialogue show that our model outperforms 2D VLMs. Qualitative examples also show that our model could perform more tasks beyond the scope of existing LLMs and VLMs.

AK

249,708 views • 3 years ago

Excited to share this demo from Over the Reality! Watch our Unitree Go2 robodog navigating our office while reconstructing the 3D space in real-time using the VGGT-based foundation vision model. This is a prime example of machine perception in action, turning raw RGB camera feeds into rich, detailed 3D maps! The robodog's RGB cam generates a dense, textured 3D reconstruction via VGGT from a few photograms, capturing nuances like object shapes and surfaces with impressive fidelity (main view). Compare that to the standard LiDAR system (top right), it's sparser, more point-cloud focused, lacking the visual richness. Vision models are closing the gap fast! What's powering this? VGGT, a cutting-edge foundation model for 3D perception, trained on datasets orders of magnitude smaller than our massive OVER 3D maps dataset. Imagine the leap when we apply OVER's scale to VGGT-like transformer based architectures, denser reconstructions, better generalization, revolutionary for robotics, machine perception & AR! Stay tuned for more breakthroughs at the intersection of AI, robotics, and DePIN. We're building the future of Physical AI and Spatial Computing at Over the Reality 🌐 What do you think, ready for robodogs in your world? Drop your thoughts! 🤖🌐

Over the Reality 🌐

359,717 views • 10 months ago

GPS-Gaussian+: Generalizable Pixel-wise 3D Gaussian Splatting for Real-Time Human-Scene Rendering from Sparse Views TL;DR: Are we witnessing the first steps towards 3DGS live streaming? Contributions: • We introduce a generalizable 3D Gaussian Splatting methodology that employs pixel-wise Gaussian parameter maps defined on 2D source image planes to formulate 3D Gaussians in a feed-forward manner. • We propose a fully differentiable framework composed of an iterative depth estimation module and a Gaussian parameter regression module. The intermediate depth prediction bridges the two components and allows them to benefit from joint training. • We introduce a regularization term and an epipolar attention mechanism to preserve geometry consistency between the two source views when using only rendering loss. Our method generalizes well to unseen characters even in complicated scenes. • We develop a real-time FVV system that achieves high-resolution rendering of characters in the scene without any geometry supervision.

MrNeRF

25,862 views • 1 year ago

Differentiable Blocks World: Qualitative 3D Decomposition by Rendering Primitives paper page: Given a set of calibrated images of a scene, we present an approach that produces a simple, compact, and actionable 3D world representation by means of 3D primitives. While many approaches focus on recovering high-fidelity 3D scenes, we focus on parsing a scene into mid-level 3D representations made of a small set of textured primitives. Such representations are interpretable, easy to manipulate and suited for physics-based simulations. Moreover, unlike existing primitive decomposition methods that rely on 3D input data, our approach operates directly on images through differentiable rendering. Specifically, we model primitives as textured superquadric meshes and optimize their parameters from scratch with an image rendering loss. We highlight the importance of modeling transparency for each primitive, which is critical for optimization and also enables handling varying numbers of primitives. We show that the resulting textured primitives faithfully reconstruct the input images and accurately model the visible 3D points, while providing amodal shape completions of unseen object regions. We compare our approach to the state of the art on diverse scenes from DTU, and demonstrate its robustness on real-life captures from BlendedMVS and Nerfstudio. We also showcase how our results can be used to effortlessly edit a scene or perform physical simulations.

AK

38,571 views • 3 years ago

LLaVA-3D A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness Recent advancements in Large Multimodal Models (LMMs) have greatly enhanced their proficiency in 2D visual understanding tasks, enabling them to effectively process and understand images and videos. However, the development of LMMs with 3D-awareness for 3D scene understanding has been hindered by the lack of large-scale 3D vision-language datasets and powerful 3D encoders. In this paper, we introduce a simple yet effective framework called LLaVA-3D. Leveraging the strong 2D understanding priors from LLaVA, our LLaVA-3D efficiently adapts LLaVA for 3D scene understanding without compromising 2D understanding capabilities. To achieve this, we employ a simple yet effective representation, 3D Patch, which connects 2D CLIP patch features with their corresponding positions in 3D space. By integrating the 3D Patches into 2D LMMs and employing joint 2D and 3D vision-language instruction tuning, we establish a unified architecture for both 2D image understanding and 3D scene understanding. Experimental results show that LLaVA-3D converges 3.5x faster than existing 3D LMMs when trained on 3D vision-language datasets. Moreover, LLaVA-3D not only achieves state-of-the-art performance across various 3D tasks but also maintains comparable 2D image understanding and vision-language conversation capabilities with LLaVA.

AK

41,736 views • 1 year ago

MeshSplatting: Differentiable Rendering with Opaque Meshes Contributions: (i) An end-to-end optimization of mesh-based scene representations retains visual quality while training 2× faster than current state-of-the-art methods. (ii) Rather than a polygon soup, we generate a connected mesh by refining the vertex locations of a restricted Delaunay triangulation. (iii) Triangles are naturally connected to each other, and quantities stored within vertices are smoothly interpolated across each triangle. (iv) The optimization is aware that the triangles should be opaque, allowing direct high-quality rendering in standard game engines (see Fig. 1), opening the door for classical techniques like the use of depth buffers and occlusion culling [1, 22].

MrNeRF

15,322 views • 7 months ago

Big thanks to AK for highlighting our work! LEO marks our pioneering step towards building an embodied generalist agent that can really comprehend the 3D world! 🚀 Leveraging LLMs, we train LEO with real and synthetic 3D data across a diverse spectrum of tasks. It's thrilling to see LEO surpass current state-of-the-art (SOTA) methods in most benchmarked tasks, all under a single, unified model. 🔥 #Generalist_Agent

Siyuan Huang

22,710 views • 2 years ago

🍺 LagerNVS (CVPR 2026) 🍺 LagerNVS is a generalizable, feed-forward, real-time Novel View Synthesis network which - performs rendering in real time, - generalizes to in-the-wild data, - works with and without known source cameras, - sets a new state-of-the-art among deterministic methods, - can be paired with a diffusion decoder for generative extrapolation. LagerNVS shows that 3D biases are useful for Novel View Synthesis but explicit 3D representations are not required to achieve them. We use 3D biases in (1) architecture design and (2) pre-training: (1) In NVS with explicit 3D representations (3DGS, NeRF) reconstruction is typically difficult and slow, but rendering is much faster and simpler. We mimic this process in the network design: we use a large (1B params) encoder and a small, lightweight decoder (ViT-B). This allows increasing the network capacity while still achieving real-time rendering. (2) The encoder, initialized from VGGT, was pre-trained with 3D reconstruction objectives, making the initial features 3D aware. Both substantially improve performance. Project page: Code: Paper: Models: Work done with Jianyuan Minghao Chen Christian Rupprecht and Andrea Vedaldi

Stan Szymanowicz

31,651 views • 4 months ago

DreamCraft3D: Hierarchical 3D Generation with Bootstrapped Diffusion Prior paper page: We present DreamCraft3D, a hierarchical 3D content generation method that produces high-fidelity and coherent 3D objects. We tackle the problem by leveraging a 2D reference image to guide the stages of geometry sculpting and texture boosting. A central focus of this work is to address the consistency issue that existing works encounter. To sculpt geometries that render coherently, we perform score distillation sampling via a view-dependent diffusion model. This 3D prior, alongside several training strategies, prioritizes the geometry consistency but compromises the texture fidelity. We further propose Bootstrapped Score Distillation to specifically boost the texture. We train a personalized diffusion model, Dreambooth, on the augmented renderings of the scene, imbuing it with 3D knowledge of the scene being optimized. The score distillation from this 3D-aware diffusion prior provides view-consistent guidance for the scene. Notably, through an alternating optimization of the diffusion prior and 3D scene representation, we achieve mutually reinforcing improvements: the optimized 3D scene aids in training the scene-specific diffusion model, which offers increasingly view-consistent guidance for 3D optimization. The optimization is thus bootstrapped and leads to substantial texture boosting. With tailored 3D priors throughout the hierarchical generation, DreamCraft3D generates coherent 3D objects with photorealistic renderings, advancing the state-of-the-art in 3D content generation.

AK

161,530 views • 2 years ago

📣 New research from GenAI at Meta, introducing Meta 3D Gen: A new system for end-to-end generation of 3D assets from text in <1min. Meta 3D Gen is a new combined AI system that can generate high-quality 3D assets, with both high-resolution textures and material maps end-to-end, producing results that are superior to existing solutions — at 3-10x the speed of existing work in this space. Details in the technical report ➡️

AI at Meta

408,809 views • 2 years ago

Spatial reconstruction is a long-context problem: real scenes come with hundreds of images. But O(N²) transformer-based models don’t scale efficiently. Introducing: 🤐ZipMap (CVPR ’26): Linear-Time, Stateful 3D Reconstruction via Test-Time Training (TTT). ZipMap “zips” a large image collection into an implicit TTT scene state in a single linear-time operation. The state will then be decoded into spatial outputs, and can be queried efficiently for novel-view geometry and appearance (~100 FPS) ZipMap is not only much faster (>20× faster than VGGT), but also matches or surpasses the accuracy of all SOTA models.

Haian Jin

78,899 views • 4 months ago
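To make the O(N²) remark in the ZipMap post concrete, here is a rough back-of-the-envelope sketch. It only counts token interactions for a model whose global self-attention spans every token of every image, versus a single linear pass over the same tokens. The function names and the figure of 1,000 tokens per image are hypothetical, and the counts are not measurements of VGGT or ZipMap.

```python
def attention_pairs(num_images: int, tokens_per_image: int) -> int:
    """Token-pair interactions when every token attends to every token."""
    total_tokens = num_images * tokens_per_image
    return total_tokens * total_tokens


def linear_steps(num_images: int, tokens_per_image: int) -> int:
    """Token visits for a single pass that touches each token once."""
    return num_images * tokens_per_image


# Hypothetical setting: 1,000 tokens per image (an assumed number).
for n in (10, 100, 1000):
    ratio = attention_pairs(n, 1000) // linear_steps(n, 1000)
    print(f"{n} images: quadratic/linear work ratio = {ratio}")
```

Doubling the number of images quadruples the pairwise count but only doubles the linear one, which is the scaling gap the post refers to.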

Large-scale 3D Scene Generation (all scenes are real-time rendered)!! Physically-grounded generative data without hallucinations is the missing link for robot learning and testing at scale. We introduce a method that directly generates large-scale 3D driving scenes with accurate geometry, allowing for causal view synthesis and generation with object permanence and explicit 3D geometry. This also allows for extreme trajectory extrapolation without failure! We also show that we can build fully data-driven simulators for end-to-end learning with this approach. Project: with the amazing team of Julian Ost, Amogh Joshi , Andrea Ramazzina, Maximilian Bömer, Mario Bijelic.

Felix Heide

27,779 views • 10 months ago

📢📢 𝐀𝐯𝐚𝐭𝟑𝐫 📢📢 Avat3r creates high-quality 3D head avatars from just a few input images in a single forward pass with a new dynamic 3DGS reconstruction model. Video: Project: Our core idea is to make Gaussian Reconstruction Models animatable. We find that a simple cross-attention to an expression code sequence is already sufficient to model complex facial expressions. We then incorporate position maps from DUSt3R and feature maps from Sapiens to facilitate the prediction task. While DUSt3R's position maps act as a pixel-aligned initialization for the Gaussians' positions, the Sapiens feature maps help the cross-view transformer to match corresponding image tokens in the 4 input images. One major challenge in creating a 3D head avatar from smartphone images comes from inconsistent facial expressions when the subject cannot remain perfectly static during the capture. We eliminate this static requirement by simply showing our model input images with different facial expressions during training. This technique makes our model robust to inconsistent input images later on. Finally, we show that although the model has been trained with 4 input images, one can even create a 3D head avatar when only a single image is available. To achieve this, we employ a pre-trained 3D GAN to lift the single image to 3D and then render the 4 input images for our model. This allows us to create 3D head avatars from single images and even highly out-of-distribution examples like AI-generated faces, paintings or statues. Great work by Tobias Kirschstein from his internship at Meta with Javier Romero, Artem Sevastopolsky, and Shunsuke Saito

Matthias Niessner

74,763 views • 1 year ago

[SIGGRAPH '25] Monocular Online Reconstruction with Enhanced Detail Preservation Abstract (excerpt): Our approach addresses two key challenges in monocular online reconstruction: 1. Distributing Gaussians without relying on depth maps. 2. Ensuring both local and global consistency in the reconstructed maps. To achieve this, we introduce two key modules: - Hierarchical Gaussian Management Module: For effective Gaussian distribution. - Global Consistency Optimization Module: For maintaining alignment and coherence at all scales. In addition, we present the Multi-level Occupancy Hash Voxels (MOHV), a structure that regularizes Gaussians to capture details across multiple levels of granularity. MOHV ensures accurate reconstruction of both fine and coarse geometries and textures, preserving intricate details while maintaining overall structural integrity. Compared to state-of-the-art RGB-only and even RGB-D methods, our framework achieves superior reconstruction quality with high computational efficiency.

MrNeRF

23,638 views • 1 year ago

[SIGGRAPH '25] TeGA: Texture Space Gaussian Avatars for High-Resolution Dynamic Head Modeling Note: On the left that's a 3DGS rendering! Contributions: 1. We propose a simple approach for rigging 3D Gaussians within the continuous tangent space of 3DMM face models, allowing Gaussians to move freely across mesh triangles. 2. We propose a novel CNN-based deformation model that is agnostic to the number of 3D Gaussians, naturally enabling adaptive densification of the representation to improve detail where most needed, with expression-dependent shading. 3. We show significant improvements over baseline SOTA methods and demonstrate the ability to render even extreme close-up images at high quality.

MrNeRF

29,010 views • 1 year ago