VGGT: Visual Geometry Grounded Transformer TL;DR: Is DUSt3R facing a formidable new rival? Contributions: (1) We introduce VGGT, a large feed-forward transformer that can, given one, a few, or even hundreds of images of a scene, predict all its key 3D attributes - including camera intrinsics and extrinsics, point maps, depth maps, and 3D point tracks - in seconds. (2) We demonstrate that VGGT’s predictions are directly usable, being highly competitive and usually better than those of state-of-the-art methods that use slow post-processing optimization techniques. (3) We also show that when further combined with BA post-processing, VGGT achieves state-of-the-art results across the board, even when compared to methods that specialize in a subset of 3D tasks, often improving quality substantially.

MrNeRF

14,174 subscribers

29,461 views • 1 year ago • via X (Twitter)

Education, Science & Technology, Art

12 Comments

MrNeRF • 1 year ago

Paper (pdf): Code:

MrNeRF • 1 year ago

Thanks for bringing this paper to my attention!

MrNeRF • 1 year ago

I'm crafting an email newsletter that turns my daily updates into a captivating weekly digest, complete with exclusive content. Although it's not live yet, you can sign up now! If you're curious, visit my website and join the subscriber list today!

MrNeRF • 1 year ago

Original author's post:

OPEN • 2 years ago

Introducing OPEN, the first genre-defining AAA metaverse gaming experience with top-tier IP powered by web3 technology. Coming to @thereadyverse. #opensoon

Pablo Vela • 1 year ago

Gah looks so cool, still not MIT/Apache 😭😭

MrNeRF • 1 year ago

Yeah, but it is nice to see someone breaking into this monopoly which is good!

Abdullah Hamdi • 1 year ago

Our VGG group

Jianyuan Wang • 1 year ago

Thanks for sharing! We released it in a silent mode for a while but was quickly caught lol

MrNeRF • 1 year ago

The silence is over :D. Awesome paper, thank you!

Sir Mr Meow Meow • 1 year ago

interesting

MrNeRF • 1 year ago

Yes, quite impressive!

Related Videos

Introducing VGGT (CVPR'25), a feedforward Transformer that directly infers all key 3D attributes from one, a few, or hundreds of images, in seconds! No expensive optimization needed, yet delivers SOTA results for: ✅ Camera Pose Estimation ✅ Multi-view Depth Estimation ✅ Dense Point Cloud Reconstruction ✅ Point Tracking Project Page: Code & Weights:

Jianyuan

203,078 views • 1 year ago

OccluGaussian: Occlusion-Aware Gaussian Splatting for Large Scene Reconstruction and Rendering Contributions: • We propose an occlusion-aware scene division strategy that considers the scene layout and camera co-visibilities. The resulting regions barely contain occlusions, and the corresponding training cameras have a higher average contribution, leading to improved reconstruction results. • We present a region-based rendering technique that accelerates 3D Gaussian splatting in large scenes. It eliminates much of the time-consuming processing of invisible 3D Gaussians, boosting rendering speeds without noticeable quality degradation. • We conduct extensive experiments on several large-scene datasets and demonstrate that OccluGaussian achieves superior rendering quality and faster rendering speed compared to previous state-of-the-art methods.

MrNeRF

10,718 views • 1 year ago

Christian Rupprecht explains their interpretability research in 3D computer vision, testing if (and where in the model) multi-view transformers like VGGT, DepthAnything 3, and DUSt3R use point/patch correspondences to make sense of 3D scene geometry.

Chris Offner

74,182 views • 2 months ago

3D-LLM: Injecting the 3D World into Large Language Models paper page: Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning. Powerful as these models can be, they are not grounded in the 3D physical world, which involves richer concepts such as spatial relationships, affordances, physics, layout, and so on. In this work, we propose to inject the 3D world into large language models and introduce a whole new family of 3D-LLMs. Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks, including captioning, dense captioning, 3D question answering, task decomposition, 3D grounding, 3D-assisted dialog, navigation, and so on. Using three types of prompting mechanisms that we design, we are able to collect over 300k 3D-language data covering these tasks. To efficiently train 3D-LLMs, we first utilize a 3D feature extractor that obtains 3D features from rendered multi-view images. Then, we use 2D VLMs as our backbones to train our 3D-LLMs. By introducing a 3D localization mechanism, 3D-LLMs can better capture 3D spatial information. Experiments on ScanQA show that our model outperforms state-of-the-art baselines by a large margin (e.g., the BLEU-1 score surpasses state-of-the-art score by 9%). Furthermore, experiments on our held-in datasets for 3D captioning, task composition, and 3D-assisted dialogue show that our model outperforms 2D VLMs. Qualitative examples also show that our model could perform more tasks beyond the scope of existing LLMs and VLMs.

AK

249,572 views • 2 years ago

Want high-quality 3D meshes with sharp geometric details? Try our newly released MeshFormer! It only takes 8 GPUs for two days of training, outperforming state-of-the-art models that use over a hundred GPUs! With 3D-native input guidance, representations, supervision, and post-processing, we significantly improve the training efficiency and geometric quality of feed-forward reconstruction models! Project page: Chong Zeng

Minghua Liu @ SIGGRAPHASIA25

16,320 views • 1 year ago

Meet MapAnything – a transformer that directly regresses factored metric 3D scene geometry (from images, calibration, poses, or depth) in an end-to-end way. No pipelines, no extra stages. Just 3D geometry & cameras, straight from any type of input, delivering new state-of-the-art results 🚀 One universal model enables SoTA for: 🔥 Mono Depth Estimation 🔥 Multi-View SfM 🔥 Multi-View Stereo 🔥 Depth Completion 🔥 Registration … and many more possibilities! – plus everything is metric 🎯 We release code for data processing, training, benchmarking & ablations – everything Apache 2.0! Details & Links 👇

Nikhil Keetha

122,648 views • 9 months ago

Excited to share this demo from Over the Reality! Watch our Unitree Go2 robodog navigating our office while reconstructing the 3D space in real-time using the VGGT-based foundation vision model. This is a prime example of machine perception in action, turning raw RGB camera feeds into rich, detailed 3D maps! The robodog's RGB cam generates a dense, textured 3D reconstruction via VGGT from a few photograms, capturing nuances like object shapes and surfaces with impressive fidelity (main view). Compare that to the standard LiDAR system (top right), it's sparser, more point-cloud focused, lacking the visual richness. Vision models are closing the gap fast! What's powering this? VGGT, a cutting-edge foundation model for 3D perception, trained on datasets orders of magnitude smaller than our massive OVER 3D maps dataset. Imagine the leap when we apply OVER's scale to VGGT-like transformer based architectures, denser reconstructions, better generalization, revolutionary for robotics, machine perception & AR! Stay tuned for more breakthroughs at the intersection of AI, robotics, and DePIN. We're building the future of Physical AI and Spatial Computing at Over the Reality 🌐 What do you think, ready for robodogs in your world? Drop your thoughts! 🤖🌐

Over the Reality 🌐

359,647 views • 9 months ago

InstantSplat++ is now open source. It is a lightweight library that connects foundation models (VGGT, MASt3R, MAP-Anything, etc.) with the Gaussian splatting family. Given uncalibrated images, it optimizes a 3D scene in a few seconds. Try the demo and code here:

Zhiwen(Aaron) Fan

31,835 views • 3 months ago

Current 3D generative models are slow and low quality. We present GRM, a large-scale model that reconstructs 3D Gaussians in 0.1s and generates high-quality 3D assets from text or single images in a few seconds. Demo: 1/4

Gordon Wetzstein

19,189 views • 2 years ago

GPS-Gaussian+: Generalizable Pixel-wise 3D Gaussian Splatting for Real-Time Human-Scene Rendering from Sparse Views TL;DR: Are we witnessing the first steps towards 3DGS live streaming? Contributions: • We introduce a generalizable 3D Gaussian Splatting methodology that employs pixel-wise Gaussian parameter maps defined on 2D source image planes to formulate 3D Gaussians in a feed-forward manner. • We propose a fully differentiable framework composed of an iterative depth estimation module and a Gaussian parameter regression module. The intermediate depth prediction bridges the two components and allows them to benefit from joint training. • We introduce a regularization term and an epipolar attention mechanism to preserve geometry consistency between the two source views when using only rendering loss. Our method generalizes well to unseen characters even in complicated scenes. • We develop a real-time FVV system that achieves high-resolution rendering of characters in the scene without any geometry supervision.

MrNeRF

25,699 views • 1 year ago

Differentiable Blocks World: Qualitative 3D Decomposition by Rendering Primitives paper page: Given a set of calibrated images of a scene, we present an approach that produces a simple, compact, and actionable 3D world representation by means of 3D primitives. While many approaches focus on recovering high-fidelity 3D scenes, we focus on parsing a scene into mid-level 3D representations made of a small set of textured primitives. Such representations are interpretable, easy to manipulate and suited for physics-based simulations. Moreover, unlike existing primitive decomposition methods that rely on 3D input data, our approach operates directly on images through differentiable rendering. Specifically, we model primitives as textured superquadric meshes and optimize their parameters from scratch with an image rendering loss. We highlight the importance of modeling transparency for each primitive, which is critical for optimization and also enables handling varying numbers of primitives. We show that the resulting textured primitives faithfully reconstruct the input images and accurately model the visible 3D points, while providing amodal shape completions of unseen object regions. We compare our approach to the state of the art on diverse scenes from DTU, and demonstrate its robustness on real-life captures from BlendedMVS and Nerfstudio. We also showcase how our results can be used to effortlessly edit a scene or perform physical simulations.

AK

38,568 views • 2 years ago

LLaVA-3D A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness Recent advancements in Large Multimodal Models (LMMs) have greatly enhanced their proficiency in 2D visual understanding tasks, enabling them to effectively process and understand images and videos. However, the development of LMMs with 3D-awareness for 3D scene understanding has been hindered by the lack of large-scale 3D vision-language datasets and powerful 3D encoders. In this paper, we introduce a simple yet effective framework called LLaVA-3D. Leveraging the strong 2D understanding priors from LLaVA, our LLaVA-3D efficiently adapts LLaVA for 3D scene understanding without compromising 2D understanding capabilities. To achieve this, we employ a simple yet effective representation, 3D Patch, which connects 2D CLIP patch features with their corresponding positions in 3D space. By integrating the 3D Patches into 2D LMMs and employing joint 2D and 3D vision-language instruction tuning, we establish a unified architecture for both 2D image understanding and 3D scene understanding. Experimental results show that LLaVA-3D converges 3.5x faster than existing 3D LMMs when trained on 3D vision-language datasets. Moreover, LLaVA-3D not only achieves state-of-the-art performance across various 3D tasks but also maintains comparable 2D image understanding and vision-language conversation capabilities with LLaVA.

AK

41,713 views • 1 year ago

MeshSplatting: Differentiable Rendering with Opaque Meshes Contributions: (i) An end-to-end optimization of mesh-based scene representations retains visual quality while training 2× faster than current state-of-the-art methods. (ii) Rather than a polygon soup, we generate a connected mesh by refining the vertex locations of a restricted Delaunay triangulation. (iii) Triangles are naturally connected to each other, and quantities stored within vertices are smoothly interpolated across each triangle. (iv) The optimization is aware that the triangles should be opaque, allowing direct high-quality rendering in standard game engines (see Fig. 1), opening the door for classical techniques like the use of depth buffers and occlusion culling [1, 22].

MrNeRF

15,044 views • 6 months ago

Big thanks to AK for highlighting our work! LEO marks our pioneering step towards building an embodied generalist agent that can really comprehend the 3D world! 🚀 Leveraging LLMs, we train LEO with real and synthetic 3D data across a diverse spectrum of tasks. It's thrilling to see LEO surpass current state-of-the-art (SOTA) methods in most benchmarked tasks, all under a single, unified model. 🔥 #Generalist_Agent

Siyuan Huang

22,710 views • 2 years ago

🍺 LagerNVS (CVPR 2026) 🍺 LagerNVS is a generalizable, feed-forward, real-time Novel View Synthesis network which - performs rendering in real time, - generalizes to in-the-wild data, - works with and without known source cameras, - sets a new state-of-the-art among deterministic methods, - can be paired with a diffusion decoder for generative extrapolation. LagerNVS shows that 3D biases are useful for Novel View Synthesis but explicit 3D representations are not required to achieve them. We use 3D biases in (1) architecture design and (2) pre-training: (1) In NVS with explicit 3D representations (3DGS, NeRF) reconstruction is typically difficult and slow, but rendering is much faster and simpler. We mimic this process in the network design: we use a large (1B params) encoder and a small, lightweight decoder (ViT-B). This allows increasing the network capacity while still achieving real-time rendering. (2) The encoder, initialized from VGGT, was pre-trained with 3D reconstruction objectives, making the initial features 3D aware. Both substantially improve performance. Project page: Code: Paper: Models: Work done with Jianyuan Minghao Chen Christian Rupprecht and Andrea Vedaldi

Stan Szymanowicz

31,454 views • 3 months ago

DreamCraft3D: Hierarchical 3D Generation with Bootstrapped Diffusion Prior paper page: We present DreamCraft3D, a hierarchical 3D content generation method that produces high-fidelity and coherent 3D objects. We tackle the problem by leveraging a 2D reference image to guide the stages of geometry sculpting and texture boosting. A central focus of this work is to address the consistency issue that existing works encounter. To sculpt geometries that render coherently, we perform score distillation sampling via a view-dependent diffusion model. This 3D prior, alongside several training strategies, prioritizes the geometry consistency but compromises the texture fidelity. We further propose Bootstrapped Score Distillation to specifically boost the texture. We train a personalized diffusion model, Dreambooth, on the augmented renderings of the scene, imbuing it with 3D knowledge of the scene being optimized. The score distillation from this 3D-aware diffusion prior provides view-consistent guidance for the scene. Notably, through an alternating optimization of the diffusion prior and 3D scene representation, we achieve mutually reinforcing improvements: the optimized 3D scene aids in training the scene-specific diffusion model, which offers increasingly view-consistent guidance for 3D optimization. The optimization is thus bootstrapped and leads to substantial texture boosting. With tailored 3D priors throughout the hierarchical generation, DreamCraft3D generates coherent 3D objects with photorealistic renderings, advancing the state-of-the-art in 3D content generation.

AK

161,530 views • 2 years ago

📣 New research from GenAI at Meta, introducing Meta 3D Gen: A new system for end-to-end generation of 3D assets from text in <1min. Meta 3D Gen is a new combined AI system that can generate high-quality 3D assets, with both high-resolution textures and material maps end-to-end, producing results that are superior to existing solutions — at 3-10x the speed of existing work in this space. Details in the technical report ➡️

AI at Meta

408,708 views • 2 years ago

.介添 shared with us the Game Development Club project, a 3D recreation of an in-game scene, which features a large number of assets, sharing useful skills that help enhance the visual impact of surfaces and methods that add realism to the scene. Read the interview:

80 LEVEL

18,262 views • 1 year ago

Geometric Context Transformer for Streaming 3D Reconstruction Contributions: • We introduce LingBot-Map, a streaming 3D foundation model built around Geometric Context Attention (GCA), which maintains three complementary context types – anchor, pose-reference window, and trajectory memory – for efficient and consistent long-sequence streaming inference. • We propose an efficient training recipe based on progressive training and context parallelism with a relative loss formulation for stable long-sequence optimization. • We demonstrate that LingBot-Map achieves state-of-the-art performance on multiple benchmarks (Oxford Spires, Tanks and Temples, ETH3D, and 7-Scenes), significantly outperforming existing streaming approaches in reconstruction quality and inference speed.

MrNeRF

24,623 views • 2 months ago