Video wird geladen...

Video konnte nicht geladen werden

Beim Laden dieses Videos ist ein Problem aufgetreten. Dies könnte an einem vorübergehenden Netzwerkproblem liegen oder das Video ist möglicherweise nicht verfügbar.

How can we learn 3D visual grounding with natural supervision—only looking at QA pairs, without ground truth bounding boxes or object classification labels? We inject explicit language priors, e.g., the symmetric property that A near B ⇒ B near A, in structured vision models.

Joy Hsu

2,618 subscribers

14,843 Aufrufe • vor 2 Jahren •via X (Twitter)

Bildung Wissenschaft & Technologie

Anya Rossi• Live Now

Private livecam show

4 Kommentare

Profilbild von Joy Hsu

Joy Hsuvor 2 Jahren

Our Language-Regularized Concept Learner (LARC) extracts language constraints from LLMs, and uses them as regularization on the interpretable & intermediate representations of neuro-symbolic models. LARC's representations encode a variety of logical language properties.

Profilbild von Joy Hsu

Joy Hsuvor 2 Jahren

LARC shows strong data efficiency and generalization to new concepts & dataset, and is a promising step towards injecting structured visual reasoning frameworks with explicit language-based priors, for learning in settings without dense visual supervision.

Profilbild von Joy Hsu

Joy Hsuvor 2 Jahren

Will be presenting LARC at @CVPR next Thursday morning (session #3) with our fantastic intern @chunfeng3364, as well as @Weiyu_Liu_ and @jiajunwu_cs! Paper: Website: Code:

Profilbild von BEATRIX 李 OBSERVE 💜 ME

BEATRIX 李 OBSERVE 💜 MEvor 2 Jahren

my nudes in profile

Ähnliche Videos

Can we synthesize 3D human-scene interactions without learning from any 3D data? Yes! Check out Lei Li's GenZI, a novel zero-shot approach to generating 3D interactions by distilling priors from large vision-language models.

Can we synthesize 3D human-scene interactions without learning from any 3D data? Yes! Check out Lei Li's GenZI, a novel zero-shot approach to generating 3D interactions by distilling priors from large vision-language models.

Angela Dai

106,850 Aufrufe • vor 2 Jahren

Doppelgangers: Learning to Disambiguate Images of Similar Structures paper page: We consider the visual disambiguation task of determining whether a pair of visually similar images depict the same or distinct 3D surfaces (e.g., the same or opposite sides of a symmetric building). Illusory image matches, where two images observe distinct but visually similar 3D surfaces, can be challenging for humans to differentiate, and can also lead 3D reconstruction algorithms to produce erroneous results. We propose a learning-based approach to visual disambiguation, formulating it as a binary classification task on image pairs. To that end, we introduce a new dataset for this problem, Doppelgangers, which includes image pairs of similar structures with ground truth labels. We also design a network architecture that takes the spatial distribution of local keypoints and matches as input, allowing for better reasoning about both local and global cues. Our evaluation shows that our method can distinguish illusory matches in difficult cases, and can be integrated into SfM pipelines to produce correct, disambiguated 3D reconstructions.

Doppelgangers: Learning to Disambiguate Images of Similar Structures paper page: We consider the visual disambiguation task of determining whether a pair of visually similar images depict the same or distinct 3D surfaces (e.g., the same or opposite sides of a symmetric building). Illusory image matches, where two images observe distinct but visually similar 3D surfaces, can be challenging for humans to differentiate, and can also lead 3D reconstruction algorithms to produce erroneous results. We propose a learning-based approach to visual disambiguation, formulating it as a binary classification task on image pairs. To that end, we introduce a new dataset for this problem, Doppelgangers, which includes image pairs of similar structures with ground truth labels. We also design a network architecture that takes the spatial distribution of local keypoints and matches as input, allowing for better reasoning about both local and global cues. Our evaluation shows that our method can distinguish illusory matches in difficult cases, and can be integrated into SfM pipelines to produce correct, disambiguated 3D reconstructions.

AK

76,878 Aufrufe • vor 2 Jahren

Most multi-view reconstruction models need full supervision. We show they can self-improve without any ground truth labels. Introducing SelfEvo: Self-Improving 4D Perception via Self-Distillation. Up to +36.5% in video depth, +20.1% in camera estimation, zero annotation.

Most multi-view reconstruction models need full supervision. We show they can self-improve without any ground truth labels. Introducing SelfEvo: Self-Improving 4D Perception via Self-Distillation. Up to +36.5% in video depth, +20.1% in camera estimation, zero annotation.

Qianqian Wang

24,309 Aufrufe • vor 2 Monaten

Object Cutter Create high-quality HD background removal for ANY object in your image with a text prompt or bounding boxes!

Object Cutter Create high-quality HD background removal for ANY object in your image with a text prompt or bounding boxes!

Gradio

135,968 Aufrufe • vor 1 Jahr

Introducing WildDet3D, a grounding model for monocular 3D object detection in the wild. A question I keep coming back to is: what is the right backbone for robotics foundation models? Should it be a video model, a language model, or perhaps a grounding model? WildDet3D is our first step in exploring that direction.

Introducing WildDet3D, a grounding model for monocular 3D object detection in the wild. A question I keep coming back to is: what is the right backbone for robotics foundation models? Should it be a video model, a language model, or perhaps a grounding model? WildDet3D is our first step in exploring that direction.

Jiafei Duan

12,072 Aufrufe • vor 2 Monaten

LLaVA-3D A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness Recent advancements in Large Multimodal Models (LMMs) have greatly enhanced their proficiency in 2D visual understanding tasks, enabling them to effectively process and understand images and videos. However, the development of LMMs with 3D-awareness for 3D scene understanding has been hindered by the lack of large-scale 3D vision-language datasets and powerful 3D encoders. In this paper, we introduce a simple yet effective framework called LLaVA-3D. Leveraging the strong 2D understanding priors from LLaVA, our LLaVA-3D efficiently adapts LLaVA for 3D scene understanding without compromising 2D understanding capabilities. To achieve this, we employ a simple yet effective representation, 3D Patch, which connects 2D CLIP patch features with their corresponding positions in 3D space. By integrating the 3D Patches into 2D LMMs and employing joint 2D and 3D vision-language instruction tuning, we establish a unified architecture for both 2D image understanding and 3D scene understanding. Experimental results show that LLaVA-3D converges 3.5x faster than existing 3D LMMs when trained on 3D vision-language datasets. Moreover, LLaVA-3D not only achieves state-of-the-art performance across various 3D tasks but also maintains comparable 2D image understanding and vision-language conversation capabilities with LLaVA.

LLaVA-3D A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness Recent advancements in Large Multimodal Models (LMMs) have greatly enhanced their proficiency in 2D visual understanding tasks, enabling them to effectively process and understand images and videos. However, the development of LMMs with 3D-awareness for 3D scene understanding has been hindered by the lack of large-scale 3D vision-language datasets and powerful 3D encoders. In this paper, we introduce a simple yet effective framework called LLaVA-3D. Leveraging the strong 2D understanding priors from LLaVA, our LLaVA-3D efficiently adapts LLaVA for 3D scene understanding without compromising 2D understanding capabilities. To achieve this, we employ a simple yet effective representation, 3D Patch, which connects 2D CLIP patch features with their corresponding positions in 3D space. By integrating the 3D Patches into 2D LMMs and employing joint 2D and 3D vision-language instruction tuning, we establish a unified architecture for both 2D image understanding and 3D scene understanding. Experimental results show that LLaVA-3D converges 3.5x faster than existing 3D LMMs when trained on 3D vision-language datasets. Moreover, LLaVA-3D not only achieves state-of-the-art performance across various 3D tasks but also maintains comparable 2D image understanding and vision-language conversation capabilities with LLaVA.

AK

41,713 Aufrufe • vor 1 Jahr

What happens when VLMs meet 3D foundation models? See VLM-3R (CVPR 2026). VLM-3R links a vision-language model (e.g., Qwen) with 3D geometric foundation models (e.g., CUT3R) at metric scale. Given an uncalibrated video, it moves beyond pixels to perceive and reason in 3D space. Code (open source):

What happens when VLMs meet 3D foundation models? See VLM-3R (CVPR 2026). VLM-3R links a vision-language model (e.g., Qwen) with 3D geometric foundation models (e.g., CUT3R) at metric scale. Given an uncalibrated video, it moves beyond pixels to perceive and reason in 3D space. Code (open source):

Zhiwen(Aaron) Fan

10,625 Aufrufe • vor 3 Monaten

3D-LLM: Injecting the 3D World into Large Language Models paper page: Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning. Powerful as these models can be, they are not grounded in the 3D physical world, which involves richer concepts such as spatial relationships, affordances, physics, layout, and so on. In this work, we propose to inject the 3D world into large language models and introduce a whole new family of 3D-LLMs. Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks, including captioning, dense captioning, 3D question answering, task decomposition, 3D grounding, 3D-assisted dialog, navigation, and so on. Using three types of prompting mechanisms that we design, we are able to collect over 300k 3D-language data covering these tasks. To efficiently train 3D-LLMs, we first utilize a 3D feature extractor that obtains 3D features from rendered multi- view images. Then, we use 2D VLMs as our backbones to train our 3D-LLMs. By introducing a 3D localization mechanism, 3D-LLMs can better capture 3D spatial information. Experiments on ScanQA show that our model outperforms state-of-the-art baselines by a large margin (e.g., the BLEU-1 score surpasses state-of-the-art score by 9%). Furthermore, experiments on our held-in datasets for 3D captioning, task composition, and 3D-assisted dialogue show that our model outperforms 2D VLMs. Qualitative examples also show that our model could perform more tasks beyond the scope of existing LLMs and VLMs.

3D-LLM: Injecting the 3D World into Large Language Models paper page: Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning. Powerful as these models can be, they are not grounded in the 3D physical world, which involves richer concepts such as spatial relationships, affordances, physics, layout, and so on. In this work, we propose to inject the 3D world into large language models and introduce a whole new family of 3D-LLMs. Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks, including captioning, dense captioning, 3D question answering, task decomposition, 3D grounding, 3D-assisted dialog, navigation, and so on. Using three types of prompting mechanisms that we design, we are able to collect over 300k 3D-language data covering these tasks. To efficiently train 3D-LLMs, we first utilize a 3D feature extractor that obtains 3D features from rendered multi- view images. Then, we use 2D VLMs as our backbones to train our 3D-LLMs. By introducing a 3D localization mechanism, 3D-LLMs can better capture 3D spatial information. Experiments on ScanQA show that our model outperforms state-of-the-art baselines by a large margin (e.g., the BLEU-1 score surpasses state-of-the-art score by 9%). Furthermore, experiments on our held-in datasets for 3D captioning, task composition, and 3D-assisted dialogue show that our model outperforms 2D VLMs. Qualitative examples also show that our model could perform more tasks beyond the scope of existing LLMs and VLMs.

AK

249,572 Aufrufe • vor 2 Jahren

Without using any text prompting, we can slice images together in a purely visual language:

Without using any text prompting, we can slice images together in a purely visual language:

Gytis Daujotas

68,491 Aufrufe • vor 1 Jahr

🎉 恭喜发财🧧🐍 As we welcome the Chinese New Year, we're thrilled to announce the launch of Qwen2.5-VL , our latest flagship vision-language model! 🚀 💗 Qwen Chat: 📖 Blog: 🤗 Hugging Face: 🤖 ModelScope: 🌟 Key Highlights: * Visual Understanding : From flowers to complex charts, Qwen2.5-VL sees it all! * Agentic Capabilities : It’s a visual agent that can reason and interact with tools like computers & phones. * Long Video Comprehension : Captures events in videos over 1 hour long! ⏳🎥 * Precise Localization : Generates bounding boxes & JSON outputs for accurate object detection. * Structured Data Outputs : Perfect for finance & commerce, handling invoices, forms & more! 💼📊 Try Qwen2.5-VL now at Qwen Chat or explore models on Hugging Face & ModelScope . 🌐

🎉 恭喜发财🧧🐍 As we welcome the Chinese New Year, we're thrilled to announce the launch of Qwen2.5-VL , our latest flagship vision-language model! 🚀 💗 Qwen Chat: 📖 Blog: 🤗 Hugging Face: 🤖 ModelScope: 🌟 Key Highlights: * Visual Understanding : From flowers to complex charts, Qwen2.5-VL sees it all! * Agentic Capabilities : It’s a visual agent that can reason and interact with tools like computers & phones. * Long Video Comprehension : Captures events in videos over 1 hour long! ⏳🎥 * Precise Localization : Generates bounding boxes & JSON outputs for accurate object detection. * Structured Data Outputs : Perfect for finance & commerce, handling invoices, forms & more! 💼📊 Try Qwen2.5-VL now at Qwen Chat or explore models on Hugging Face & ModelScope . 🌐

Qwen

762,213 Aufrufe • vor 1 Jahr

Starting the new year without human labeling 🎉!! Multimodal lidar-camera data is a gold mine of dense 3D geometry hiding in plain sight. For supervised pretraining and validation at scale at Torc-Robotics, we rely on fully automated pseudo-labeling pipelines. Exploiting geometric priors from temporally accumulated LiDAR maps and an iterative update rule enforces joint geometric–semantic consistency while detecting moving objects via inconsistencies. We achieve 3D semantic labels and 3D bounding boxes with human-like quality at 200m+ range required for highway driving. Paper: Exciting work with Torc-Robotics with Filippo Ghilotti, Samuel Brucker, Nahku Saidy, Matteo Matteucci, Mario Bijelic.

Starting the new year without human labeling 🎉!! Multimodal lidar-camera data is a gold mine of dense 3D geometry hiding in plain sight. For supervised pretraining and validation at scale at Torc-Robotics, we rely on fully automated pseudo-labeling pipelines. Exploiting geometric priors from temporally accumulated LiDAR maps and an iterative update rule enforces joint geometric–semantic consistency while detecting moving objects via inconsistencies. We achieve 3D semantic labels and 3D bounding boxes with human-like quality at 200m+ range required for highway driving. Paper: Exciting work with Torc-Robotics with Filippo Ghilotti, Samuel Brucker, Nahku Saidy, Matteo Matteucci, Mario Bijelic.

Felix Heide

18,202 Aufrufe • vor 5 Monaten

In Prompt Engineering for Vision Models, taught by Abby Jacques Verre and Caleb Kaiser of Comet , you’ll learn how to prompt and fine-tune vision models for personalized image generation, image editing, object detection and segmentation. The prompts you'll use for vision models could be text, point coordinates, or bounding boxes, depending on the model. You'll also learn to tune hyperparameters to shape the output. Models you'll use include Segment-Anything Model (SAM), OWL-ViT, and Stable Diffusion. You'll also learn to fine-tune Stable Diffusion to generate personalized images (say, an image of a specific person), using a handful of images for training. As an example of a multi-step workflow, you'll use OWL-ViT to detect an object based on a text prompt, then pass the bounding box to SAM to create a segmentation mask, and input that mask into Stable Diffusion to replace the original object with a new one based on a text prompt. Controlling vision models can be tricky; this course will teach prompting and fine-tuning techniques to get precise control over their output. Get started here:

In Prompt Engineering for Vision Models, taught by Abby Jacques Verre and Caleb Kaiser of Comet , you’ll learn how to prompt and fine-tune vision models for personalized image generation, image editing, object detection and segmentation. The prompts you'll use for vision models could be text, point coordinates, or bounding boxes, depending on the model. You'll also learn to tune hyperparameters to shape the output. Models you'll use include Segment-Anything Model (SAM), OWL-ViT, and Stable Diffusion. You'll also learn to fine-tune Stable Diffusion to generate personalized images (say, an image of a specific person), using a handful of images for training. As an example of a multi-step workflow, you'll use OWL-ViT to detect an object based on a text prompt, then pass the bounding box to SAM to create a segmentation mask, and input that mask into Stable Diffusion to replace the original object with a new one based on a text prompt. Controlling vision models can be tricky; this course will teach prompting and fine-tuning techniques to get precise control over their output. Get started here:

Andrew Ng

151,198 Aufrufe • vor 2 Jahren

Gemini has always been my go-to LLM for document extraction - but accurate bounding boxes with agentic vision are a huge value-add in the latest Gemini models. Shown: extracting structured data from an oil & gas lease in Ector County Texas with the best bboxes I've seen yet.

Gemini has always been my go-to LLM for document extraction - but accurate bounding boxes with agentic vision are a huge value-add in the latest Gemini models. Shown: extracting structured data from an oil & gas lease in Ector County Texas with the best bboxes I've seen yet.

Kyle Walker

79,608 Aufrufe • vor 4 Monaten

AI + robotics research is starting to pick up steam. We can now instruct Spot using natural language using Language-guided Skill Coordination (LSC). A user provides a natural language instruction: "Bring me the chocolates box, cereal box, and pill bottle, and put them on the bedroom table" and the robot navigates to the location of the target objects and places them on the room table. Current versions of Spot rely on a fixed-set vocabulary that cannot generalize to diverse instructions. In this demo, the researchers presented a method that uses large language models to receive a free-form natural language instruction for object rearrangement, which it then executes. It combines: -A voice-to-text model that processes the instructions into text -An LLM that converts natural language instruction to call a library of skills -A perception module that provides ground truth locations and visual object detection And what your left with is a glimpse into what our future will hold: robots executing tasks from human voice prompts.

AI + robotics research is starting to pick up steam. We can now instruct Spot using natural language using Language-guided Skill Coordination (LSC). A user provides a natural language instruction: "Bring me the chocolates box, cereal box, and pill bottle, and put them on the bedroom table" and the robot navigates to the location of the target objects and places them on the room table. Current versions of Spot rely on a fixed-set vocabulary that cannot generalize to diverse instructions. In this demo, the researchers presented a method that uses large language models to receive a free-form natural language instruction for object rearrangement, which it then executes. It combines: -A voice-to-text model that processes the instructions into text -An LLM that converts natural language instruction to call a library of skills -A perception module that provides ground truth locations and visual object detection And what your left with is a glimpse into what our future will hold: robots executing tasks from human voice prompts.

AI Breakfast

33,718 Aufrufe • vor 3 Jahren

Meta announces SceneScript Reconstructing Scenes With An Autoregressive Structured Language Model We introduce SceneScript, a method that directly produces full scene models as a sequence of structured language commands using an autoregressive, token-based approach. Our

Meta announces SceneScript Reconstructing Scenes With An Autoregressive Structured Language Model We introduce SceneScript, a method that directly produces full scene models as a sequence of structured language commands using an autoregressive, token-based approach. Our

AK

46,090 Aufrufe • vor 2 Jahren

How far can a very simple eye go in solving vision tasks? Like a 1-pixel camera? Humans have one of the greatest eyes in nature, while many animals have significantly simpler eyes and visual systems yet show complex perceptual behavior. In an interesting project, we find that many computer vision tasks can be solved without a typical camera and with such simple 1-pixel sensors (photoreceptors). We also find that proper design (e.g., where to place the photoreceptors strategically) makes a big difference, so we developed a computational design method to find them. 🌐 👁️[Solving Vision Tasks with Simple Photoreceptors Instead of Cameras] 🧵1/n

How far can a very simple eye go in solving vision tasks? Like a 1-pixel camera? Humans have one of the greatest eyes in nature, while many animals have significantly simpler eyes and visual systems yet show complex perceptual behavior. In an interesting project, we find that many computer vision tasks can be solved without a typical camera and with such simple 1-pixel sensors (photoreceptors). We also find that proper design (e.g., where to place the photoreceptors strategically) makes a big difference, so we developed a computational design method to find them. 🌐 👁️[Solving Vision Tasks with Simple Photoreceptors Instead of Cameras] 🧵1/n

Amir Zamir

75,870 Aufrufe • vor 2 Jahren

🚨 New Paper Hybrid Neural World Models. We build neural world models for discontinuous environments and our method discovers when they're making error, with no ground truth. Our neural models run physics fast, leaping many simulator steps in a single pass. How we find points where not to trust it: ask the model for the same future two ways, and the disagreement between the two answers marks where it's failing without needing labels. The same recipe holds on three systems with nothing in common: a chemical wave, compressible gas with shocks, rigid bodies colliding in 3D. 👇

🚨 New Paper Hybrid Neural World Models. We build neural world models for discontinuous environments and our method discovers when they're making error, with no ground truth. Our neural models run physics fast, leaping many simulator steps in a single pass. How we find points where not to trust it: ask the model for the same future two ways, and the disagreement between the two answers marks where it's failing without needing labels. The same recipe holds on three systems with nothing in common: a chemical wave, compressible gas with shocks, rigid bodies colliding in 3D. 👇

Lossfunk

13,533 Aufrufe • vor 25 Tagen

We present VLM-3R: a Vision-Language Model capable of 3D spatial reasoning from monocular video, grounding visual cues, geometry, and camera motion. ✅ No depth sensor ✅ No pre-built 3D maps ✅ End-to-end spatial + temporal reasoning 🔗 Code & benchmark: #VLM #3DVision #LLMs

We present VLM-3R: a Vision-Language Model capable of 3D spatial reasoning from monocular video, grounding visual cues, geometry, and camera motion. ✅ No depth sensor ✅ No pre-built 3D maps ✅ End-to-end spatial + temporal reasoning 🔗 Code & benchmark: #VLM #3DVision #LLMs

Zhiwen(Aaron) Fan

14,895 Aufrufe • vor 1 Jahr

Magic123: One Image to High-Quality 3D Object Generation Using Both 2D and 3D Diffusion Priors paper page: present Magic123, a two-stage coarse-to-fine approach for high-quality, textured 3D meshes generation from a single unposed image in the wild using both2D and 3D priors. In the first stage, we optimize a neural radiance field to produce a coarse geometry. In the second stage, we adopt a memory-efficient differentiable mesh representation to yield a high-resolution mesh with a visually appealing texture. In both stages, the 3D content is learned through reference view supervision and novel views guided by a combination of 2D and 3D diffusion priors. We introduce a single trade-off parameter between the 2D and 3D priors to control exploration (more imaginative) and exploitation (more precise) of the generated geometry. Additionally, we employ textual inversion and monocular depth regularization to encourage consistent appearances across views and to prevent degenerate solutions, respectively. Magic123 demonstrates a significant improvement over previous image-to-3D techniques, as validated through extensive experiments on synthetic benchmarks and diverse real-world images.

Magic123: One Image to High-Quality 3D Object Generation Using Both 2D and 3D Diffusion Priors paper page: present Magic123, a two-stage coarse-to-fine approach for high-quality, textured 3D meshes generation from a single unposed image in the wild using both2D and 3D priors. In the first stage, we optimize a neural radiance field to produce a coarse geometry. In the second stage, we adopt a memory-efficient differentiable mesh representation to yield a high-resolution mesh with a visually appealing texture. In both stages, the 3D content is learned through reference view supervision and novel views guided by a combination of 2D and 3D diffusion priors. We introduce a single trade-off parameter between the 2D and 3D priors to control exploration (more imaginative) and exploitation (more precise) of the generated geometry. Additionally, we employ textual inversion and monocular depth regularization to encourage consistent appearances across views and to prevent degenerate solutions, respectively. Magic123 demonstrates a significant improvement over previous image-to-3D techniques, as validated through extensive experiments on synthetic benchmarks and diverse real-world images.

AK

305,643 Aufrufe • vor 3 Jahren

Blended-NeRF: Zero-Shot Object Generation and Blending in Existing Neural Radiance Fields paper page: Editing a local region or a specific object in a 3D scene represented by a NeRF is challenging, mainly due to the implicit nature of the scene representation. Consistently blending a new realistic object into the scene adds an additional level of difficulty. We present Blended-NeRF, a robust and flexible framework for editing a specific region of interest in an existing NeRF scene, based on text prompts or image patches, along with a 3D ROI box. Our method leverages a pretrained language-image model to steer the synthesis towards a user-provided text prompt or image patch, along with a 3D MLP model initialized on an existing NeRF scene to generate the object and blend it into a specified region in the original scene. We allow local editing by localizing a 3D ROI box in the input scene, and seamlessly blend the content synthesized inside the ROI with the existing scene using a novel volumetric blending technique. To obtain natural looking and view-consistent results, we leverage existing and new geometric priors and 3D augmentations for improving the visual fidelity of the final result. We test our framework both qualitatively and quantitatively on a variety of real 3D scenes and text prompts, demonstrating realistic multi-view consistent results with much flexibility and diversity compared to the baselines. Finally, we show the applicability of our framework for several 3D editing applications, including adding new objects to a scene, removing/replacing/altering existing objects, and texture conversion.

Blended-NeRF: Zero-Shot Object Generation and Blending in Existing Neural Radiance Fields paper page: Editing a local region or a specific object in a 3D scene represented by a NeRF is challenging, mainly due to the implicit nature of the scene representation. Consistently blending a new realistic object into the scene adds an additional level of difficulty. We present Blended-NeRF, a robust and flexible framework for editing a specific region of interest in an existing NeRF scene, based on text prompts or image patches, along with a 3D ROI box. Our method leverages a pretrained language-image model to steer the synthesis towards a user-provided text prompt or image patch, along with a 3D MLP model initialized on an existing NeRF scene to generate the object and blend it into a specified region in the original scene. We allow local editing by localizing a 3D ROI box in the input scene, and seamlessly blend the content synthesized inside the ROI with the existing scene using a novel volumetric blending technique. To obtain natural looking and view-consistent results, we leverage existing and new geometric priors and 3D augmentations for improving the visual fidelity of the final result. We test our framework both qualitatively and quantitatively on a variety of real 3D scenes and text prompts, demonstrating realistic multi-view consistent results with much flexibility and diversity compared to the baselines. Finally, we show the applicability of our framework for several 3D editing applications, including adding new objects to a scene, removing/replacing/altering existing objects, and texture conversion.

AK

62,768 Aufrufe • vor 3 Jahren