Christian Rupprecht explains their interpretability research in 3D computer vision, testing if (and where in the model) multi-view transformers like VGGT, DepthAnything 3, and DUSt3R use point/patch correspondences to make sense of 3D scene geometry.

Chris Offner

4,192 subscribers

74,121 views • 2 months ago • via X (Twitter)

Related videos

VGGT: Visual Geometry Grounded Transformer TL;DR: Is DUSt3R facing a formidable new rival? Contributions: (1) We introduce VGGT, a large feed-forward transformer that can, given one, a few, or even hundreds of images of a scene, predict all its key 3D attributes - including camera intrinsics and extrinsics, point maps, depth maps, and 3D point tracks - in seconds. (2) We demonstrate that VGGT’s predictions are directly usable, being highly competitive and usually better than those of state-of-the-art methods that use slow post-processing optimization techniques. (3) We also show that when further combined with BA post-processing, VGGT achieves state-of-the-art results across the board, even when compared to methods that specialize in a subset of 3D tasks, often improving quality substantially.

MrNeRF

29,461 views • 1 year ago

SAM 3D enables accurate 3D reconstruction from a single image, supporting real-world applications in editing, robotics, and interactive scene generation. Matt, a SAM 3D researcher, explains how the two-model design makes this possible for both people and complex environments. 🔗 Read the SAM 3D Objects research paper: 🔗 Read the SAM 3D Body research paper:

AI at Meta

17,858 views • 6 months ago

MVDream: Multi-view Diffusion for 3D Generation paper page: We propose MVDream, a multi-view diffusion model that is able to generate geometrically consistent multi-view images from a given text prompt. By leveraging image diffusion models pre-trained on large-scale web datasets and a multi-view dataset rendered from 3D assets, the resulting multi-view diffusion model can achieve both the generalizability of 2D diffusion and the consistency of 3D data. Such a model can thus be applied as a multi-view prior for 3D generation via Score Distillation Sampling, where it greatly improves the stability of existing 2D-lifting methods by solving the 3D consistency problem. Finally, we show that the multi-view diffusion model can also be fine-tuned under a few-shot setting for personalized 3D generation, i.e. DreamBooth3D application, where the consistency can be maintained after learning the subject identity.

AK

294,442 views • 2 years ago

LLaVA-3D A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness Recent advancements in Large Multimodal Models (LMMs) have greatly enhanced their proficiency in 2D visual understanding tasks, enabling them to effectively process and understand images and videos. However, the development of LMMs with 3D-awareness for 3D scene understanding has been hindered by the lack of large-scale 3D vision-language datasets and powerful 3D encoders. In this paper, we introduce a simple yet effective framework called LLaVA-3D. Leveraging the strong 2D understanding priors from LLaVA, our LLaVA-3D efficiently adapts LLaVA for 3D scene understanding without compromising 2D understanding capabilities. To achieve this, we employ a simple yet effective representation, 3D Patch, which connects 2D CLIP patch features with their corresponding positions in 3D space. By integrating the 3D Patches into 2D LMMs and employing joint 2D and 3D vision-language instruction tuning, we establish a unified architecture for both 2D image understanding and 3D scene understanding. Experimental results show that LLaVA-3D converges 3.5x faster than existing 3D LMMs when trained on 3D vision-language datasets. Moreover, LLaVA-3D not only achieves state-of-the-art performance across various 3D tasks but also maintains comparable 2D image understanding and vision-language conversation capabilities with LLaVA.

AK

41,713 views • 1 year ago

SpatialTrackerV2: unified, end-to-end 3D point tracking model which simultaneously estimates Camera Motion, Consistent Geometry and Pixel-wise 3D Trajectories.

Bilawal Sidhu

20,346 views • 11 months ago

Love these old meets new vfx workflows
1. extract a still of the object you want to augment
2. use image-to-3d to make a 3d model of the object
3. use that 3d geometry for classical object tracking
Then you can go wild with complete control

Bilawal Sidhu

33,824 views • 1 year ago

Instant Video-to-3D with DUST3R
Dust3r generates a whole 3D scene from just a couple of images. What if it could:
1. Accept a VIDEO
2. Extract the video frames
3. Turn them into 3D?
So added a Gradio Video Component. Here's me generating 3D from a video
cc: NAVER LABS Europe

cocktail peanut

41,029 views • 2 years ago

What if #ChatGPT were 3D? Structure #GPT4's responses in 3D with #Sensecape to better understand large amounts of text. New Human-Computer Interaction #HCI research makes #AI easier to use. More at

Haijun Xia

79,486 views • 3 years ago

DreamCraft3D: Hierarchical 3D Generation with Bootstrapped Diffusion Prior paper page: We present DreamCraft3D, a hierarchical 3D content generation method that produces high-fidelity and coherent 3D objects. We tackle the problem by leveraging a 2D reference image to guide the stages of geometry sculpting and texture boosting. A central focus of this work is to address the consistency issue that existing works encounter. To sculpt geometries that render coherently, we perform score distillation sampling via a view-dependent diffusion model. This 3D prior, alongside several training strategies, prioritizes the geometry consistency but compromises the texture fidelity. We further propose Bootstrapped Score Distillation to specifically boost the texture. We train a personalized diffusion model, Dreambooth, on the augmented renderings of the scene, imbuing it with 3D knowledge of the scene being optimized. The score distillation from this 3D-aware diffusion prior provides view-consistent guidance for the scene. Notably, through an alternating optimization of the diffusion prior and 3D scene representation, we achieve mutually reinforcing improvements: the optimized 3D scene aids in training the scene-specific diffusion model, which offers increasingly view-consistent guidance for 3D optimization. The optimization is thus bootstrapped and leads to substantial texture boosting. With tailored 3D priors throughout the hierarchical generation, DreamCraft3D generates coherent 3D objects with photorealistic renderings, advancing the state-of-the-art in 3D content generation.

AK

161,530 views • 2 years ago

Having fun with computer vision! 👁️ Converted 🤗 image → 3D, then dragged to AA's 3d model playground

apolinario 🌐

20,193 views • 11 months ago

Most generative models predict pixels. Predicting a 3D scene instead has many benefits: the scene won’t change if you look away and come back, and it obeys the basic physical rules of 3D geometry. The simplest way to visualize the 3D scene is a depth map, where each pixel is colored by its distance to the camera. 4/n

World Labs

16,605 views • 1 year ago

We are thrilled to announce a major upgrade to our open-source 3D generation model, introducing two groundbreaking new versions: 3D 2.0 MV (Multi-View Generation) and 3D 2.0 Mini! 3D 2.0 MV : 3D 2.0Mini:

Tencent HY

134,410 views • 1 year ago

Excited to share this demo from Over the Reality! Watch our Unitree Go2 robodog navigating our office while reconstructing the 3D space in real-time using the VGGT-based foundation vision model. This is a prime example of machine perception in action, turning raw RGB camera feeds into rich, detailed 3D maps! The robodog's RGB cam generates a dense, textured 3D reconstruction via VGGT from a few photograms, capturing nuances like object shapes and surfaces with impressive fidelity (main view). Compare that to the standard LiDAR system (top right), it's sparser, more point-cloud focused, lacking the visual richness. Vision models are closing the gap fast! What's powering this? VGGT, a cutting-edge foundation model for 3D perception, trained on datasets orders of magnitude smaller than our massive OVER 3D maps dataset. Imagine the leap when we apply OVER's scale to VGGT-like transformer based architectures, denser reconstructions, better generalization, revolutionary for robotics, machine perception & AR! Stay tuned for more breakthroughs at the intersection of AI, robotics, and DePIN. We're building the future of Physical AI and Spatial Computing at Over the Reality 🌐 What do you think, ready for robodogs in your world? Drop your thoughts! 🤖🌐

Over the Reality 🌐

359,647 views • 9 months ago

Meet MapAnything – a transformer that directly regresses factored metric 3D scene geometry (from images, calibration, poses, or depth) in an end-to-end way. No pipelines, no extra stages. Just 3D geometry & cameras, straight from any type of input, delivering new state-of-the-art results 🚀 One universal model enables SoTA for: 🔥 Mono Depth Estimation 🔥 Multi-View SfM 🔥 Multi-View Stereo 🔥 Depth Completion 🔥 Registration … and many more possibilities! – plus everything is metric 🎯 We release code for data processing, training, benchmarking & ablations – everything Apache 2.0! Details & Links 👇

Nikhil Keetha

122,575 views • 9 months ago

New open-source 3D world-generation model. I'm rendering a couple of worlds in the video, so check it out. You'll find the GitHub and the Hugging Face links to the model below.
This is a multi-modal world model that you can use for a bunch of things:
• To generate new worlds
• To reconstruct worlds
• To simulate 3D interactive worlds from a prompt, images, or a video
You can edit the 3D outputs in Unity and Unreal Engine (they export as meshes, 3DGS files, and point clouds). You can also generate 3D characters in the world and walk around. Pretty fun stuff!

Santiago

65,411 views • 2 months ago

[1/N] Current visual geometry prediction models primarily rely on labeled 3D data. Our CVPR26 paper, Flow3r, allows additionally leveraging unlabeled videos (using flow supervision) for scalable visual geometry learning, enabling accurate multi-view 3D reconstruction in-the-wild.

Shubham Tulsiani

15,974 views • 3 months ago

Use the Compositor node for quick architectural visualizations. Import your 3D model, dial in the view, sketch a loose composition, and render.

Fuser

14,040 views • 3 months ago

🚀 Introducing Meshy-4 — The next-gen 3D model generator is now available to everyone at ✨ Experience dramatically improved mesh geometry in both Image to 3D and Text to 3D workflows. ❤️ Watch the video and discover Meshy’s full potential with us!

MeshyAI

52,437 views • 1 year ago

AI fully taking over 3D is inevitable you can now add 3D bg for 3D models inside 3d software using World Labs's Marble and.. view in real time and even use Octane's 3DGS to light up the model without adding lights this is impossible w/o AI

el.cine

174,354 views • 7 months ago

"YoNoSplat: You Only Need One Model for Feedforward 3D Gaussian Splatting" TL;DR: a unified 3D Gaussian splatting model that reconstructs high-quality scene geometry and camera poses from unposed/uncalibrated images in a single forward pass.

Alexandre Morgand

14,839 views • 3 months ago