Supervised learning has held 3D Vision back for too long. Meet RayZer, a self-supervised 3D model trained with zero 3D labels:
❌ No supervision of camera & geometry
✅ Just RGB images
And the wild part? RayZer outperforms supervised methods (as 3D labels from COLMAP are noisy).

Hanwen Jiang

1,970 subscribers

69,527 views • 1 year ago • via X (Twitter)

9 Comments

Hanwen Jiang • 1 year ago

🔍 How does RayZer work? It performs 3D-aware image auto-encoding: it first disentangles images into scene + camera (reconstruction), then re-entangles them back into images (rendering), and learns via an RGB loss. The key is splitting the images into two sets, one to reconstruct the scene and the other to provide supervision, which avoids trivial non-3D solutions.
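
The two-set idea described in this comment can be sketched roughly as follows. This is a minimal illustration and not RayZer's actual code: `split_views`, `rgb_loss`, `training_step`, and the callables `predict_cameras`, `encode_scene`, and `render` are hypothetical names, and the sketch assumes cameras are predicted for every view and that the loss is a plain mean squared error on pixel values.

```python
import random


def split_views(num_views, num_input, seed=0):
    """Randomly partition view indices into an input set and a target set."""
    if not 0 < num_input < num_views:
        raise ValueError("need at least one input view and one target view")
    rng = random.Random(seed)
    indices = list(range(num_views))
    rng.shuffle(indices)
    return sorted(indices[:num_input]), sorted(indices[num_input:])


def rgb_loss(pred, target):
    """Mean squared error between two equally sized flat lists of pixel values."""
    if len(pred) != len(target):
        raise ValueError("prediction and target must have the same size")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def training_step(images, predict_cameras, encode_scene, render, num_input):
    """One self-supervised step: the only training signal is the RGB loss.

    predict_cameras, encode_scene and render are placeholders for learned
    modules (hypothetical interfaces, assumed for this sketch).
    """
    # Assumption: cameras are estimated for all views from pixels alone.
    cameras = predict_cameras(images)
    input_ids, target_ids = split_views(len(images), num_input)
    # The scene is reconstructed from the input set only ...
    scene = encode_scene(
        [images[i] for i in input_ids],
        [cameras[i] for i in input_ids],
    )
    # ... and supervised by re-rendering the held-out target views.
    losses = [rgb_loss(render(scene, cameras[j]), images[j]) for j in target_ids]
    return sum(losses) / len(losses)
```

Because the target views are never fed to the scene encoder, the model cannot lower the loss by simply copying its inputs, which is one way to read the "trivial non-3D solutions" remark above.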

Hanwen Jiang • 1 year ago

🤯 RayZer outperforms supervised methods. Why? Turns out, 3D labels from COLMAP are noisy. GS-LRM and LVSM consistently fail on scenes with glasses, high luminance intensity, and white walls. These are cases where COLMAP usually fails. This highlights the need for self-supervised learning, and shows just how powerful it can be.

Hanwen Jiang • 1 year ago

RayZer is similar to video generation models philosophically:
❌ No 3D-aware architecture
❌ No 3D representation & rendering equation
❌ No 3D supervision
✅ But 3D awareness emerges.
(We show more inference results)

Hanwen Jiang • 1 year ago

Joint work with @HaoTan5 @totoro97_ @Haian_Jin @__yuezhao__ @Sai__Bi @KaiZhang9546 @fujun_luan Kalyan Sunkavalli @qixing_huang @geopavlakos

Dmytro Mishkin 🇺🇦 • 1 year ago

Amazing! Dare to try it in Image Matching Challenge? :)

Hanwen Jiang • 1 year ago

haha, I don't think it works on images with different lighting conditions now

relu • 1 year ago

Super cool. I’ve been looking for pose estimation without any supervision from SfM and couldn’t find any papers! Was super surprised. I’m glad someone finally got this working

Jeffrey Ouyang-Zhang • 1 year ago

cool work!

Jang Hyun (Vincent) Cho • 1 year ago

amazing

Related Videos

(1/N) Will this be the BERT/GPT moment for 3D vision? Finally, unsupervised pre-training for 3D works. Led by Qitao Zhao, we present E-RayZer, a fully self-supervised 3D reconstruction model that: 🔥Matches or surpasses supervised methods like VGGT 👀Learns transferable 3D representations, outperforming CroCo, VideoMAE, and DINO 📈Scales with more unlabeled data A new recipe for scalable 3D foundation models.

Hanwen Jiang

57,886 views • 6 months ago

Introducing “FlowMap”, the first self-supervised, differentiable structure-from-motion method that is competitive with conventional SfM like Colmap! IMO this solves a major missing piece for internet-scale training of 3D Deep Learning methods. 1/n

Vincent Sitzmann

128,565 views • 2 years ago

Pixie: Fast and Generalizable Supervised Learning of 3D Physics from Pixels

AK

21,434 views • 9 months ago

Meet LA-Pose. Our latest model taking Wayve another step towards generalization at scale. LA-Pose employs large-scale self-supervised learning, building strong motion representations for 3D perception from 10.2 million unlabeled driving video snippets, unlike today's strongest approaches that often depend on expensive, carefully curated 3D supervision. With only a lightweight pose head and limited labelled data, LA-Pose achieves: 📷 State-of-the-art camera pose estimation 🌎 Strong zero-shot generalization across diverse driving scenarios 🏷️ Orders of magnitude less labelled data than fully supervised 3D approaches Our full blog post: Explore the full paper here:

Wayve

36,410 views • 1 month ago

Phidias A Generative Model for Creating 3D Content from Text, Image, and 3D Conditions with Reference-Augmented Diffusion discuss: In 3D modeling, designers often use an existing 3D model as a reference to create new ones. This practice has inspired the development of Phidias, a novel generative model that uses diffusion for reference-augmented 3D generation. Given an image, our method leverages a retrieved or user-provided 3D reference model to guide the generation process, thereby enhancing the generation quality, generalization ability, and controllability. Our model integrates three key components: 1) meta-ControlNet that dynamically modulates the conditioning strength, 2) dynamic reference routing that mitigates misalignment between the input image and 3D reference, and 3) self-reference augmentations that enable self-supervised training with a progressive curriculum. Collectively, these designs result in a clear improvement over existing methods. Phidias establishes a unified framework for 3D generation using text, image, and 3D conditions with versatile applications.

AK

25,120 views • 1 year ago

HunyuanWorld-Voyager is here and fully open-source! The world’s first ultra-long-range world model with native 3D reconstruction, redefining AI-driven spatial intelligence for VR, gaming, and simulations. ✅Direct 3D Output: Exports point cloud videos to 3D formats without tools like COLMAP, enabling instant 3D application use. ✅Innovative 3D Memory: Introduces a scalable world caching mechanism, ensuring geometric consistency across any camera trajectory. ✅Top-Ranked Performance: #1 on Stanford’s WorldScore, excelling in video generation and 3D reconstruction benchmarks. Built on HunyuanWorld 1.0, Voyager blends video generation with 3D modeling, delivering camera-controlled, high-fidelity RGB-D sequences. Control scenes via keyboard or joystick for unmatched 3D consistency. Explore now: 🌐Project Page: 🔗GitHub: 🤗HuggingFace: 📝Technical Details:

Tencent Hy

198,207 views • 9 months ago

Introducing DINOv3: a state-of-the-art computer vision model trained with self-supervised learning (SSL) that produces powerful, high-resolution image features. For the first time, a single frozen vision backbone outperforms specialized solutions on multiple long-standing dense prediction tasks. Learn more about DINOv3 here:

AI at Meta

899,498 views • 10 months ago

Spatio-Temporal Reconstruction Model for Large-Scale Outdoor Scenes Contributions: • We propose STORM, the first feed-forward, self-supervised method for fast and accurate reconstruction of dynamic 3D scenes from sparse, multi-timestep, posed camera images. • Our bottom-up framework aggregates and transforms per-frame 3D Gaussian Splats into a cohesive scene representation, enabling self-supervised motion estimation. Furthermore, we introduce motion tokens that capture common motion primitives and regularize motion predictions, facilitating dynamic motion group segmentation without explicit motion or correspondence supervision. • We present several enhancements for in-the-wild scenarios, including sky modeling, camera exposure inconsistency handling, large novel-view extrapolation, and fine-grained human motions reconstruction, making STORM well-suited for real-world applications.

MrNeRF

53,292 views • 1 year ago

Introducing FLARE #CVPR2026 2025 FLARE is a feed-forward model that simultaneously estimates high-quality camera poses, 3D geometry, and appearance from sparse uncalibrated images. 1/4

Gordon Wetzstein

29,518 views • 1 year ago

Introducing Neural Jacobian Fields, robot 3D kinematic models learned only from vision! They can model & control robots from just a single RGB camera, even those w/ intractable kinematics & no embedded sensors such as soft, 3D-printed pneumatic hands! 1/n

Vincent Sitzmann

54,023 views • 1 year ago

We present VLM-3R: a Vision-Language Model capable of 3D spatial reasoning from monocular video, grounding visual cues, geometry, and camera motion. ✅ No depth sensor ✅ No pre-built 3D maps ✅ End-to-end spatial + temporal reasoning 🔗 Code & benchmark: #VLM #3DVision #LLMs

Zhiwen(Aaron) Fan

14,895 views • 1 year ago

[1/N] Current visual geometry prediction models primarily rely on labeled 3D data. Our CVPR26 paper, Flow3r, allows additionally leveraging unlabeled videos (using flow supervision) for scalable visual geometry learning, enabling accurate multi-view 3D reconstruction in-the-wild.

Shubham Tulsiani

15,974 views • 3 months ago

"YoNoSplat: You Only Need One Model for Feedforward 3D Gaussian Splatting" TL;DR: a unified 3D Gaussian splatting model that reconstructs high-quality scene geometry and camera poses from unposed/uncalibrated images in a single forward pass.

Alexandre Morgand

14,839 views • 3 months ago

Starting the new year without human labeling 🎉!! Multimodal lidar-camera data is a gold mine of dense 3D geometry hiding in plain sight. For supervised pretraining and validation at scale at Torc-Robotics, we rely on fully automated pseudo-labeling pipelines. Exploiting geometric priors from temporally accumulated LiDAR maps and an iterative update rule enforces joint geometric–semantic consistency while detecting moving objects via inconsistencies. We achieve 3D semantic labels and 3D bounding boxes with human-like quality at 200m+ range required for highway driving. Paper: Exciting work with Torc-Robotics with Filippo Ghilotti, Samuel Brucker, Nahku Saidy, Matteo Matteucci, Mario Bijelic.

Felix Heide

18,202 views • 5 months ago

3D-VLA: A 3D Vision-Language-Action Generative World Model. Recent vision-language-action (VLA) models rely on 2D inputs, lacking integration with the broader realm of the 3D physical world. Furthermore, they perform action prediction by learning a direct mapping from

AK

84,940 views • 2 years ago

What if we can simulate an *interactive 3D world*, from a single image, in the wild, in real time? Introducing PointWorld-1B: a large pre-trained 3D world model that predicts env dynamics given RGB-D capture and robot actions. 🌐 from Stanford University NVIDIA

Wenlong Huang

272,592 views • 5 months ago

Wow. No more waiting hours for COLMAP / SfM — you get live 3D Gaussian Splats straight from unposed images, even for massive scenes. Your 3D scene and camera poses are ready right as you finish capturing. Inria team cooked again; code drops end of May.

Bilawal Sidhu

48,480 views • 1 year ago

Love these old meets new vfx workflows 1. extract a still of the object you want to augment 2. use image-to-3d to make a 3d model of the object 3. use that 3d geometry for classical object tracking Then you can go wild with complete control

Bilawal Sidhu

33,824 views • 1 year ago

LLaVA-3D A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness Recent advancements in Large Multimodal Models (LMMs) have greatly enhanced their proficiency in 2D visual understanding tasks, enabling them to effectively process and understand images and videos. However, the development of LMMs with 3D-awareness for 3D scene understanding has been hindered by the lack of large-scale 3D vision-language datasets and powerful 3D encoders. In this paper, we introduce a simple yet effective framework called LLaVA-3D. Leveraging the strong 2D understanding priors from LLaVA, our LLaVA-3D efficiently adapts LLaVA for 3D scene understanding without compromising 2D understanding capabilities. To achieve this, we employ a simple yet effective representation, 3D Patch, which connects 2D CLIP patch features with their corresponding positions in 3D space. By integrating the 3D Patches into 2D LMMs and employing joint 2D and 3D vision-language instruction tuning, we establish a unified architecture for both 2D image understanding and 3D scene understanding. Experimental results show that LLaVA-3D converges 3.5x faster than existing 3D LMMs when trained on 3D vision-language datasets. Moreover, LLaVA-3D not only achieves state-of-the-art performance across various 3D tasks but also maintains comparable 2D image understanding and vision-language conversation capabilities with LLaVA.

AK

41,713 views • 1 year ago

3D-LLM: Injecting the 3D World into Large Language Models paper page: Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning. Powerful as these models can be, they are not grounded in the 3D physical world, which involves richer concepts such as spatial relationships, affordances, physics, layout, and so on. In this work, we propose to inject the 3D world into large language models and introduce a whole new family of 3D-LLMs. Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks, including captioning, dense captioning, 3D question answering, task decomposition, 3D grounding, 3D-assisted dialog, navigation, and so on. Using three types of prompting mechanisms that we design, we are able to collect over 300k 3D-language data covering these tasks. To efficiently train 3D-LLMs, we first utilize a 3D feature extractor that obtains 3D features from rendered multi-view images. Then, we use 2D VLMs as our backbones to train our 3D-LLMs. By introducing a 3D localization mechanism, 3D-LLMs can better capture 3D spatial information. Experiments on ScanQA show that our model outperforms state-of-the-art baselines by a large margin (e.g., the BLEU-1 score surpasses state-of-the-art score by 9%). Furthermore, experiments on our held-in datasets for 3D captioning, task composition, and 3D-assisted dialogue show that our model outperforms 2D VLMs. Qualitative examples also show that our model could perform more tasks beyond the scope of existing LLMs and VLMs.

AK

249,572 views • 2 years ago