🚀 Our NeurIPS '24 work, Large Spatial Model (LSM), is here! LSM performs semantic 3D reconstruction in just 0.1s, processing unposed data via feed-forward 3D reconstruction. 👉 It leverages large-scale 3D datasets with minimal annotations, defining a 3D latent space. We are continuously exploring how this explicit 3D representation can...

Zhiwen(Aaron) Fan

1,813 subscribers

43,826 views • 1 year ago • via X (Twitter)

Science & Technology • Education • Health & Wellness

4 Comments

Zhiwen(Aaron) Fan • 1 year ago

It’s been an unforgettable collaboration with everyone as we discussed and converged on this exciting direction! None of this would have been possible without each of you. Looking forward to expanding the #LargeSpatialModel’s capabilities even further soon! Jian Zhang, @CongWenyan0320, @peihao_wang, Renjie Li, @KairunWen, @ShijieZhoucla, @AchutaKadambi, Zhangyang Wang, @danfei_xu, @iamborisi, @drmapavone, @yuewang314

Boe • 1 year ago

The demo page doesn’t let you upload any kind of photos on iOS. The only input method is through the files app, and .png, .jpg, .heif are not supported.

berkshiremystery • 1 year ago

Seems well designed to work in the #PLTR AIP ontology, as an application 🤫🤔💭🙉 augmenting reality into simulation 🤗🚀 @chadwahl @david_marra

Michael Yuan • 1 year ago

wow

Related Videos

Thrilled to share our new work on Reconviagen✨! A key challenge in 3D creation is the alignment of generative 3D with observational input. Our method solves this by grounding the generative process in 3D reconstruction. Try it at: #Reconstruction #AIGC

Chongjie(CJ) Ye

11,614 views • 9 months ago

Current 3D generative models are slow and low quality. We present GRM, a large-scale model that reconstructs 3D Gaussians in 0.1s and generates high-quality 3D assets from text or single images in a few seconds. Demo: 1/4

Gordon Wetzstein

19,189 views • 2 years ago

📢 Lyra: Generative 3D Scene Reconstruction via Video Diffusion Model Self-Distillation Got only one or a few images and wondering if recovering the 3D environment is a reconstruction or generation problem? Why not do it with a generative reconstruction model! We show that a camera-conditioned video diffusion model can be transformed into a generative reconstruction model that directly outputs a high-quality 3D Gaussian Splatting representation through self-distillation, without requiring real-world training data. Check out our results in the video (wait for dynamic scenes in the second half!) : Project Page: Code and Models: Paper:

Sherwin Bahmani

66,417 views • 8 months ago

3D-LLM: Injecting the 3D World into Large Language Models paper page: Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning. Powerful as these models can be, they are not grounded in the 3D physical world, which involves richer concepts such as spatial relationships, affordances, physics, layout, and so on. In this work, we propose to inject the 3D world into large language models and introduce a whole new family of 3D-LLMs. Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks, including captioning, dense captioning, 3D question answering, task decomposition, 3D grounding, 3D-assisted dialog, navigation, and so on. Using three types of prompting mechanisms that we design, we are able to collect over 300k 3D-language data covering these tasks. To efficiently train 3D-LLMs, we first utilize a 3D feature extractor that obtains 3D features from rendered multi-view images. Then, we use 2D VLMs as our backbones to train our 3D-LLMs. By introducing a 3D localization mechanism, 3D-LLMs can better capture 3D spatial information. Experiments on ScanQA show that our model outperforms state-of-the-art baselines by a large margin (e.g., the BLEU-1 score surpasses state-of-the-art score by 9%). Furthermore, experiments on our held-in datasets for 3D captioning, task composition, and 3D-assisted dialogue show that our model outperforms 2D VLMs. Qualitative examples also show that our model could perform more tasks beyond the scope of existing LLMs and VLMs.

AK

249,494 views • 2 years ago

Large-scale 3D Scene Generation (all scenes are real-time rendered)!! Physically-grounded generative data without hallucinations is the missing link for robot learning and testing at scale. We introduce a method that directly generates large-scale 3D driving scenes with accurate geometry, allowing for causal view synthesis and generation with object permanence and explicit 3D geometry. This also allows for extreme trajectory extrapolation without failure! We also show that we can build fully data-driven simulators for end-to-end learning with this approach. Project: with the amazing team of Julian Ost, Amogh Joshi, Andrea Ramazzina, Maximilian Bömer, Mario Bijelic.

Felix Heide

27,736 views • 9 months ago

⚡️ Excited to announce Fast3R: 3D reconstruction of 1000+ images in a single forward pass! Fast3R achieves 251 FPS at its peak. 🔥 Try the demo with your images or video! 🔗 Website: 🎮 Demo: #CVPR2025 #3D AI at Meta

Jianing “Jed” Yang

71,719 views • 1 year ago

Scaling 3D scene data is a long-standing challenge in scene understanding, spatial reasoning, and robotics. Since scanning, reconstruction, and labeling are so labor-intensive, data scarcity has remained a major bottleneck. 🛑 To solve this, we propose SceneVerse++: Lifting Unlabeled Internet-level Data for 3D Scene Understanding (CVPR 2026). By reconstructing internet videos and annotating 3D scenes automatically, we’ve created a massive real-world dataset for end-to-end understanding. 🌐📐 SceneVerse++ makes it easy to scale "in-the-wild" 3D scenes toward more capable spatial reasoning systems. This significantly promotes progress in 3D VQA, visual navigation, and broader tasks in Embodied AI and Robotics. 🤖🦾 We are fully open-sourced! Check out the paper, code, and data here: 🌐 Project: 📄 Paper: 📊 Dataset: Code:

Siyuan Huang

12,433 views • 1 month ago

💥 Think more real data is needed for scene reconstruction? Think again! Meet MegaSynth: scaling up feed-forward 3D scene reconstruction with synthesized scenes. In 3 days, it generates 700K scenes for training—70x larger than real data! ✨ The secret? Reconstruction is mostly non-semantic! No need to rely heavily on real or highly realistic synthetic data. 🌐 Project: (1/4)

Hanwen Jiang

26,832 views • 1 year ago

More and more users are turning their 360 footage into fully explorable 3D Gaussian Splats with View it here: This reconstruction started from a simple capture and is now a fully explorable 3D Gaussian Splat rendered in the FreeGaussian viewer, complete with point cloud reconstruction and high-fidelity spatial detail. More and more creators are discovering that their existing footage can become immersive 3D scenes. And they’re doing it for free with FreeGaussian

Over the Reality 🌐

14,521 views • 1 month ago

LLaVA-3D A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness Recent advancements in Large Multimodal Models (LMMs) have greatly enhanced their proficiency in 2D visual understanding tasks, enabling them to effectively process and understand images and videos. However, the development of LMMs with 3D-awareness for 3D scene understanding has been hindered by the lack of large-scale 3D vision-language datasets and powerful 3D encoders. In this paper, we introduce a simple yet effective framework called LLaVA-3D. Leveraging the strong 2D understanding priors from LLaVA, our LLaVA-3D efficiently adapts LLaVA for 3D scene understanding without compromising 2D understanding capabilities. To achieve this, we employ a simple yet effective representation, 3D Patch, which connects 2D CLIP patch features with their corresponding positions in 3D space. By integrating the 3D Patches into 2D LMMs and employing joint 2D and 3D vision-language instruction tuning, we establish a unified architecture for both 2D image understanding and 3D scene understanding. Experimental results show that LLaVA-3D converges 3.5x faster than existing 3D LMMs when trained on 3D vision-language datasets. Moreover, LLaVA-3D not only achieves state-of-the-art performance across various 3D tasks but also maintains comparable 2D image understanding and vision-language conversation capabilities with LLaVA.

AK

41,713 views • 1 year ago

⚡🎉 We are thrilled to introduce VORTEX, an AI-powered computational framework for predicting 3D Spatial Transcriptomics (ST) using 3D tissue images and minimal 2D ST! 🧬 By combining cutting-edge 3D non-destructive tissue imaging with AI, VORTEX imputes the 3D molecular landscape of large tissue samples in a cost-effective and scalable manner. 🧠💡 Our approach: By pretraining on diverse 3D morphology–2D transcriptomic pairs from heterogeneous tissue samples, and then fine-tuning on minimal 2D ST data from a volume of interest, VORTEX leverages both generic tissue-specific and sample-specific morphomolecular correlates to predict 3D ST. Congratulations to our superstar co-leads Cristina Almagro Pérez and Andrew H. Song, this was an exciting collaboration with Jonathan Liu, Sizun Jiang, and Ali Bashashati. Preprint: Demo: Read the excellent blog from our superstar grad student Cristina Almagro Pérez: Also see our previous work on 3D Computational Pathology from Andrew H. Song published in Cell last year: Stay tuned for more to come.

Faisal Mahmood

17,969 views • 1 year ago

3D Gaussian Splatting is great, but can it work without the pre-computed camera poses? Introducing: COLMAP-Free 3D Gaussian Splatting Our recent work shows not only it can, but 3D Gaussians make camera pose estimation easy (compared to NeRF) along with reconstruction. 👇🧵

Xiaolong Wang

76,747 views • 2 years ago

🚀Turn Single Image into 3D Human🚀 #GeneMAN is a generalizable single-image 3D human reconstruction framework that turns in-the-wild images into high-quality 3D humans with ease 🔗Project: 📜Paper: 🧑‍💻Code:

Ziwei Liu

26,953 views • 1 year ago

Can we synthesize 3D human-scene interactions without learning from any 3D data? Yes! Check out Lei Li's GenZI, a novel zero-shot approach to generating 3D interactions by distilling priors from large vision-language models.

Angela Dai

106,850 views • 2 years ago

🍺 LagerNVS (CVPR 2026) 🍺 LagerNVS is a generalizable, feed-forward, real-time Novel View Synthesis network which - performs rendering in real time, - generalizes to in-the-wild data, - works with and without known source cameras, - sets a new state-of-the-art among deterministic methods, - can be paired with a diffusion decoder for generative extrapolation. LagerNVS shows that 3D biases are useful for Novel View Synthesis but explicit 3D representations are not required to achieve them. We use 3D biases in (1) architecture design and (2) pre-training: (1) In NVS with explicit 3D representations (3DGS, NeRF) reconstruction is typically difficult and slow, but rendering is much faster and simpler. We mimic this process in the network design: we use a large (1B params) encoder and a small, lightweight decoder (ViT-B). This allows increasing the network capacity while still achieving real-time rendering. (2) The encoder, initialized from VGGT, was pre-trained with 3D reconstruction objectives, making the initial features 3D aware. Both substantially improve performance. Project page: Code: Paper: Models: Work done with Jianyuan, Minghao Chen, Christian Rupprecht, and Andrea Vedaldi

Stan Szymanowicz

31,395 views • 2 months ago

Introducing Meta Locate 3D: a model for accurate object localization in 3D environments. Learn how Meta Locate 3D can help robots accurately understand their surroundings and interact more naturally with humans. You can download the model and dataset, read our research paper, and even try a demo!

AI at Meta

81,287 views • 1 year ago

📢 Exciting news! Leafmap now supports creating 3D 🌎 maps with Mapbox! This demo showcases how to visualize the ESA global land cover from #EarthEngine on a 3D globe. Explore the 100 PB+ Earth Engine Data Catalog and display any geospatial datasets in 3D 🚀 Notebook: #leafmap #geospatial #opensource #python

Qiusheng Wu

18,696 views • 1 year ago

Ultrasound-derived 3D reconstruction with zero manual processing (CARTO v8) 1. Rotate ICE probe 2. Acquire frames 3. Receive annotated, segmented 3D LA model (with LAA cut away already) compliments of deep learning algorithm Moussa Mansour

Tom De Potter

11,759 views • 2 years ago

🔥 3D-LLMs go brrrr! 🚀 Excited to announce our latest research on scaling 3D-LLM training data to *million-scale* with *dense grounding*. 🌟 Introducing 3D-GRAND: a pioneering dataset featuring 40,087 household scenes paired with 6.2 million densely-grounded 3D-text pairs. 🏠💬 #LLM #3D #AI #ML #Robotics #EmbodiedAI

Jianing “Jed” Yang

27,626 views • 2 years ago

SAM 3D enables accurate 3D reconstruction from a single image, supporting real-world applications in editing, robotics, and interactive scene generation. Matt, a SAM 3D researcher, explains how the two-model design makes this possible for both people and complex environments. 🔗 Read the SAM 3D Objects research paper: 🔗 Read the SAM 3D Body research paper:

AI at Meta

17,858 views • 6 months ago