We present VLM-3R: a Vision-Language Model capable of 3D spatial reasoning from monocular video, grounding visual cues, geometry, and camera motion. ✅ No depth sensor ✅ No pre-built 3D maps ✅ End-to-end spatial + temporal reasoning 🔗 Code & benchmark: #VLM #3DVision #LLMs

Zhiwen(Aaron) Fan

1,808 subscribers

14,895 views • 1 year ago • via X (Twitter)

4 comments

Lennie Budgell ❇️ • 1 year ago

This is some really great stuff that I have been looking forward to seeing come into existence, with this level of accuracy and range of uses. Finally. Thanks, y'all; excited to start playing around with the code.

PowerBeatsVR • 3 years ago

VR fitness app PowerBeatsVR is NOW LIVE on the official Meta Quest store! Get fit in VR without any expensive subscription:

Wenbo Hu • 1 year ago

Great work! I have a general question: why is CUT3R preferred over VGGT as the spatial encoder?

Zhiwen(Aaron) Fan • 1 year ago

Great question. We’re aiming to equip VLMs with metric-scale geometric sensing.

Related videos

Spatial reasoning is a major challenge for foundation models today, even in simple tasks like arranging objects in 3D space. #CVPR2025 Introducing LayoutVLM, a differentiable optimization framework that uses a VLM to spatially reason about diverse scene layouts from unlabeled assets and open-ended language instructions 1/n

Fan-Yun Sun

92,545 views • 1 year ago

SpatialTrackerV2: unified, end-to-end 3D point tracking model which simultaneously estimates Camera Motion, Consistent Geometry and Pixel-wise 3D Trajectories.

Bilawal Sidhu

20,346 views • 11 months ago

Scaling 3D scene data is a long-standing challenge in scene understanding, spatial reasoning, and robotics. Since scanning, reconstruction, and labeling are so labor-intensive, data scarcity has remained a major bottleneck. 🛑 To solve this, we propose SceneVerse++: Lifting Unlabeled Internet-level Data for 3D Scene Understanding (CVPR 2026). By reconstructing internet videos and annotating 3D scenes automatically, we’ve created a massive real-world dataset for end-to-end understanding. 🌐📐 SceneVerse++ makes it easy to scale "in-the-wild" 3D scenes toward more capable spatial reasoning systems. This significantly promotes progress in 3D VQA, visual navigation, and broader tasks in Embodied AI and Robotics. 🤖🦾 We are fully open-sourced! Check out the paper, code, and data here: 🌐 Project: 📄 Paper: 📊 Dataset: Code:

Siyuan Huang

12,612 views • 1 month ago

3D-LLM: Injecting the 3D World into Large Language Models paper page: Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning. Powerful as these models can be, they are not grounded in the 3D physical world, which involves richer concepts such as spatial relationships, affordances, physics, layout, and so on. In this work, we propose to inject the 3D world into large language models and introduce a whole new family of 3D-LLMs. Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks, including captioning, dense captioning, 3D question answering, task decomposition, 3D grounding, 3D-assisted dialog, navigation, and so on. Using three types of prompting mechanisms that we design, we are able to collect over 300k 3D-language data covering these tasks. To efficiently train 3D-LLMs, we first utilize a 3D feature extractor that obtains 3D features from rendered multi-view images. Then, we use 2D VLMs as our backbones to train our 3D-LLMs. By introducing a 3D localization mechanism, 3D-LLMs can better capture 3D spatial information. Experiments on ScanQA show that our model outperforms state-of-the-art baselines by a large margin (e.g., the BLEU-1 score surpasses the state-of-the-art score by 9%). Furthermore, experiments on our held-in datasets for 3D captioning, task composition, and 3D-assisted dialogue show that our model outperforms 2D VLMs. Qualitative examples also show that our model could perform more tasks beyond the scope of existing LLMs and VLMs.

AK

249,572 views • 2 years ago

Supervised learning has held 3D Vision back for too long. Meet RayZer, a self-supervised 3D model trained with zero 3D labels: ❌ No supervision of camera & geometry ✅ Just RGB images And the wild part? RayZer outperforms supervised methods (as 3D labels from COLMAP are noisy) 🌐 Project: (1/4)

Hanwen Jiang

69,527 views • 1 year ago

Today we’re releasing K2 Think V2, our most capable open-source reasoning model to date. This is a fully sovereign model: trained end-to-end on IFM-curated and synthesized data, with complete transparency from pre-training through final reasoning alignment.

MBZUAI

287,724 views • 4 months ago

NVIDIA Cosmos Reason 2 is here. 🥳 An open, highly accurate reasoning vision language model for physical AI, featuring: ✅ Improved spatio-temporal understanding and timestamp precision ✅ Flexible deployment with 2B and 8B model sizes ✅ Long-context reasoning with up to 256K tokens ✅ Expanded visual perception across complex environments We also have new Cosmos releases: Predict 2.5, Transfer 2.5, and the NVIDIA GR00T N1.6 robot foundation model. 📗Read our technical blog: 🤗 Download Cosmos Reason 2 on Hugging Face:

NVIDIA AI Developer

45,677 views • 5 months ago

Gemini Robotics 1.5 features a separate reasoning engine (ER), but its VLA model is also capable of thinking due to interleaved reasoning tokens. The VLA is able to independently operate long autonomous sequences (15+ minutes) without aid from the ER/VLM.

The Humanoid Hub

34,214 views • 7 months ago

Meet MapAnything – a transformer that directly regresses factored metric 3D scene geometry (from images, calibration, poses, or depth) in an end-to-end way. No pipelines, no extra stages. Just 3D geometry & cameras, straight from any type of input, delivering new state-of-the-art results 🚀 One universal model enables SoTA for: 🔥 Mono Depth Estimation 🔥 Multi-View SfM 🔥 Multi-View Stereo 🔥 Depth Completion 🔥 Registration … and many more possibilities! – plus everything is metric 🎯 We release code for data processing, training, benchmarking & ablations – everything Apache 2.0! Details & Links 👇

Nikhil Keetha

122,648 views • 9 months ago

Alibaba presents MIMO: Controllable Character Video Synthesis with Spatial Decomposed Modeling. Character video synthesis aims to produce realistic videos of animatable characters within lifelike scenes. As a fundamental problem in the computer vision and graphics community, 3D works typically require multi-view captures for per-case training, which severely limits their applicability of modeling arbitrary characters in a short time. Recent 2D methods break this limitation via pre-trained diffusion models, but they struggle for pose generality and scene interaction. To this end, we propose MIMO, a novel framework which can not only synthesize character videos with controllable attributes (i.e., character, motion and scene) provided by simple user inputs, but also simultaneously achieve advanced scalability to arbitrary characters, generality to novel 3D motions, and applicability to interactive real-world scenes in a unified framework. The core idea is to encode the 2D video to compact spatial codes, considering the inherent 3D nature of video occurrence. Concretely, we lift the 2D frame pixels into 3D using monocular depth estimators, and decompose the video clip to three spatial components (i.e., main human, underlying scene, and floating occlusion) in hierarchical layers based on the 3D depth. These components are further encoded to canonical identity code, structured motion code and full scene code, which are utilized as control signals of synthesis process. The design of spatial decomposed modeling enables flexible user control, complex motion expression, as well as 3D-aware synthesis for scene interactions. Experimental results demonstrate effectiveness and robustness of the proposed method.

AK

148,955 views • 1 year ago

InSpatio-WorldFM transforms single photos into multi-view consistent 3D worlds. - Explicit 3D anchors + implicit neural state. - Zero-drift spatial reasoning. Real-time interactive exploration.

Wildminder

10,808 views • 3 months ago

HunyuanWorld-Voyager is here and fully open-source! The world’s first ultra-long-range world model with native 3D reconstruction, redefining AI-driven spatial intelligence for VR, gaming, and simulations. ✅ Direct 3D Output: Exports point cloud videos to 3D formats without tools like COLMAP, enabling instant 3D application use. ✅ Innovative 3D Memory: Introduces a scalable world caching mechanism, ensuring geometric consistency across any camera trajectory. ✅ Top-Ranked Performance: #1 on Stanford’s WorldScore, excelling in video generation and 3D reconstruction benchmarks. Built on HunyuanWorld 1.0, Voyager blends video generation with 3D modeling, delivering camera-controlled, high-fidelity RGB-D sequences. Control scenes via keyboard or joystick for unmatched 3D consistency. Explore now: 🌐Project Page: 🔗GitHub: 🤗HuggingFace: 📝Technical Details:

Tencent Hy

198,207 views • 9 months ago

New research with Tsinghua University: Spatial-TTT. A framework for streaming visual-based spatial intelligence with test-time training (TTT). Spatial-TTT adapts fast weights to capture and organize spatial evidence from long video streams, enabling models to build structured 3D spatial memory over time. Highlights: 🔹Efficient streaming memory. Fast weights act as compact spatial memory with sublinear memory growth over 7000+ frames and more than 40% lower compute. 🔹Spatial-predictive mechanism. TTT layers with 3D spatiotemporal convolution capture geometric correspondence and temporal continuity. 🔹SOTA results on long-horizon video spatial understanding (VSI-Bench). The paper ranked #1 on Hugging Face Daily Papers on March 13. Project page: GitHub: Paper: Model & Data:

Tencent Hy

20,792 views • 3 months ago

Google and Meta just dropped a joint AI model that understands full 360° scene geometry. Depth, normals, sky masks, metric depth — state of the art. Code already public. THE SPATIAL COMPUTING INFRASTRUCTURE LAYER IS BEING QUIETLY BUILT.

0xMarioNawfal

39,927 views • 8 days ago

I'm excited to share our new work Align3R that estimates camera poses and consistent depth maps from a monocular video of a dynamic scene. Project page: Code: Paper:

Yuan Liu

56,547 views • 1 year ago

🕹️We are excited to introduce "ChronoEdit: Towards Temporal Reasoning for Image Editing and World Simulation" ChronoEdit reframes image editing as a video generation task to encourage temporal consistency. It leverages a temporal reasoning stage that denoises with “video reasoning tokens” to "reason" on physically plausible edits. See the attached video for results. Project Page: Arxiv: Code and model are coming.

Huan Ling

36,841 views • 8 months ago

Introducing General Intuition and our $133.7M Seed from Khosla Ventures, General Catalyst, and Raine. We build foundation models and general agents for environments that require deep spatial and temporal reasoning.

General Intuition

2,275,375 views • 8 months ago

Fundamental spatial and temporal understanding is the bedrock upon which robots will learn motor control. These types of Embodied Reasoning capabilities enable policy learning but also inference abilities like image or video conditioning. We just released a new SOTA ER model!

Ted Xiao

35,893 views • 8 months ago

🚀 Introducing Articulate Anymesh, now open-sourced! An automated framework behind our Genesis simulator, capable of transforming any rigid 3D mesh into its articulated counterpart in an open-vocabulary manner! Given a 3D mesh, our framework uses VLMs + visual prompting to extract rich semantics, enabling part segmentation and functional joint construction automatically! 🔗 Code: 📄 Paper: 🌐 Project: #EmbodiedAI #3D #OpenSource #VLM #MeshProcessing

Chuang Gan

35,936 views • 1 year ago