Video wird geladen...

Video konnte nicht geladen werden

Beim Laden dieses Videos ist ein Problem aufgetreten. Dies könnte an einem vorübergehenden Netzwerkproblem liegen oder das Video ist möglicherweise nicht verfügbar.

Spatial reasoning is a major challenge for the foundation models today, even in simple tasks like arranging objects in 3D space. #CVPR2025 Introducing LayoutVLM, a differentiable optimization framework that uses VLM to spatially reason about diverse scene layouts from unlabeled assets and open-ended language instructions 1/n

Fan-Yun Sun

4,106 subscribers

92,545 Aufrufe • vor 1 Jahr •via X (Twitter)

Bildung Wissenschaft & Technologie #CVPR2025

Anya Rossi• Live Now

Private livecam show

11 Kommentare

Profilbild von Fan-Yun Sun

Fan-Yun Sunvor 1 Jahr

Due to the lack of 3D and dimensional awareness in LLMs, existing methods struggle to generate scenes that are 🔹physically plausible (i.e., no collision) 🔹semantically aligned (i.e., objects are placed meaningfully according to the language instruction) 2/n

Profilbild von Fan-Yun Sun

Fan-Yun Sunvor 1 Jahr

Our key idea: Use a VLM to produce two complementary representations and enforce mutual consistency for better spatial reasoning. 🔹 Initialization: predict numerical poses from visually marked multi-view images 🔹 Optimization: generate spatial relations as differentiable objectives 3/n

Profilbild von Fan-Yun Sun

Fan-Yun Sunvor 1 Jahr

The 3D layout optimization landscape is full of local minima—how can we escape them? 🔹 We refine the optimization objectives by validating them against the predicted numerical initialization (code is verifiable!). 🔹 We further finetune our VLM on human-designed 3D scene datasets (i.e., 3D-FRONT) 4/n

Profilbild von Fan-Yun Sun

Fan-Yun Sunvor 1 Jahr

LayoutVLM outperforms existing methods in our benchmark, where models arrange up to *80* 3D assets given a language instruction and a floor plan. 5/n

Profilbild von Fan-Yun Sun

Fan-Yun Sunvor 1 Jahr

Automated 3D layout generation unlocks richer simulation environments for robotics and embodied AI, enabling: 🔹 More realistic scenes and layouts during training 🔹 Improved generalization for real-world deployment Consider scene_synthesizer by @clembow, which shares a similar purpose 6/n

Profilbild von Fan-Yun Sun

Fan-Yun Sunvor 1 Jahr

Beyond research, consider this: Rockstar Games spends $100M+ and countless human hours meticulously placing 3D assets to create immersive game worlds like GTA. When combined with asset generation models, a model that can spatially reason could automate content creation for gaming, VR/AR, film production, etc. 7/n

Profilbild von Fan-Yun Sun

Fan-Yun Sunvor 1 Jahr

Huge thanks to the amazing team: @Weiyu_Liu_ (co-lead), Siyi Gu, @dill_pkl , Goutam Bhat, @fedassa , @ManlingLi_ , @nickhaber , @jiajunwu_cs 🌐Project site: 💻 Code (we plan to open-source everything): n/n

Profilbild von Eli Schwartz

Eli Schwartzvor 1 Jahr

Many sites are too focused trying to chase the PHD level of SEO best practices when, in reality, they would get the most value from just getting their basics right. Internal links is a basic:

Profilbild von stoey

stoeyvor 1 Jahr

wonder if i can test out some Sprinter van interior modifications with this…

Profilbild von Rich S. Sasuj

Rich S. Sasujvor 1 Jahr

How does this help industries like construction using revit in a real world scenario?

Profilbild von Scherazad Falcon

Scherazad Falconvor 1 Jahr

How about arranging them in your mind just thinking & visualizing.

Ähnliche Videos

What happens when VLMs meet 3D foundation models? See VLM-3R (CVPR 2026). VLM-3R links a vision-language model (e.g., Qwen) with 3D geometric foundation models (e.g., CUT3R) at metric scale. Given an uncalibrated video, it moves beyond pixels to perceive and reason in 3D space. Code (open source):

What happens when VLMs meet 3D foundation models? See VLM-3R (CVPR 2026). VLM-3R links a vision-language model (e.g., Qwen) with 3D geometric foundation models (e.g., CUT3R) at metric scale. Given an uncalibrated video, it moves beyond pixels to perceive and reason in 3D space. Code (open source):

Zhiwen(Aaron) Fan

10,625 Aufrufe • vor 3 Monaten

1/N Most Vision-Language-Action models need tons of data for finetuning, and still fail for new objects and instructions. Introducing OTTER, a lightweight, easy-to-train model that uses text-aware visual features to nail unseen tasks out of the box! Here's how it works 👇

1/N Most Vision-Language-Action models need tons of data for finetuning, and still fail for new objects and instructions. Introducing OTTER, a lightweight, easy-to-train model that uses text-aware visual features to nail unseen tasks out of the box! Here's how it works 👇

Fangchen Liu

68,312 Aufrufe • vor 1 Jahr

We present VLM-3R: a Vision-Language Model capable of 3D spatial reasoning from monocular video, grounding visual cues, geometry, and camera motion. ✅ No depth sensor ✅ No pre-built 3D maps ✅ End-to-end spatial + temporal reasoning 🔗 Code & benchmark: #VLM #3DVision #LLMs

We present VLM-3R: a Vision-Language Model capable of 3D spatial reasoning from monocular video, grounding visual cues, geometry, and camera motion. ✅ No depth sensor ✅ No pre-built 3D maps ✅ End-to-end spatial + temporal reasoning 🔗 Code & benchmark: #VLM #3DVision #LLMs

Zhiwen(Aaron) Fan

14,895 Aufrufe • vor 1 Jahr

Vision-language models perform diverse tasks via in-context learning. Time for robots to do the same! Introducing In-Context Robot Transformer (ICRT): a robot policy that learns new tasks by prompting with robot trajectories, without any fine-tuning. [1/N]

Vision-language models perform diverse tasks via in-context learning. Time for robots to do the same! Introducing In-Context Robot Transformer (ICRT): a robot policy that learns new tasks by prompting with robot trajectories, without any fine-tuning. [1/N]

Max Fu

40,435 Aufrufe • vor 1 Jahr

Scaling 3D scene data is a long-standing challenge in scene understanding, spatial reasoning, and robotics. Since scanning, reconstruction, and labeling are so labor-intensive, data scarcity has remained a major bottleneck. 🛑 To solve this, we propose SceneVerse++: Lifting Unlabeled Internet-level Data for 3D Scene Understanding (CVPR 2026). By reconstructing internet videos and annotating 3D scenes automatically, we’ve created a massive real-world dataset for end-to-end understanding. 🌐📐 SceneVerse++ makes it easy to scale "in-the-wild" 3D scenes toward more capable spatial reasoning systems. This significantly promotes progress in 3D VQA, visual navigation, and broader tasks in Embodied AI and Robotics. 🤖🦾 We are fully open-sourced! Check out the paper, code, and data here: 🌐 Project: 📄 Paper: 📊 Dataset: Code:

Scaling 3D scene data is a long-standing challenge in scene understanding, spatial reasoning, and robotics. Since scanning, reconstruction, and labeling are so labor-intensive, data scarcity has remained a major bottleneck. 🛑 To solve this, we propose SceneVerse++: Lifting Unlabeled Internet-level Data for 3D Scene Understanding (CVPR 2026). By reconstructing internet videos and annotating 3D scenes automatically, we’ve created a massive real-world dataset for end-to-end understanding. 🌐📐 SceneVerse++ makes it easy to scale "in-the-wild" 3D scenes toward more capable spatial reasoning systems. This significantly promotes progress in 3D VQA, visual navigation, and broader tasks in Embodied AI and Robotics. 🤖🦾 We are fully open-sourced! Check out the paper, code, and data here: 🌐 Project: 📄 Paper: 📊 Dataset: Code:

Siyuan Huang

12,612 Aufrufe • vor 2 Monaten

Can video generative models exhibit visuospatial intelligence? 🤔 Introducing Video4Spatial — a video-only framework that tackles spatial tasks. With just video context, our model can: 🔍 Ground objects by planning geometry-consistent paths 📸 Follow camera-pose instructions for scene navigation 🌐 Generalize to long contexts & unseen outdoor scenes A step toward video models as visual-spatial reasoners. Project: arXiv:

Can video generative models exhibit visuospatial intelligence? 🤔 Introducing Video4Spatial — a video-only framework that tackles spatial tasks. With just video context, our model can: 🔍 Ground objects by planning geometry-consistent paths 📸 Follow camera-pose instructions for scene navigation 🌐 Generalize to long contexts & unseen outdoor scenes A step toward video models as visual-spatial reasoners. Project: arXiv:

Xingang Pan

15,902 Aufrufe • vor 6 Monaten

How to harness foundation models for *generalization in the wild* in robot manipulation? Introducing VoxPoser: use LLM+VLM to label affordances and constraints directly in 3D perceptual space for zero-shot robot manipulation in the real world! 🌐 🧵👇

How to harness foundation models for generalization in the wild in robot manipulation? Introducing VoxPoser: use LLM+VLM to label affordances and constraints directly in 3D perceptual space for zero-shot robot manipulation in the real world! 🌐 🧵👇

Wenlong Huang

293,876 Aufrufe • vor 3 Jahren

Introducing “FlowMap”, the first self-supervised, differentiable structure-from-motion method that is competitive with conventional SfM like Colmap! IMO this solves a major missing piece for internet-scale training of 3D Deep Learning methods. 1/n

Introducing “FlowMap”, the first self-supervised, differentiable structure-from-motion method that is competitive with conventional SfM like Colmap! IMO this solves a major missing piece for internet-scale training of 3D Deep Learning methods. 1/n

Vincent Sitzmann

128,581 Aufrufe • vor 2 Jahren

Reasoning models tailored for diverse AI tasks. 🚀 Meet Amazon Nova 2 foundation models, supporting fast, cost-effective reasoning to multimodal capabilities. Power versatile tasks like AI agents, code-generation, and Conversational AI. Choose the perfect match for your workload.

Reasoning models tailored for diverse AI tasks. 🚀 Meet Amazon Nova 2 foundation models, supporting fast, cost-effective reasoning to multimodal capabilities. Power versatile tasks like AI agents, code-generation, and Conversational AI. Choose the perfect match for your workload.

Amazon Web Services

2,074,714 Aufrufe • vor 6 Monaten

Introducing CUT3R! An online 3D reasoning framework for many 3D tasks directly from just RGB. For static or dynamic scenes. Video or image collections, all in one!

Introducing CUT3R! An online 3D reasoning framework for many 3D tasks directly from just RGB. For static or dynamic scenes. Video or image collections, all in one!

Qianqian Wang

145,864 Aufrufe • vor 1 Jahr

📢 OneCanvas: 3D Scene Understanding via Panoramic Reprojection We extract features from video frames and reproject them into one occlusion-free view of the whole scene that a 2D VLM reads just like a normal image. We can center this view on any viewpoint, including an agent's own pose for situated reasoning. The same projection lets us create spatial training tasks with no human annotation, solvable only by reasoning over the 3D positions of real object features placed on an otherwise empty canvas. The result is a stock 2D VLM that reasons in 3D, setting a new state of the art across spatial benchmarks at far less compute. 🌐 ▶️ Great work by Bartłomiej Baranowski & Dave Zhenyu Chen

📢 OneCanvas: 3D Scene Understanding via Panoramic Reprojection We extract features from video frames and reproject them into one occlusion-free view of the whole scene that a 2D VLM reads just like a normal image. We can center this view on any viewpoint, including an agent's own pose for situated reasoning. The same projection lets us create spatial training tasks with no human annotation, solvable only by reasoning over the 3D positions of real object features placed on an otherwise empty canvas. The result is a stock 2D VLM that reasons in 3D, setting a new state of the art across spatial benchmarks at far less compute. 🌐 ▶️ Great work by Bartłomiej Baranowski & Dave Zhenyu Chen

Matthias Niessner

24,420 Aufrufe • vor 10 Tagen

What structural task representation enables multi-stage, in-the-wild, bimanual, reactive manipulation? Introducing ReKep: LVM to label keypoints & VLM to write keypoint-based constraints, solve w/ optimization for diverse tasks, w/o task-specific training or env models. 🧵👇

What structural task representation enables multi-stage, in-the-wild, bimanual, reactive manipulation? Introducing ReKep: LVM to label keypoints & VLM to write keypoint-based constraints, solve w/ optimization for diverse tasks, w/o task-specific training or env models. 🧵👇

Wenlong Huang

190,887 Aufrufe • vor 1 Jahr

Introducing Collaborative Reasoner: a framework to improve collaborative reasoning in language models. Collaborative Reasoner paves the way for developing social agents that can partner with humans and other agents. Read the research paper and download the code.

Introducing Collaborative Reasoner: a framework to improve collaborative reasoning in language models. Collaborative Reasoner paves the way for developing social agents that can partner with humans and other agents. Read the research paper and download the code.

AI at Meta

58,510 Aufrufe • vor 1 Jahr

3D-LLM: Injecting the 3D World into Large Language Models paper page: Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning. Powerful as these models can be, they are not grounded in the 3D physical world, which involves richer concepts such as spatial relationships, affordances, physics, layout, and so on. In this work, we propose to inject the 3D world into large language models and introduce a whole new family of 3D-LLMs. Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks, including captioning, dense captioning, 3D question answering, task decomposition, 3D grounding, 3D-assisted dialog, navigation, and so on. Using three types of prompting mechanisms that we design, we are able to collect over 300k 3D-language data covering these tasks. To efficiently train 3D-LLMs, we first utilize a 3D feature extractor that obtains 3D features from rendered multi- view images. Then, we use 2D VLMs as our backbones to train our 3D-LLMs. By introducing a 3D localization mechanism, 3D-LLMs can better capture 3D spatial information. Experiments on ScanQA show that our model outperforms state-of-the-art baselines by a large margin (e.g., the BLEU-1 score surpasses state-of-the-art score by 9%). Furthermore, experiments on our held-in datasets for 3D captioning, task composition, and 3D-assisted dialogue show that our model outperforms 2D VLMs. Qualitative examples also show that our model could perform more tasks beyond the scope of existing LLMs and VLMs.

3D-LLM: Injecting the 3D World into Large Language Models paper page: Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning. Powerful as these models can be, they are not grounded in the 3D physical world, which involves richer concepts such as spatial relationships, affordances, physics, layout, and so on. In this work, we propose to inject the 3D world into large language models and introduce a whole new family of 3D-LLMs. Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks, including captioning, dense captioning, 3D question answering, task decomposition, 3D grounding, 3D-assisted dialog, navigation, and so on. Using three types of prompting mechanisms that we design, we are able to collect over 300k 3D-language data covering these tasks. To efficiently train 3D-LLMs, we first utilize a 3D feature extractor that obtains 3D features from rendered multi- view images. Then, we use 2D VLMs as our backbones to train our 3D-LLMs. By introducing a 3D localization mechanism, 3D-LLMs can better capture 3D spatial information. Experiments on ScanQA show that our model outperforms state-of-the-art baselines by a large margin (e.g., the BLEU-1 score surpasses state-of-the-art score by 9%). Furthermore, experiments on our held-in datasets for 3D captioning, task composition, and 3D-assisted dialogue show that our model outperforms 2D VLMs. Qualitative examples also show that our model could perform more tasks beyond the scope of existing LLMs and VLMs.

AK

249,572 Aufrufe • vor 2 Jahren

Excited to announce RoboCasa, a large-scale simulation framework of everyday tasks! We use generative AI tools to create diverse objects, scenes, and tasks. Simulation plays a pivotal role in our Data Pyramid for training generalist robots. Open-source at

Excited to announce RoboCasa, a large-scale simulation framework of everyday tasks! We use generative AI tools to create diverse objects, scenes, and tasks. Simulation plays a pivotal role in our Data Pyramid for training generalist robots. Open-source at

Yuke Zhu

141,486 Aufrufe • vor 2 Jahren

Using 3D assets with Gen-4 References is a simple way to bring highly detailed and specific models into your generative workflows for even more consistency and control. To do so, simply provide a location plate, a quick comp of your 3D model in that space and a style reference.

Using 3D assets with Gen-4 References is a simple way to bring highly detailed and specific models into your generative workflows for even more consistency and control. To do so, simply provide a location plate, a quick comp of your 3D model in that space and a style reference.

Runway

119,208 Aufrufe • vor 1 Jahr

Can a single neural network policy generalize over poses, objects, obstacles, backgrounds, scene arrangements, in-hand objects, and start/goal states? Introducing Neural MP: A generalist policy for solving motion planning tasks in the real world 🤖 1/N

Can a single neural network policy generalize over poses, objects, obstacles, backgrounds, scene arrangements, in-hand objects, and start/goal states? Introducing Neural MP: A generalist policy for solving motion planning tasks in the real world 🤖 1/N

Murtaza Dalal

114,538 Aufrufe • vor 1 Jahr

Introducing General Intuition and our $133.7M Seed from Khosla Ventures, General Catalyst, and Raine. We build foundation models and general agents for environments that require deep spatial and temporal reasoning.

Introducing General Intuition and our $133.7M Seed from Khosla Ventures, General Catalyst, and Raine. We build foundation models and general agents for environments that require deep spatial and temporal reasoning.

General Intuition

2,275,551 Aufrufe • vor 8 Monaten

We have seen a lot of legged robots doing navigation in the wild. But how about mobile manipulation in the wild? I have been pushing the direction of learning a unified, efficient, and dynamic 3D representation of scenes (for navigation) and objects (for manipulation) for the past two years. And now we have GeFF --- our large-scale, generalizable feature field, that combines the speed of a feed-forward neural network with the rich semantics from Foundation Models, to handle dynamically changing scenes, and enable open-ended, language-grounded scene and object understanding.

We have seen a lot of legged robots doing navigation in the wild. But how about mobile manipulation in the wild? I have been pushing the direction of learning a unified, efficient, and dynamic 3D representation of scenes (for navigation) and objects (for manipulation) for the past two years. And now we have GeFF --- our large-scale, generalizable feature field, that combines the speed of a feed-forward neural network with the rich semantics from Foundation Models, to handle dynamically changing scenes, and enable open-ended, language-grounded scene and object understanding.

Xiaolong Wang

42,767 Aufrufe • vor 2 Jahren

LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models paper page: github: Recent advancements in text-to-image generation with diffusion models have yielded remarkable results synthesizing highly realistic and diverse images. However, these models still encounter difficulties when generating images from prompts that demand spatial or common sense reasoning. We propose to equip diffusion models with enhanced reasoning capabilities by using off-the-shelf pretrained large language models (LLMs) in a novel two-stage generation process. First, we adapt an LLM to be a text-guided layout generator through in-context learning. When provided with an image prompt, an LLM outputs a scene layout in the form of bounding boxes along with corresponding individual descriptions. Second, we steer a diffusion model with a novel controller to generate images conditioned on the layout. Both stages utilize frozen pretrained models without any LLM or diffusion model parameter optimization. We validate the superiority of our design by demonstrating its ability to outperform the base diffusion model in accurately generating images according to prompts that necessitate both language and spatial reasoning. Additionally, our method naturally allows dialog-based scene specification and is able to handle prompts in a language that is not well-supported by the underlying diffusion model.

LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models paper page: github: Recent advancements in text-to-image generation with diffusion models have yielded remarkable results synthesizing highly realistic and diverse images. However, these models still encounter difficulties when generating images from prompts that demand spatial or common sense reasoning. We propose to equip diffusion models with enhanced reasoning capabilities by using off-the-shelf pretrained large language models (LLMs) in a novel two-stage generation process. First, we adapt an LLM to be a text-guided layout generator through in-context learning. When provided with an image prompt, an LLM outputs a scene layout in the form of bounding boxes along with corresponding individual descriptions. Second, we steer a diffusion model with a novel controller to generate images conditioned on the layout. Both stages utilize frozen pretrained models without any LLM or diffusion model parameter optimization. We validate the superiority of our design by demonstrating its ability to outperform the base diffusion model in accurately generating images according to prompts that necessitate both language and spatial reasoning. Additionally, our method naturally allows dialog-based scene specification and is able to handle prompts in a language that is not well-supported by the underlying diffusion model.

AK

83,657 Aufrufe • vor 2 Jahren