Spatial reasoning is a major challenge for foundation models today, even in simple tasks like arranging objects in 3D space. #CVPR2025 Introducing LayoutVLM, a differentiable optimization framework that uses a VLM to spatially reason about diverse scene layouts from unlabeled assets and open-ended language instructions 1/n

92,545 views • 1 year ago • via X (Twitter)

11 Comments

Fan-Yun Sun • 1 year ago

Due to the lack of 3D and dimensional awareness in LLMs, existing methods struggle to generate scenes that are 🔹physically plausible (i.e., no collision) 🔹semantically aligned (i.e., objects are placed meaningfully according to the language instruction) 2/n
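To make "physically plausible (i.e., no collision)" concrete, here is a minimal sketch of a collision measure. It assumes each asset is reduced to an axis-aligned rectangular footprint on the floor plan; the `Box` type and the function names are hypothetical and are not taken from LayoutVLM.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    """Axis-aligned footprint of an asset on the floor plan (illustrative only)."""
    x: float  # center x
    y: float  # center y
    w: float  # extent along x
    d: float  # extent along y


def overlap_area(a: Box, b: Box) -> float:
    """Intersection area of two footprints; 0.0 means they do not collide."""
    dx = min(a.x + a.w / 2, b.x + b.w / 2) - max(a.x - a.w / 2, b.x - b.w / 2)
    dy = min(a.y + a.d / 2, b.y + b.d / 2) - max(a.y - a.d / 2, b.y - b.d / 2)
    return max(dx, 0.0) * max(dy, 0.0)


def collision_penalty(boxes: List[Box]) -> float:
    """Sum of pairwise overlap areas; a layout with penalty 0.0 is collision-free."""
    total = 0.0
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            total += overlap_area(boxes[i], boxes[j])
    return total
```

A penalty of this kind only covers collisions; semantic alignment with the instruction needs separate objectives.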

Fan-Yun Sun • 1 year ago

Our key idea: Use a VLM to produce two complementary representations and enforce mutual consistency for better spatial reasoning. 🔹 Initialization: predict numerical poses from visually marked multi-view images 🔹 Optimization: generate spatial relations as differentiable objectives 3/n
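The two representations can be illustrated with a toy example: a numerical pose serves as the starting point, and a spatial relation is written as a differentiable loss that gradient descent then minimizes. This is only a sketch under stated assumptions: the single "keep a target distance" relation, the hand-derived gradient, and all function names are hypothetical, not the LayoutVLM implementation.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def distance_loss(pa: Point, pb: Point, target: float) -> float:
    """Squared error between the actual and the desired distance of two objects."""
    d = math.hypot(pa[0] - pb[0], pa[1] - pb[1])
    return (d - target) ** 2


def distance_grad(pa: Point, pb: Point, target: float) -> Point:
    """Analytic gradient of distance_loss with respect to pa."""
    dx, dy = pa[0] - pb[0], pa[1] - pb[1]
    d = math.hypot(dx, dy)
    if d == 0.0:
        # Direction is undefined when the points coincide; return a zero step.
        return (0.0, 0.0)
    scale = 2.0 * (d - target) / d
    return (scale * dx, scale * dy)


def optimize(pa: Point, pb: Point, target: float,
             lr: float = 0.1, steps: int = 200) -> Point:
    """Plain gradient descent on pa, starting from an initial pose guess."""
    x, y = pa
    for _ in range(steps):
        gx, gy = distance_grad((x, y), pb, target)
        x -= lr * gx
        y -= lr * gy
    return (x, y)


# Initial pose guess for a chair, refined so it sits 1.0 unit from a table.
chair_init = (3.0, 0.5)
table_pos = (0.0, 0.0)
chair_final = optimize(chair_init, table_pos, target=1.0)
```

In practice an automatic-differentiation library would compute the gradients; the analytic gradient is used here only to keep the sketch dependency-free.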

Fan-Yun Sun • 1 year ago

The 3D layout optimization landscape is full of local minima—how can we escape them? 🔹 We refine the optimization objectives by validating them against the predicted numerical initialization (code is verifiable!). 🔹 We further finetune our VLM on human-designed 3D scene datasets (i.e., 3D-FRONT) 4/n
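One way to picture validating objectives against the numerical initialization is a simple filter: evaluate each proposed relation on the initial poses and keep only those that the initialization roughly agrees with. The thread does not spell out the actual criterion, so the distance-only constraints, the tolerance rule, and every name below are assumptions made for illustration.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]
# (description, object A, object B, desired distance) -- a hypothetical format.
Constraint = Tuple[str, str, str, float]


def violation(c: Constraint, poses: Dict[str, Point]) -> float:
    """How far the initial poses are from satisfying a distance constraint."""
    _, a, b, target = c
    pa, pb = poses[a], poses[b]
    return abs(math.hypot(pa[0] - pb[0], pa[1] - pb[1]) - target)


def filter_constraints(constraints: List[Constraint],
                       init_poses: Dict[str, Point],
                       tolerance: float) -> List[Constraint]:
    """Keep constraints that the initialization already roughly satisfies."""
    return [c for c in constraints if violation(c, init_poses) <= tolerance]


init_poses = {"sofa": (0.0, 0.0), "table": (1.2, 0.0), "lamp": (4.0, 3.0)}
proposed = [
    ("table near sofa", "sofa", "table", 1.0),
    ("lamp next to sofa", "sofa", "lamp", 0.5),
]
kept = filter_constraints(proposed, init_poses, tolerance=0.5)
```

Dropping objectives that contradict the initialization is one plausible reading; a real system might instead reweight or regenerate them.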

Fan-Yun Sun • 1 year ago

LayoutVLM outperforms existing methods in our benchmark, where models arrange up to *80* 3D assets given a language instruction and a floor plan. 5/n

Fan-Yun Sun • 1 year ago

Automated 3D layout generation unlocks richer simulation environments for robotics and embodied AI, enabling: 🔹 More realistic scenes and layouts during training 🔹 Improved generalization for real-world deployment Consider scene_synthesizer by @clembow, which shares a similar purpose 6/n

Fan-Yun Sun • 1 year ago

Beyond research, consider this: Rockstar Games spends $100M+ and countless human hours meticulously placing 3D assets to create immersive game worlds like GTA. When combined with asset generation models, a model that can spatially reason could automate content creation for gaming, VR/AR, film production, etc. 7/n

Fan-Yun Sun • 1 year ago

Huge thanks to the amazing team: @Weiyu_Liu_ (co-lead), Siyi Gu, @dill_pkl , Goutam Bhat, @fedassa , @ManlingLi_ , @nickhaber , @jiajunwu_cs 🌐Project site: 💻 Code (we plan to open-source everything): n/n

Eli Schwartz • 1 year ago

Many sites are too focused trying to chase the PHD level of SEO best practices when, in reality, they would get the most value from just getting their basics right. Internal links is a basic:

stoey • 1 year ago

wonder if i can test out some Sprinter van interior modifications with this…

Rich S. Sasuj • 1 year ago

How does this help industries like construction that use Revit in a real-world scenario?

Scherazad Falcon • 1 year ago

How about arranging them in your mind just thinking & visualizing.
