Can we synthesize 3D human-scene interactions without learning from any 3D data? Yes! Check out Lei Li's GenZI, a novel zero-shot approach to generating 3D interactions by distilling priors from large vision-language models.

106,850 views • 2 years ago • via X (Twitter)

10 comments

Michael Black • 2 years ago

@craigleili Very creative! Love it.

Dan Casas • 2 years ago

@craigleili Great idea and super well presented. Love it!

ScottieFox • 2 years ago

@craigleili There must exist a vector for the opposite as well. Since the paper clearly shows an inpainting mask of human 2D interactions, then one could assume a "place this actor in a scene" - via the same text encoding.

Hongwei Yi • 2 years ago

@craigleili The idea and the results are super nice!!! Can't wait to use.

Thiemo Alldieck • 2 years ago

@craigleili creative idea!

Chenfanfu Jiang • 2 years ago

@craigleili Inspiring

Dávid Komorowicz • 2 years ago

@craigleili Oh no, don't sit on the Guzheng😰

Chris Han • 2 years ago

@craigleili @memdotai mem it

Leo • 2 years ago

@craigleili so cool

Naureen Mahmood • 2 years ago

@craigleili I really like the method presented here, not to mention the lovely video! Very nice work.

Related videos

3D-LLM: Injecting the 3D World into Large Language Models

Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning. Powerful as these models can be, they are not grounded in the 3D physical world, which involves richer concepts such as spatial relationships, affordances, physics, layout, and so on. In this work, we propose to inject the 3D world into large language models and introduce a whole new family of 3D-LLMs. Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks, including captioning, dense captioning, 3D question answering, task decomposition, 3D grounding, 3D-assisted dialog, navigation, and so on. Using three types of prompting mechanisms that we design, we are able to collect over 300k 3D-language data covering these tasks. To efficiently train 3D-LLMs, we first utilize a 3D feature extractor that obtains 3D features from rendered multi-view images. Then, we use 2D VLMs as our backbones to train our 3D-LLMs. By introducing a 3D localization mechanism, 3D-LLMs can better capture 3D spatial information. Experiments on ScanQA show that our model outperforms state-of-the-art baselines by a large margin (e.g., the BLEU-1 score surpasses state-of-the-art score by 9%). Furthermore, experiments on our held-in datasets for 3D captioning, task composition, and 3D-assisted dialogue show that our model outperforms 2D VLMs. Qualitative examples also show that our model could perform more tasks beyond the scope of existing LLMs and VLMs.

AK

249,572 views • 2 years ago