We introduce 🔥X-InstructBLIP🔥, a simple and effective scalable cross-modal framework to empower LLMs to handle tasks across modalities such as text, image, video, sound, and 3D. Web: ArXiv: Code:

37,476 views • 2 years ago • via X (Twitter)

Comments: 5

Caiming Xiong, 2 years ago

We extend InstructBLIP’s instruction-aware representations beyond images to 3D, audio, and video. Despite the lack of modality-specific pre-training, X-InstructBLIP achieves comparable performance to SoTA models on a variety of out-of-domain tasks and modalities.

Caiming Xiong, 2 years ago

Despite having no joint modality training and relying on distinct frozen pre-trained encoders for each modality, X-InstructBLIP demonstrates emergent capabilities in cross-modal comprehension.

Caiming Xiong, 2 years ago

To evaluate its abilities, we introduce a new Cross-modal Discriminative Reasoning benchmark (DisCRn): given two inputs from different modalities, the model must select the entity that matches the queried property.

Caiming Xiong, 2 years ago

X-InstructBLIP outperforms a strong SoTA captioning baseline on the new DisCRn task by 6.3 and 3.2 points for image-3D and audio-video pairs, respectively. Nevertheless, the task remains an open challenge.

Caiming Xiong, 2 years ago

Thanks to all awesome collaborators: @artemispng, @Le_Xue01, @realNingYu, @LiJunnan0409, @dongxuli_, @JotyShafiq, @stanleyran, @silviocinguetta and @jcniebles
