We introduce 🔥X-InstructBLIP🔥, a simple, effective, and scalable cross-modal framework that empowers LLMs to handle tasks across modalities such as text, image, video, sound, and 3D. Web: ArXiv: Code:
37,476 views • 2 years ago • via X (Twitter)

We extend InstructBLIP’s instruction-aware representations beyond images to 3D, audio, and video. Despite the lack of modality-specific pre-training, X-InstructBLIP achieves comparable performance to SoTA models on a variety of out-of-domain tasks and modalities.

Even though it uses a distinct frozen pre-trained encoder for each modality and has no joint modality training, X-InstructBLIP demonstrates emergent capabilities in cross-modal comprehension.

To evaluate these abilities, we introduce a new Cross-modal Discriminative Reasoning benchmark (DisCRn): given two inputs from distinct modalities, the model must select the entity that matches the queried property.
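As a rough illustration of the task format, here is a minimal Python sketch of what one DisCRn-style example and an accuracy score over such examples could look like. The `DiscrnExample` class, its field names, the "a"/"b" answer convention, and the `accuracy` helper are all hypothetical and chosen for illustration; they are not taken from the benchmark's released data or code.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class DiscrnExample:
    """Hypothetical record layout; field names are illustrative only."""
    input_a: str      # path or ID of the first input
    modality_a: str   # e.g. "image" or "audio"
    input_b: str      # path or ID of the second input
    modality_b: str   # e.g. "3d" or "video"
    question: str     # the property being queried
    answer: str       # "a" or "b": which input matches the property


def accuracy(
    examples: Sequence[DiscrnExample],
    predict: Callable[[DiscrnExample], str],
) -> float:
    """Fraction of examples where the predicted entity equals the answer."""
    if not examples:
        raise ValueError("need at least one example")
    # Normalize the prediction so "A " and "a" count as the same choice.
    correct = sum(
        predict(ex).strip().lower() == ex.answer for ex in examples
    )
    return correct / len(examples)
```

Any model can be plugged in as `predict`, as long as it maps an example to one of the two choices.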

X-InstructBLIP outperforms a strong SoTA captioning baseline on the new DisCRn task by 6.3 and 3.2 points for image-3D and audio-video pairs, respectively. Nevertheless, the task remains an open challenge.

Thanks to all awesome collaborators: @artemispng, @Le_Xue01, @realNingYu, @LiJunnan0409, @dongxuli_, @JotyShafiq, @stanleyran, @silviocinguetta and @jcniebles

