We are releasing 4M-21 with a permissive license, including its source code and trained models. It's a pretty effective multimodal model that solves 10s of tasks & modalities. See the demo code, sample results, and the tokenizers of diverse modalities on the website. IMO, the multitask learning aspect of...

69,241 views • 2 years ago • via X (Twitter)

6 Comments

Amir Zamir • 2 years ago

shoutout to @roman__bachmann, @oguzhanthefatih, @dmizrahi_ who led the work, along with @aligarjani, @mingfei_gao, David Griffiths, @hujm99, @afshin_dn, @zamir_ar.

Shikun Liu • 2 years ago

Great work! And thanks for open-sourcing to the community. :)

Isaac Kargar • 2 years ago

No audio input?

Amir Zamir • 2 years ago

It’s a matter of data. Otherwise IMO the method will work as-is.

Lele • 2 years ago

massive

Puneet (Linkedin Top Voice | AI and Data speaker) • 2 years ago

@zamir_ar You guys are nailing it! #multimodal #framework

Similar Videos

VITA: Towards Open-Source Interactive Omni Multimodal LLM

The remarkable multimodal capabilities and interactive experience of GPT-4o underscore their necessity in practical applications, yet open-source models rarely excel in both areas. In this paper, we introduce VITA, the first-ever open-source Multimodal Large Language Model (MLLM) adept at simultaneous processing and analysis of Video, Image, Text, and Audio modalities, and meanwhile has an advanced multimodal interactive experience. Starting from Mixtral 8x7B as a language foundation, we expand its Chinese vocabulary followed by bilingual instruction tuning. We further endow the language model with visual and audio capabilities through two-stage multi-task learning of multimodal alignment and instruction tuning. VITA demonstrates robust foundational capabilities of multilingual, vision, and audio understanding, as evidenced by its strong performance across a range of both unimodal and multimodal benchmarks. Beyond foundational capabilities, we have made considerable progress in enhancing the natural multimodal human-computer interaction experience. To the best of our knowledge, we are the first to exploit non-awakening interaction and audio interrupt in MLLM. VITA is the first step for the open-source community to explore the seamless integration of multimodal understanding and interaction. While there is still lots of work to be done on VITA to get close to closed-source counterparts, we hope that its role as a pioneer can serve as a cornerstone for subsequent research.

AK

23,958 views • 1 year ago

Small Language Models (SLM) are the future of AI. "Small" (SLM) instead of "Large" (LLM).

These small models are highly specialized models with superhuman abilities on specific tasks. Here are two techniques to build these models:

• Spectrum
• Model Merging

I give you a short introduction in the attached video, but here is a quick summary:

Spectrum helps us identify the most relevant layers to solve one specific task. We can ignore everything else and focus on fine-tuning these layers. Using Spectrum, we can fine-tune models in a heartbeat.

Model Merging combines multiple models into a unique, much better model than any of the individual input models. You can also combine models specialized in different tasks and get a model with multiple abilities.

This is the state of the art of productizing models. It's what Arcee.ai's platform does behind the scenes. Arcee collaborated with me on this post and is sponsoring it.

There are three main steps to produce a model for your particular use case:

1. You create a dataset by uploading your data.
2. You train a model. At this step, Arcee uses Spectrum and Model Merging to produce a highly specialized model for your task.
3. You can deploy that model to any environment you want.

Three important notes:

• Training process is 2x faster and 2x cheaper than regular fine-tuning.
• Resultant models are smaller and have higher accuracy.
• They create these specialized models from open-source models.

Check this site so you can fully appreciate how this works:

If you want to fine-tune an open-source model, consider Arcee's platform. This is the state of the art.

Santiago

164,162 views • 1 year ago