🔥 Excited to introduce CoDi-2! It follows complex multimodal-interleaved in-context instructions to generate any modality (text, vision, audio) in a zero/few-shot, interactive way! Ziyi Yang, Yang Liu, Chenguang Zhu, Mohit Bansal 🧵👇

97,533 views • 2 years ago • via X (Twitter)

10 comments

Zineng Tang • 2 years ago

By aligning modalities with language for encoding and generation, CoDi-2 empowers Large Language Models (LLMs) to understand complex modality-interleaved instructions and in-context examples and conduct zero-shot/few-shot multimodal generation.

Zineng Tang • 2 years ago

Trained on a large-scale generation dataset encompassing in-context multimodal instructions across text, vision, and audio, CoDi-2 can follow interleaved in-context text-audio-vision prompts and jointly generate multiple modalities in a zero-shot/few-shot manner.

Zineng Tang • 2 years ago

CoDi-2 also demonstrates a wide range of zero-shot abilities for image generation, such as reasoning, compositionality, instruction editing, exemplar learning, and subject-driven generation.

Zineng Tang • 2 years ago

CoDi-2 also demonstrates zero-shot/few-shot abilities for audio generation with intricate prompting like instruction editing and exemplar learning.

Zineng Tang • 2 years ago

CoDi-2 surpasses previous domain-specific models on tasks such as subject-driven image generation, vision transformation, and audio editing.

Zineng Tang • 2 years ago

Overall, CoDi-2 signifies a substantial breakthrough in developing a comprehensive multimodal foundation model adept at interpreting in-context language-vision-audio interleaved instructions and producing multimodal outputs (in a zero/few-shot way). @berkeley_ai @uncnlp @MSFTResearch

Zineng Tang • 2 years ago

As a reminder, CoDi-1 will be presented at #NeurIPS2023. Happy to chat about CoDi-1 and CoDi-2 in New Orleans! ->

Connor Shorten • 2 years ago

@yzy_ai @nlpyang @ChenguangZhu2 @mohitban47 generative-CoDi Weaviate module coming soon? @ZainHasan6 @antas_marcin 👀🔥

Wenmeng Zhou • 2 years ago

@yzy_ai @nlpyang @ChenguangZhu2 @mohitban47 Cool! Multi-modality in and out is the future, I believe, but it seems the future is coming now.

SGM • 2 years ago

@yzy_ai @nlpyang @ChenguangZhu2 @mohitban47 @ZinengTang Will this product have a version that people can use too, e.g. through a web UI? It could be very helpful to a lot of people for sound effects. Perhaps you could release a version of it with freemium and paid plans.
