🔥Excited to introduce CoDi-2! It follows complex multimodal-interleaved in-context instructions to generate any modality (text, vision, audio) in a zero/few-shot interactive way! Ziyi Yang, Yang Liu, Chenguang Zhu, Mohit Bansal 🧵👇

Zineng Tang

1,510 subscribers

97,533 views • 2 years ago • via X (Twitter)

Education, Science & Technology

10 comments

Zineng Tang • 2 years ago

By aligning modalities with language for encoding and generation, CoDi-2 empowers Large Language Models (LLMs) to understand complex modality-interleaved instructions and in-context examples and conduct zero-shot/few-shot multimodal generation.

Zineng Tang • 2 years ago

Trained on a large-scale generation dataset encompassing in-context multi-modal instructions across text, vision, and audio, CoDi-2 can follow interleaved in-context text-audio-vision prompts and can zero-shot/few-shot jointly generate multiple modalities.

Zineng Tang • 2 years ago

CoDi-2 also demonstrates a wide range of zero-shot abilities for image generation like reasoning, compositionality, instruction editing, exemplar learning, and subject driven generation, etc.

Zineng Tang • 2 years ago

CoDi-2 also demonstrates zero-shot/few-shot abilities for audio generation with intricate prompting like instruction editing and exemplar learning.

Zineng Tang • 2 years ago

CoDi-2 surpasses previous domain-specific models on tasks such as subject-driven image generation, vision transformation, and audio editing.

Zineng Tang • 2 years ago

Overall, CoDi-2 signifies a substantial breakthrough in developing a comprehensive multimodal foundation model adept at interpreting in-context language-vision-audio interleaved instructions & producing multimodal outputs (in a zero/few-shot way). @berkeley_ai @uncnlp @MSFTResearch

Zineng Tang • 2 years ago

As a reminder, CoDi-1 will be presented at #NeurIPS2023. Happy to chat about CoDi-1 and CoDi-2 in New Orleans! ->

Connor Shorten • 2 years ago

@yzy_ai @nlpyang @ChenguangZhu2 @mohitban47 generative-CoDi Weaviate module coming soon? @ZainHasan6 @antas_marcin 👀🔥

Wenmeng Zhou • 2 years ago

@yzy_ai @nlpyang @ChenguangZhu2 @mohitban47 Cool! Multi-modality in and out is the future, I believe, but it seems the future is coming now.

SGM • 2 years ago

@yzy_ai @nlpyang @ChenguangZhu2 @mohitban47 @ZinengTang Will this product have a version that people can use too, e.g. through a web UI? It could be very helpful to a lot of people for sound effects. Perhaps you guys can release a version of it with freemium + paid plans.
