🔥Excited to introduce CoDi-2! It follows complex multimodal-interleaved in-context instructions to generate any modality (text, vision, audio) in a zero/few-shot, interactive way! Ziyi Yang Yang Liu Chenguang Zhu Mohit Bansal 🧵👇

Zineng Tang

1,510 subscribers

97,533 views • 2 years ago • via X (Twitter)

Education, Science & Technology

10 comments

Zineng Tang • 2 years ago

By aligning modalities with language for encoding and generation, CoDi-2 empowers Large Language Models (LLMs) to understand complex modality-interleaved instructions and in-context examples and conduct zero-shot/few-shot multimodal generation.

Zineng Tang • 2 years ago

Trained on a large-scale generation dataset encompassing in-context multi-modal instructions across text, vision, and audio, CoDi-2 can follow interleaved in-context text-audio-vision prompts and can zero-shot/few-shot jointly generate multiple modalities.

Zineng Tang • 2 years ago

CoDi-2 also demonstrates a wide range of zero-shot abilities for image generation like reasoning, compositionality, instruction editing, exemplar learning, and subject driven generation, etc.

Zineng Tang • 2 years ago

CoDi-2 also demonstrates zero-shot/few-shot abilities for audio generation with intricate prompting like instruction editing and exemplar learning.

Zineng Tang • 2 years ago

CoDi-2 surpasses previous domain-specific models on tasks such as subject-driven image generation, vision transformation, and audio editing.

Zineng Tang • 2 years ago

Overall, CoDi-2 signifies a substantial breakthrough in developing a comprehensive multimodal foundation model adept at interpreting in-context language-vision-audio interleaved instructions & producing multimodal outputs (in zero/few-shot way). @berkeley_ai @uncnlp @MSFTResearch

Zineng Tang • 2 years ago

As a reminder CoDi-1 will be presented at #NeurIPS2023, happy to chat about CoDi-1 and CoDi-2 in New Orleans! ->

Connor Shorten • 2 years ago

@yzy_ai @nlpyang @ChenguangZhu2 @mohitban47 generative-CoDi Weaviate module coming soon? @ZainHasan6 @antas_marcin 👀🔥

Wenmeng Zhou • 2 years ago

@yzy_ai @nlpyang @ChenguangZhu2 @mohitban47 Cool! Multimodality in and out is the future, I believe, but it seems the future is coming now.

SGM • 2 years ago

@yzy_ai @nlpyang @ChenguangZhu2 @mohitban47 @ZinengTang Will this product have a version that people can use too, e.g. through a web UI? It could be very helpful to a lot of people for sound effects. Perhaps you could release a version with freemium and paid plans.

Related videos

🖼️🎞️🔊📄Excited to introduce Composable Diffusion (CoDi), a new generative-AI foundation model that can take any combo of input modalities & generate any combo of output modalities (text, audio, image, video)! Ziyi Yang Chenguang Zhu Mohit Bansal 🧵👇 #CoDi

Zineng Tang

105,269 views • 3 years ago

Super excited to introduce Gemma 4 12B! 💎 - Multimodal: audio, image, video, and text input - Novel architecture: we removed the multimodal encoders for a unified, streamlined arch - New MacOS desktop app powered by LiteRT - MTP support Excited to see what you build with it!

Omar Sanseviero

124,712 views • 13 days ago

GLM-4.6V can accept multimodal inputs of various types and automatically generate high-quality, structured image-text interleaved content.

Z.ai

13,655 views • 6 months ago

We’re excited to introduce Chai-2, a major breakthrough in molecular design. Chai-2 enables zero-shot antibody discovery in a 24-well plate, exceeding previous SOTA by >100x. Thread👇

Chai Discovery

708,522 views • 11 months ago

Vision-language models can control robots, but what if the prompt is too complex for the robot to follow directly? We developed a way to get robots to “think through” complex instructions, feedback, and interjections. We call it the Hierarchical Interactive Robot (Hi Robot).

Physical Intelligence

116,845 views • 1 year ago

Idefics3-Llama is out! 💥 It's a multimodal model based on Llama 3.1 that accepts arbitrary number of interleaved images with text with a huge context window (10k tokens!) 😍 Link to demo and model in the next one 😏

merve

28,014 views • 1 year ago

Excited to introduce Reka Vision, an agentic visual understanding and search platform. Transform your unstructured multimodal data into insights and actions.

Reka

485,861 views • 11 months ago

Working on multimodal instruction tuning and finding it hard to scale? Building Web/GUI agents but data is too narrow? Introducing 🚀MultiUI: 7.3M multimodal instructions from 1M webpage UIs, offering diverse data to boost text-rich visual understanding. Key takeaways: 🌟WebUI-trained models show major gains in visual web understanding and agent tasks. 💻 🌟Models also generalize well to non-UI tasks like DocVQA/OCR. 📄 How it works: We generate multimodal instructions with a text LLM using structured text from webpage accessibility trees. We then pair them with UI screenshots, to train multimodal models. Homepage: Paper: Dataset: Model: Congrats to the student lead Junpeng Liu and the team Tianyue Ou Yifan Song Yuxiao Qu Chenyan Xiong Wenhu Chen Graham Neubig ! More details are in the following threads ⬇️

Xiang Yue

57,680 views • 1 year ago

VITA Towards Open-Source Interactive Omni Multimodal LLM discuss: The remarkable multimodal capabilities and interactive experience of GPT-4o underscore their necessity in practical applications, yet open-source models rarely excel in both areas. In this paper, we introduce VITA, the first-ever open-source Multimodal Large Language Model (MLLM) adept at simultaneous processing and analysis of Video, Image, Text, and Audio modalities, and meanwhile has an advanced multimodal interactive experience. Starting from Mixtral 8x7B as a language foundation, we expand its Chinese vocabulary followed by bilingual instruction tuning. We further endow the language model with visual and audio capabilities through two-stage multi-task learning of multimodal alignment and instruction tuning. VITA demonstrates robust foundational capabilities of multilingual, vision, and audio understanding, as evidenced by its strong performance across a range of both unimodal and multimodal benchmarks. Beyond foundational capabilities, we have made considerable progress in enhancing the natural multimodal human-computer interaction experience. To the best of our knowledge, we are the first to exploit non-awakening interaction and audio interrupt in MLLM. VITA is the first step for the open-source community to explore the seamless integration of multimodal understanding and interaction. While there is still lots of work to be done on VITA to get close to close-source counterparts, we hope that its role as a pioneer can serve as a cornerstone for subsequent research.

AK

23,958 views • 1 year ago

We’re excited to announce 𝗚𝗲𝗺𝗶𝗻𝗶: Google’s largest and most capable AI model. Built to be natively multimodal, it can understand and operate across text, code, audio, image and video - and achieves state-of-the-art performance across many tasks. 🧵

Google DeepMind

1,315,220 views • 2 years ago

One of the greatest actors of his generation, #JacksonYee has been widely praised for his new action film, which follows a security team investigating an infiltrated "spy," alongside Zhu Yilong, Song Jia, Yang Mi, Liu Shishi, and Liu Yaowen in the cast.

Blossom ☂️ C-drama

19,640 views • 3 months ago

SIMA 2 is our most capable AI agent for virtual 3D worlds. 👾🌐 Powered by Gemini, it goes beyond following basic instructions to think, understand, and take actions in interactive environments – meaning you can talk to it through text, voice, or even images. Here’s how 🧵

Google DeepMind

1,919,651 views • 7 months ago

Multimodality: using #AI to understand & generate content across text, images, & audio. It can break down language barriers, making tech more inclusive. Which multimodal function do you find most impactful? Share your thoughts below 👇

Google AI

36,204 views • 1 year ago

Instead of asking a VLM to output progress, it reads the model’s internal belief directly from token logits. No in-context learning. No fine-tuning. No reward training. 📈 We introduce: TOPReward, a zero-shot reward modeling approach for robotics using token probabilities from pretrained video VLMs. The simplest way of doing reward modelling for robotics! Project: 🧵👇

Jiafei Duan

107,760 views • 3 months ago

Today I'm launching Omnipilot! 🚀🚀🚀 It's an AI copilot that autocompletes text EVERYWHERE on MacOS. In every app!!! Another "copilot"???? This one is different, I swear! Omnipilot gathers context ambiently, and uses it to generate text in Notes, Gmail or Xcode Link: 🧵

Michael Jelly 🏴‍☠️

28,917 views • 2 years ago

We’re excited to introduce Tutorials, a new section in the Framer Marketplace. If you’ve mastered a complex effect or discovered a faster way to build layouts, you now have a dedicated place to share it

Framer

36,990 views • 4 months ago

Super excited for the release of Robot Utility Models (RUMs)! RUMs is a simple method to build zero-shot robot policies that can solve useful tasks in completely new homes without any additional training often at 90%+ success rate. 🧵👇

Lerrel Pinto

56,591 views • 1 year ago

Google Opal is vastly underrated but can replace n8n or Make in many situations You can create powerful AI workflows for free and with a single prompt: - Ask the user for any input - Run multiple deep research simultaneously - Add any context you want (YT video, text, docs...) - Use tools like web, Maps, code execution - Synthesize with Gemini 2.5 Pro - Generate results in any format I created this agent to prepare meetings in literally 10 minutes. Tip: make sure to use the microphone feature to give as much context as possible to Opal... So it can generate the workflow almost in one-shot!

Paul Couvert

67,365 views • 7 months ago

KLING O3 JUST LANDED IN HEDRA – DAY 0 🔥 Unified multimodal monster: 15s videos, native audio baked in, 10+ refs, text/image/video-to-video, insane consistency & creative control. Hedra got it first. You got it now.

Hedra

12,375 views • 4 months ago

Unstoppable.CLAY We're excited to announce that we have partnered with Unstoppable Domains to introduce .CLAY domains! 🧵👇

Clay Nation

30,316 views • 2 years ago