How do we build multimodal systems that work effectively across the globe? 🌍 Today we release the Aya Vision Technical Report, the detailed recipe behind Aya Vision models, unifying state-of-the-art multilingual capabilities in multimodal and text tasks across 23 languages!

Cohere Labs

26,712 subscribers

15,572 views • 1 year ago • via X (Twitter)

Science & Technology News & Politics Education

9 Comments

Cohere Labs • 1 year ago

Our 8B model is best-in-class for its size, outperforming models like Pixtral-12B and Pangea-7B. The compact Aya Vision-32B pushes efficiency further, outperforming models more than 2x larger, like Llama3.2-90B & Molmo-72B, and setting a new Pareto frontier in multilingual multimodal AI. 💪

Cohere Labs • 1 year ago

How do you build strong multimodal models for many languages when high-quality multimodal multilingual data is almost non-existent? We develop a novel synthetic annotation framework creating rich, human-preferred multimodal data in 23 languages! ✅

Cohere Labs • 1 year ago

Adding vision often degrades text-only skills (catastrophic forgetting!), especially across languages. 📉 Our novel cross-modal model merging technique fuses the original text LLM with the multimodal model, preserving text abilities and boosting multimodal win-rates! 🤝
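The post does not say how the two models are fused. As a rough illustration only, the sketch below assumes the simplest form of weight merging: linear interpolation of the parameters the text LLM and the multimodal model share, with multimodal-only parameters (for example a vision encoder) copied over unchanged. The function name `merge_weights`, the mixing weight `alpha`, and the toy parameter names are hypothetical and are not taken from the Aya Vision report.

```python
from typing import Dict, List

# Toy stand-in for a checkpoint: parameter name -> flat list of weights.
Weights = Dict[str, List[float]]


def merge_weights(text_llm: Weights, multimodal: Weights, alpha: float = 0.5) -> Weights:
    """Linearly interpolate shared parameters (assumed merging rule).

    alpha = 0.0 keeps the text LLM's weights, alpha = 1.0 keeps the
    multimodal model's weights. Parameters that exist only in the
    multimodal model are copied unchanged.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")

    merged: Weights = {}
    for name, mm_values in multimodal.items():
        if name in text_llm:
            txt_values = text_llm[name]
            if len(txt_values) != len(mm_values):
                raise ValueError(f"shape mismatch for parameter {name!r}")
            merged[name] = [
                (1.0 - alpha) * t + alpha * m
                for t, m in zip(txt_values, mm_values)
            ]
        else:
            # No text-only counterpart (e.g. vision encoder): keep as-is.
            merged[name] = list(mm_values)
    return merged


# Hypothetical toy checkpoints for illustration.
text_only = {"decoder.layer0": [0.0, 2.0]}
vision_language = {"decoder.layer0": [2.0, 4.0], "vision.proj": [1.0]}
merged = merge_weights(text_only, vision_language, alpha=0.5)
```

In a scheme like this, a smaller `alpha` leans toward the original text model, which is the intuition behind preserving text abilities. The actual merging rule and weighting are described in the technical report, not here.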

Cohere Labs • 1 year ago

Current multimodal evals often miss the mark. 🤔 Too rigid, prompt-sensitive, & English-only, they don't capture real-world nuances. We also introduce Aya Vision Bench! 📊 Our new benchmark focuses on human preference across 23 languages & 9 tasks for better MLLM evaluation. 🌍

Cohere Labs • 1 year ago

Putting it all together for Aya Vision: each of our innovations boosts Aya Vision’s win rate, enabling SOTA performance: 💡 Synthetic data framework → +17.2% win rate (reaching 58.1%) 🤝 Cross-modal merging → +11.9% (reaching 70.0%) 🚀 Scaling to 32B → +9.1% (reaching 79.1%)
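The three reported gains can be checked against the reported totals with a little arithmetic. The sketch below assumes the gains are additive percentage points applied in the order listed. Under that assumption the starting win rate works out to 58.1 − 17.2 = 40.9%, a figure the post does not state explicitly. The helper name `running_totals` is made up for this example.

```python
from typing import List, Tuple

# (step name, reported gain in points, reported win rate after the step)
STEPS: List[Tuple[str, float, float]] = [
    ("Synthetic data framework", 17.2, 58.1),
    ("Cross-modal merging", 11.9, 70.0),
    ("Scaling to 32B", 9.1, 79.1),
]


def running_totals(steps: List[Tuple[str, float, float]]) -> Tuple[float, List[float]]:
    """Return the implied baseline and the cumulative win rate after each step."""
    first_gain, first_reported = steps[0][1], steps[0][2]
    baseline = round(first_reported - first_gain, 1)  # implied, not stated in the post
    totals = []
    current = baseline
    for _name, gain, _reported in steps:
        current = round(current + gain, 1)
        totals.append(current)
    return baseline, totals


baseline, totals = running_totals(STEPS)
# True only if every cumulative total matches the figure reported in the post.
consistent = all(abs(t - reported) < 0.05 for t, (_, _, reported) in zip(totals, STEPS))
```

The cumulative totals agree with the figures in the post, so the three gains sum to 38.2 points over the implied baseline.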

Cohere Labs • 1 year ago

As promised, the Aya Vision Technical Report showcases our commitment to open-science, and completes the release of Aya Vision models and Aya Vision Bench. 🌍 📜Paper link:

Cohere Labs • 1 year ago

Thank you to all authors: @TheyCallMeMr_, @YiyangNan, @johnamqdang , @aahmadian_, @singhshiviii, Madeline Smith, @bharatvenki, @vshmyhlo, @viraataryabumi, Walter Beller-Morales, Jeremy Pekmez, @TheOneKloud, @acyr_l , @nickfrosst, Phil Blunsom, @aidangomez, @1vnzh…

Cohere Labs • 1 year ago

…@mziizm, Manoj Govindassamy, @commit_xact, @mgalle, @beyzaermis, @ahmetustun89, and @sarahookr.

Related Videos

Introducing SeamlessM4T, the first all-in-one, multilingual multimodal translation model. This single model can perform tasks across speech-to-text, speech-to-speech, text-to-text translation & speech recognition for up to 100 languages depending on the task. Details ⬇️

AI at Meta

592,762 views • 2 years ago

VITA Towards Open-Source Interactive Omni Multimodal LLM discuss: The remarkable multimodal capabilities and interactive experience of GPT-4o underscore their necessity in practical applications, yet open-source models rarely excel in both areas. In this paper, we introduce VITA, the first-ever open-source Multimodal Large Language Model (MLLM) adept at simultaneous processing and analysis of Video, Image, Text, and Audio modalities, and meanwhile has an advanced multimodal interactive experience. Starting from Mixtral 8x7B as a language foundation, we expand its Chinese vocabulary followed by bilingual instruction tuning. We further endow the language model with visual and audio capabilities through two-stage multi-task learning of multimodal alignment and instruction tuning. VITA demonstrates robust foundational capabilities of multilingual, vision, and audio understanding, as evidenced by its strong performance across a range of both unimodal and multimodal benchmarks. Beyond foundational capabilities, we have made considerable progress in enhancing the natural multimodal human-computer interaction experience. To the best of our knowledge, we are the first to exploit non-awakening interaction and audio interrupt in MLLM. VITA is the first step for the open-source community to explore the seamless integration of multimodal understanding and interaction. While there is still lots of work to be done on VITA to get close to close-source counterparts, we hope that its role as a pioneer can serve as a cornerstone for subsequent research.

AK

23,958 views • 1 year ago

We’re excited to announce 𝗚𝗲𝗺𝗶𝗻𝗶: Google’s largest and most capable AI model. Built to be natively multimodal, it can understand and operate across text, code, audio, image and video - and achieves state-of-the-art performance across many tasks. 🧵

Google DeepMind

1,315,698 views • 2 years ago

Explore state-of-the-art multimodal prompting in our new short course Large Multimodal Model Prompting with Gemini, taught by Erwin Huizenga in collaboration with Google Cloud. One interesting insight from this course: with multimodal models, prompt structure matters significantly. Placing text inputs, such as a patient's medical history, before image inputs, like an X-ray, can enhance the model's ability to contextualize and interpret visual data effectively. In other contexts, such as image captioning, you may get better results by putting the image first. Multimodal models behave differently than text-only LLMs, and effective prompting for models varies depending on the model you’re using. In this course you’ll learn how to effectively prompt Gemini models. Gemini's multimodal capabilities also enable new approaches in AI application development, for example: - The Gemini library handles various video formats (MP4, MOV, MPEG), streamlining applications using these formats. - Large context window (up to 1 million tokens) enables processing of extensive content, like analyzing multiple 50-minute videos simultaneously. - Function calling feature integrates real-time data (e.g., current exchange rates) into model responses. The course demonstrates building multimodal applications with real-world examples including document analyzers that reason across text and graphs simultaneously, video content extractors that find and timestamp specific information from multiple hours of footage, and automated expense report systems processing receipt images while cross-referencing company policies. Sign up here:

Andrew Ng

74,060 views • 1 year ago

🚀 Introducing Emu3.5 — a large-scale multimodal world model that natively predicts the next vision-language state. 🔥 Trained on over 10T interleaved vision-language tokens and enhanced with reinforcement learning, Emu3.5 achieves powerful multimodal reasoning and generation. ⚡ Powered by our new Discrete Diffusion Adaptation (DiDA) for 20× faster inference. 🔥 Emu3.5 outperforms Nano Banana across image generation, editing, interleaved tasks and more. 🌍 Explore Emu3.5: Github: #Emu3 #MultimodalAI #WorldModel #NextTokenPrediction

BAAI

51,880 views • 9 months ago

Demis' dream digital coworker vision: an AI assistant "useful every moment of your life" for both work and leisure. It follows you across all devices - computer, phone, even smart glasses - understanding the physical world around you through multimodal capabilities.

Rowan Cheung

22,168 views • 8 months ago

[1/n] Do distinct large models admit a simple map that aligns their embedding spaces? We show that across multimodal contrastive models—trained on different data and architectures—an orthogonal map aligns image embeddings. Strikingly, the same map also aligns text embeddings.

Sharut Gupta

37,310 views • 5 months ago

How does an AI model actually learn to see? 🤖 Learn about the tech behind native multimodality, how models reason over visual data like documents and video, and the future of proactive AI assistants with Logan Kilpatrick and Gemini Model Behavior Product Lead, Ani Baddepudi. ↓ Timestamps: 01:12 Why Gemini is natively multimodal 02:23 The technology behind multimodal models 05:15 Video understanding with Gemini 2.5 09:25 Deciding what to build next 13:23 Building new product experiences with multimodal AI 17:15 The vision for proactive assistants 24:13 Improving video usability with variable FPS and frame tokenization 27:35 What’s next for Gemini’s multimodal development 31:47 Deep dive on Gemini’s document understanding capabilities 37:56 The teamwork and collaboration behind Gemini 40:56 What’s next with model behavior

Google AI

58,703 views • 1 year ago

Gemini 3 Pro is the best model in the world for multimodal understanding. One of its most exciting capabilities is document understanding and reasoning. This means you can convert information in any format and into the medium that works best for you. Gemini 3 also has leading multilingual capabilities, enabling it to process, reason and even capture cultural relevance across a variety of languages. For example, here Gemini 3 is translating handwritten recipes in Korean and English to build a digital family cookbook in different languages.

Google AI

36,696 views • 8 months ago

We benchmarked leading multimodal foundation models (GPT-4o, Claude 3.5 Sonnet, Gemini, Llama, etc.) on standard computer vision tasks—from segmentation to surface normal estimation—using standard datasets like COCO and ImageNet. These models have made remarkable progress; however, it is unclear exactly where they stand in terms of understanding vision in detail, especially when it comes to tasks beyond question-answering. How well do they understand an object's segments or geometry? Our analyses yield an assessment that is quantitatively and qualitatively detailed and is compatible with evaluations developed in the field of computer vision over the past decades. Observed trends: 🔹 The foundation models consistently underperform task-specific SOTA models across all tasks. However, they are respectable generalists, which is remarkable as they are presumably trained primarily on image-text-based tasks. 🔹 They perform semantic tasks notably better than geometric ones. 🔹 GPT-4o performs the best among non-reasoning models, getting the top position in 4 out of 6 tasks. 🔹 Reasoning models, e.g., o3, show improvements in geometric tasks. 🔹 The 'image generation' models, e.g., GPT-4o Image Generation, which have been natively trained multimodally, exhibit quirks, e.g., hallucinated objects, misalignment between the input and output, etc. 🔹 While the prompting techniques affect performance, better models exhibit less sensitivity to variations in prompts. We control for the variance introduced by the prompting methods in our experiments. 🌐 Detailed analyses, visualizations: ⌨️ code: 🧵 1/n

Amir Zamir

73,074 views • 1 year ago

This International Day of Sign Languages, we celebrate the language that makes millions of people across the globe feel included!

Axis Bank

570,065 views • 10 months ago

Meet Molmo: a family of open, state-of-the-art multimodal AI models. Our best model outperforms proprietary systems, using 1000x less data. Molmo doesn't just understand multimodal data—it acts on it, enabling rich interactions in both the physical and virtual worlds. Try it for yourself:

Ai2

515,522 views • 1 year ago

Working on multimodal instruction tuning and finding it hard to scale? Building Web/GUI agents but data is too narrow? Introducing 🚀MultiUI: 7.3M multimodal instructions from 1M webpage UIs, offering diverse data to boost text-rich visual understanding. Key takeaways: 🌟WebUI-trained models show major gains in visual web understanding and agent tasks. 💻 🌟Models also generalize well to non-UI tasks like DocVQA/OCR. 📄 How it works: We generate multimodal instructions with a text LLM using structured text from webpage accessibility trees. We then pair them with UI screenshots, to train multimodal models. Homepage: Paper: Dataset: Model: Congrats to the student lead Junpeng Liu and the team Tianyue Ou Yifan Song Yuxiao Qu Chenyan Xiong Wenhu Chen Graham Neubig ! More details are in the following threads ⬇️

Xiang Yue

57,699 views • 1 year ago

This is Gemini 3: our most intelligent model that helps you learn, build and plan anything. It comes with state-of-the-art reasoning capabilities, world-leading multimodal understanding, and enables new agentic coding experiences. 🧵

Google DeepMind

1,693,548 views • 8 months ago

We are releasing 4M-21 with a permissive license, including its source code and trained models. It's a pretty effective multimodal model that solves 10s of tasks & modalities. See the demo code, sample results, and the tokenizers of diverse modalities on the website. IMO, the multitask learning aspect of multimodal models has really taken a step forward. We can train a single model on many diverse tasks with ~SOTA accuracy. But a long way to go in terms of transfer/emergence. 🌐 ⌨️ Joint work w/ EPFL Apple.

Amir Zamir

69,564 views • 2 years ago

Empowering Perspectives, Amplifying Voices: Today, we celebrate the power of art to inspire, evoke emotions, and unite souls across the globe. Happy World Art Day! #WorldArtDay #AccessMore

Access Bank Plc

13,555 views • 2 years ago

LLaVA-3D A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness Recent advancements in Large Multimodal Models (LMMs) have greatly enhanced their proficiency in 2D visual understanding tasks, enabling them to effectively process and understand images and videos. However, the development of LMMs with 3D-awareness for 3D scene understanding has been hindered by the lack of large-scale 3D vision-language datasets and powerful 3D encoders. In this paper, we introduce a simple yet effective framework called LLaVA-3D. Leveraging the strong 2D understanding priors from LLaVA, our LLaVA-3D efficiently adapts LLaVA for 3D scene understanding without compromising 2D understanding capabilities. To achieve this, we employ a simple yet effective representation, 3D Patch, which connects 2D CLIP patch features with their corresponding positions in 3D space. By integrating the 3D Patches into 2D LMMs and employing joint 2D and 3D vision-language instruction tuning, we establish a unified architecture for both 2D image understanding and 3D scene understanding. Experimental results show that LLaVA-3D converges 3.5x faster than existing 3D LMMs when trained on 3D vision-language datasets. Moreover, LLaVA-3D not only achieves state-of-the-art performance across various 3D tasks but also maintains comparable 2D image understanding and vision-language conversation capabilities with LLaVA.

AK

41,736 views • 1 year ago