How do we build multimodal systems that work effectively across the globe? 🌍 Today we release the Aya Vision Technical Report, the detailed recipe behind Aya Vision models, unifying state-of-the-art multilingual capabilities in multimodal and text tasks across 23 languages!

15,561 views • 1 year ago • via X (Twitter)

9 Comments

Cohere Labs • 1 year ago

Our 8B model is best-in-class for its size, outperforming models like Pixtral-12B and Pangea-7B. The compact Aya Vision-32B pushes efficiency further, outperforming models >2x larger like Llama3.2-90B & Molmo-72B, setting a new Pareto frontier in multilingual multimodal AI. 💪

Cohere Labs • 1 year ago

How do we build strong multimodal models for many languages when high-quality multimodal multilingual data is almost non-existent? We develop a novel synthetic annotation framework creating rich, human-preferred multimodal data in 23 languages! ✅

Cohere Labs • 1 year ago

Adding vision often degrades text-only skills (catastrophic forgetting!), especially across languages. 📉 Our novel cross-modal model merging technique fuses the original text LLM with the multimodal model, preserving text abilities and boosting multimodal win-rates! 🤝
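The thread does not spell out the merging formula, so the following is only a minimal sketch of one common way to fuse two checkpoints: linear interpolation of the parameters they share. The function name `merge_weights`, the `alpha` parameter, and the toy parameter names are hypothetical and chosen for illustration; the technical report should be consulted for the actual method.

```python
def merge_weights(text_llm, multimodal_llm, alpha=0.5):
    """Linearly interpolate parameters shared by two checkpoints.

    Assumption (not from the thread): merging is a per-parameter
    weighted average. `alpha` is the weight on the multimodal model.
    Parameters found only in the multimodal model (for example a
    vision encoder) are copied over unchanged.
    """
    merged = {}
    for name, mm_values in multimodal_llm.items():
        if name in text_llm:
            txt_values = text_llm[name]
            if len(txt_values) != len(mm_values):
                raise ValueError(f"shape mismatch for parameter {name!r}")
            merged[name] = [
                (1 - alpha) * t + alpha * m
                for t, m in zip(txt_values, mm_values)
            ]
        else:
            merged[name] = list(mm_values)
    return merged


# Toy checkpoints: flat lists stand in for weight tensors.
text_only = {"llm.layer0": [0.0, 2.0]}
with_vision = {"llm.layer0": [1.0, 4.0], "vision.proj": [5.0]}
merged = merge_weights(text_only, with_vision, alpha=0.5)
```

With `alpha=0.5` the shared language-model weights land halfway between the two checkpoints, while the vision-only parameters are kept as they are. Moving `alpha` toward 0 favors the original text model and toward 1 favors the multimodal model.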

Cohere Labs • 1 year ago

Current multimodal evals often miss the mark. 🤔 Too rigid, prompt-sensitive, & English-only, they don't capture real-world nuances. We also introduce Aya Vision Bench! 📊 Our new benchmark focuses on human preference across 23 languages & 9 tasks for better MLLM evaluation. 🌍

Cohere Labs • 1 year ago

Putting it all together for Aya Vision: each of our innovations boosts Aya Vision’s performance, enabling SOTA results:
💡 Synthetic data framework → +17.2% win rate (reaching 58.1%)
🤝 Cross-modal merging → +11.9% (reaching 70.0%)
🚀 Scaling to 32B → +9.1% (reaching 79.1%)
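The quoted gains read as additive percentage points, which can be checked with a few lines of arithmetic. The sketch below assumes the three steps stack sequentially, as the "reaching" figures suggest. The starting win rate it derives is implied by the numbers and not stated in the thread.

```python
# (step, gain in percentage points, win rate reached), as quoted in the thread.
steps = [
    ("Synthetic data framework", 17.2, 58.1),
    ("Cross-modal merging", 11.9, 70.0),
    ("Scaling to 32B", 9.1, 79.1),
]


def implied_baseline(steps):
    """Win rate before the first step, derived rather than quoted."""
    _, first_gain, first_reached = steps[0]
    return round(first_reached - first_gain, 1)


def cumulative(steps, baseline):
    """Apply each gain in order and record the running win rate."""
    rate = baseline
    reached = []
    for name, gain, _ in steps:
        rate = round(rate + gain, 1)
        reached.append((name, rate))
    return reached


baseline = implied_baseline(steps)
running = cumulative(steps, baseline)
total_gain = round(sum(gain for _, gain, _ in steps), 1)
```

Under that assumption the running totals match the quoted 58.1%, 70.0%, and 79.1%, and the three gains sum to 38.2 points.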

Cohere Labs • 1 year ago

As promised, the Aya Vision Technical Report showcases our commitment to open science and completes the release of Aya Vision models and Aya Vision Bench. 🌍 📜 Paper link:

Cohere Labs • 1 year ago

Thank you to all authors: @TheyCallMeMr_, @YiyangNan, @johnamqdang, @aahmadian_, @singhshiviii, Madeline Smith, @bharatvenki, @vshmyhlo, @viraataryabumi, Walter Beller-Morales, Jeremy Pekmez, @TheOneKloud, @acyr_l, @nickfrosst, Phil Blunsom, @aidangomez, @1vnzh…

Cohere Labs • 1 year ago

…@mziizm, Manoj Govindassamy, @commit_xact, @mgalle, @beyzaermis, @ahmetustun89, and @sarahookr.

VistaShares • 1 year ago

The global AI sector is evolving rapidly, supported by advancements in technology and infrastructure. AIS offers targeted exposure to key players driving these developments.
