Loading video...

Video Failed to Load

There was a problem loading this video. This could be due to a temporary network issue or the video might be unavailable.

Llama 3.2 features 11B & 90B models, our first multimodal Llama models with support for vision tasks. These models can take in both image and text prompts to deeply understand and reason on inputs.

AI at Meta

810,313 subscribers

121,530 views • 1 year ago •via X (Twitter)

Science & Technology Education News & Politics

Anya Rossi• Live Now

Private livecam show

10 Comments

AI at Meta1 year ago

Details on Llama 3.2 11B & 90B vision models — and the full collection of new Llama models ⬇️

Prashant1 year ago

can we get a standalone Meta AI app for Llama models. lot of people are unable to use it within Whatsapp, instagram or Messenger.

Ramon Guthrie1 year ago

Are these 11B & 90B models available in @ollama or @LMStudioAI yet?

Sourav Boyal1 year ago

When will I get access in India

Laurence Bremner1 year ago

Llama 3.2 is better than I expected by a long way

txh1 year ago

is this similar to openai CLIP where I can do embeds with it?

🐧 lalo adrian morales 𝕏1 year ago

you guys rule! thanks!

GPT.Biz1 year ago

Llama 3.2 sounds amazing! Love seeing the evolution into multimodal capabilities—excited for the new possibilities in vision and text understanding

VESSL AI1 year ago

💪🏻 VESSL supports Llama 3.2 now!

UA301 year ago

As AI revolutionizes the world in 2024, Facebook’s failure to offer support for deactivated accounts highlights a major gap in tech accountability.” #AIDrivenFuture #FacebookProblems

Related Videos

With Llama 3.2 we released our first-ever lightweight Llama models: 1B & 3B. These models empower developers to build personalized, on-device agentic applications with capabilities like summarization, tool use and RAG where data never leaves the device.

With Llama 3.2 we released our first-ever lightweight Llama models: 1B & 3B. These models empower developers to build personalized, on-device agentic applications with capabilities like summarization, tool use and RAG where data never leaves the device.

AI at Meta

153,348 views • 1 year ago

New short course: Prompt Engineering with Llama 2, built in collaboration with Meta AI at Meta, and taught by Amit Sangani! Meta's Llama 2 has been game-changing for AI. Building with open source lets you control your own data, scrutinize errors, update (or not) the models as you please, and work alongside the global community advancing open models. Llama isn't a single model, it's a collection of models. In this course, you'll: - Learn the differences between different Llama 2 flavors, and when to use each. - Prompt the Llama chat models -- you'll also see how Llama's instruction tags work -- so they can help you with day-to-day tasks, like writing or summarization. - Use advanced prompting, like few-shot prompting for classification, and chain-of-thought prompting for solving logic problems. - Use specialized models in the Llama collection for specific tasks, like Code Llama to help you write, analyze, and improve code, and Llama Guard, which checks prompts and model responses for harmful content. The course also touches on how to run Llama 2 locally on your own computer. I hope you’ll take this course and try out these powerful, open models!

New short course: Prompt Engineering with Llama 2, built in collaboration with Meta AI at Meta, and taught by Amit Sangani! Meta's Llama 2 has been game-changing for AI. Building with open source lets you control your own data, scrutinize errors, update (or not) the models as you please, and work alongside the global community advancing open models. Llama isn't a single model, it's a collection of models. In this course, you'll: - Learn the differences between different Llama 2 flavors, and when to use each. - Prompt the Llama chat models -- you'll also see how Llama's instruction tags work -- so they can help you with day-to-day tasks, like writing or summarization. - Use advanced prompting, like few-shot prompting for classification, and chain-of-thought prompting for solving logic problems. - Use specialized models in the Llama collection for specific tasks, like Code Llama to help you write, analyze, and improve code, and Llama Guard, which checks prompts and model responses for harmful content. The course also touches on how to run Llama 2 locally on your own computer. I hope you’ll take this course and try out these powerful, open models!

Andrew Ng

162,798 views • 2 years ago

"Introducing Multimodal Llama 3.2": As promised two weeks ago, here's the short course on Meta's latest open model! This short course is created with Meta and taught by Amit Sangani, Director of AI Partner Engineering at Meta. Meta’s Llama family of models is leading the way in open models, allowing anyone to download, customize, fine-tune, or build new applications on top of them. Learn about the vision capabilities of the Llama 3.2, and use it for image classification, prompting, tokenization, tool-calling. You'll also learn about the open-source Llama stack, which gives building blocks for many different stages of the LLM application life cycle. In detail, you’ll: - Learn what are the features of Meta's four newest models, and when to use which Llama model. - Learn best practices for multimodal prompting, with applications to advanced image reasoning, illustrated by many examples: Understanding errors on a car dashboard, adding up the total of photographed restaurant receipts, grading written math homework. - Use different roles—system, user, assistant, ipython—in the Llama 3.1 and 3.2 models and the prompt format that identifies those roles. - Understand how Llama uses the tiktoken tokenizer, and how it has expanded to a 128k vocabulary size that improves encoding efficiency and multilingual support. - Learn how to prompt Llama to call built-in and custom tools (functions) with examples for web search and solving math equations. - Learn about Llama Stack, a standardized interface for common toolchain components like fine-tuning or synthetic data generation, useful for building agentic applications. By the end of this course, you’ll be equipped to build out new applications with the new Llama 3.2. Thank you to Ahmad Al-Dahle, Amit Sangani, and the whole AI at Meta team AI at Meta for all the hard work on Llama 3.2 — we’re excited to make these open models even more accessible to more developers with this new course! Please sign up here!

"Introducing Multimodal Llama 3.2": As promised two weeks ago, here's the short course on Meta's latest open model! This short course is created with Meta and taught by Amit Sangani, Director of AI Partner Engineering at Meta. Meta’s Llama family of models is leading the way in open models, allowing anyone to download, customize, fine-tune, or build new applications on top of them. Learn about the vision capabilities of the Llama 3.2, and use it for image classification, prompting, tokenization, tool-calling. You'll also learn about the open-source Llama stack, which gives building blocks for many different stages of the LLM application life cycle. In detail, you’ll: - Learn what are the features of Meta's four newest models, and when to use which Llama model. - Learn best practices for multimodal prompting, with applications to advanced image reasoning, illustrated by many examples: Understanding errors on a car dashboard, adding up the total of photographed restaurant receipts, grading written math homework. - Use different roles—system, user, assistant, ipython—in the Llama 3.1 and 3.2 models and the prompt format that identifies those roles. - Understand how Llama uses the tiktoken tokenizer, and how it has expanded to a 128k vocabulary size that improves encoding efficiency and multilingual support. - Learn how to prompt Llama to call built-in and custom tools (functions) with examples for web search and solving math equations. - Learn about Llama Stack, a standardized interface for common toolchain components like fine-tuning or synthetic data generation, useful for building agentic applications. By the end of this course, you’ll be equipped to build out new applications with the new Llama 3.2. Thank you to Ahmad Al-Dahle, Amit Sangani, and the whole AI at Meta team AI at Meta for all the hard work on Llama 3.2 — we’re excited to make these open models even more accessible to more developers with this new course! Please sign up here!

Andrew Ng

131,606 views • 1 year ago

We benchmarked leading multimodal foundation models (GPT-4o, Claude 3.5 Sonnet, Gemini, Llama, etc.) on standard computer vision tasks—from segmentation to surface normal estimation—using standard datasets like COCO and ImageNet. These models have made remarkable progress; however, it is unclear exactly where they stand in terms of understanding vision in detail. Especially when it comes to tasks beyond question-answering. How well do they understand an object's segments or geometry? Our analyses yield an assessment that is quantitatively and qualitatively detailed and is compatible with evaluations developed in the field of computer vision over the past decades. Observed trends: 🔹 The foundation models consistently underperform task-specific SOTA models across all tasks. However, they are respectable generalists, which is remarkable as they are presumably trained primarily on image-text-based tasks. 🔹 They perform semantic tasks notably better than geometric ones. 🔹 GPT-4o performs the best among non-reasoning models, getting the top position in 4 out of 6 tasks. 🔹 Reasoning models, e.g., o3, show improvements in geometric tasks. 🔹 The 'image generation' models, e.g., GPT-40 Image Generation, which have been natively trained multimodally, exhibit quirks. E.g., hallucinated objects, misalignment between the input and output, etc. 🔹 While the prompting techniques affect performance, better models exhibit less sensitivity to variations in prompts. We control for the variance introduced by the prompting methods in our experiments. 🌐 Detailed analyses, visualizations: ⌨️ code: 🧵 1/n

We benchmarked leading multimodal foundation models (GPT-4o, Claude 3.5 Sonnet, Gemini, Llama, etc.) on standard computer vision tasks—from segmentation to surface normal estimation—using standard datasets like COCO and ImageNet. These models have made remarkable progress; however, it is unclear exactly where they stand in terms of understanding vision in detail. Especially when it comes to tasks beyond question-answering. How well do they understand an object's segments or geometry? Our analyses yield an assessment that is quantitatively and qualitatively detailed and is compatible with evaluations developed in the field of computer vision over the past decades. Observed trends: 🔹 The foundation models consistently underperform task-specific SOTA models across all tasks. However, they are respectable generalists, which is remarkable as they are presumably trained primarily on image-text-based tasks. 🔹 They perform semantic tasks notably better than geometric ones. 🔹 GPT-4o performs the best among non-reasoning models, getting the top position in 4 out of 6 tasks. 🔹 Reasoning models, e.g., o3, show improvements in geometric tasks. 🔹 The 'image generation' models, e.g., GPT-40 Image Generation, which have been natively trained multimodally, exhibit quirks. E.g., hallucinated objects, misalignment between the input and output, etc. 🔹 While the prompting techniques affect performance, better models exhibit less sensitivity to variations in prompts. We control for the variance introduced by the prompting methods in our experiments. 🌐 Detailed analyses, visualizations: ⌨️ code: 🧵 1/n

Amir Zamir

73,074 views • 11 months ago

$I just added the new Llama 3.2 1B and 3B models to LitGPT, the open-source LLM library I help develop (focused on efficiency and code readability). LitGPT allows you to fine-tune and use these models on the cloud or a laptop. So, if you are looking for something to play with this weekend: # 1) Finetune the model litgpt finetune_lora meta-llama/Llama-3.2-1B \ --data JSON \ --data.json_path my_custom_dataset.json \ --train.epochs 1 \ --out_dir out/llama-3.2-finetuned \ --precision bf16-true # 2) Chat with the model litgpt chat out/llama-3.2-finetuned/final # 3) Serve the model via an API endpoint litgpt serve out/llama-3.2-finetuned/final$

I just added the new Llama 3.2 1B and 3B models to LitGPT, the open-source LLM library I help develop (focused on efficiency and code readability). LitGPT allows you to fine-tune and use these models on the cloud or a laptop. So, if you are looking for something to play with this weekend: # 1) Finetune the model litgpt finetune_lora meta-llama/Llama-3.2-1B \ --data JSON \ --data.json_path my_custom_dataset.json \ --train.epochs 1 \ --out_dir out/llama-3.2-finetuned \ --precision bf16-true # 2) Chat with the model litgpt chat out/llama-3.2-finetuned/final # 3) Serve the model via an API endpoint litgpt serve out/llama-3.2-finetuned/final

Sebastian Raschka

65,529 views • 1 year ago

Starting today, open source is leading the way. Introducing Llama 3.1: Our most capable models yet. Today we’re releasing a collection of new Llama 3.1 models including our long awaited 405B. These models deliver improved reasoning capabilities, a larger 128K token context window and improved support for 8 languages among other improvements. Llama 3.1 405B rivals leading closed source models on state-of-the-art capabilities across a range of tasks in general knowledge, steerability, math, tool use and multilingual translation. The models are available to download now directly from Meta or Hugging Face. With today’s release the ecosystem is also ready to go with 25+ partners rolling out our latest models — including Amazon Web Services, NVIDIA, Databricks, Groq Inc, Dell Technologies, Microsoft Azure and Google Cloud ready on day one. More details in the full announcement ➡️ Download Llama 3.1 models ➡️ With these releases we’re setting the stage for unprecedented new opportunities and we can’t wait to see the innovation our newest models will unlock across all levels of the AI community.

Starting today, open source is leading the way. Introducing Llama 3.1: Our most capable models yet. Today we’re releasing a collection of new Llama 3.1 models including our long awaited 405B. These models deliver improved reasoning capabilities, a larger 128K token context window and improved support for 8 languages among other improvements. Llama 3.1 405B rivals leading closed source models on state-of-the-art capabilities across a range of tasks in general knowledge, steerability, math, tool use and multilingual translation. The models are available to download now directly from Meta or Hugging Face. With today’s release the ecosystem is also ready to go with 25+ partners rolling out our latest models — including Amazon Web Services, NVIDIA, Databricks, Groq Inc, Dell Technologies, Microsoft Azure and Google Cloud ready on day one. More details in the full announcement ➡️ Download Llama 3.1 models ➡️ With these releases we’re setting the stage for unprecedented new opportunities and we can’t wait to see the innovation our newest models will unlock across all levels of the AI community.

AI at Meta

1,268,507 views • 1 year ago

Google just released PaliGemma 2 Mix: new versatile instruction vision language models 🔥 > Three new models: 3B, 10B, 28B with res 224, 448 💙 > Can do vision language tasks with open-ended prompts, understand documents, and segment or detect anything 🤯

Google just released PaliGemma 2 Mix: new versatile instruction vision language models 🔥 > Three new models: 3B, 10B, 28B with res 224, 448 💙 > Can do vision language tasks with open-ended prompts, understand documents, and segment or detect anything 🤯

merve

83,708 views • 1 year ago

I just created my own OCR app using Llama 3.2 vision! Upload an image, and it converts it into structured markdown using Llama 3.2 multimodal! Here's what I used: - Ollama for serving Llama 3.2 vision locally. - Streamlit for the UI. Everything in just 50 lines of code! Find the code in the next tweet. -- Every day, I share tutorials and insights on DS, ML, LLMs, and RAGs. Find me → Avi Chawla

I just created my own OCR app using Llama 3.2 vision! Upload an image, and it converts it into structured markdown using Llama 3.2 multimodal! Here's what I used: - Ollama for serving Llama 3.2 vision locally. - Streamlit for the UI. Everything in just 50 lines of code! Find the code in the next tweet. -- Every day, I share tutorials and insights on DS, ML, LLMs, and RAGs. Find me → Avi Chawla

Avi Chawla

131,139 views • 1 year ago

Multimodal Ichigo Llama 3.1 - Real Time Voice AI 🔥 > WhisperSpeech X Llama 3.1 8B > Trained on 50K hours of speech (7 languages) > Continually trained on 45hrs 10x A1000s > MLS -> WhisperVQ tokens -> Llama 3.1 > Instruction tuned on 1.89M samples > 70% speech, 20% transcription, 10% text > Apache 2.0 licensed ⚡ Architecture: > WhisperSpeech/ VQ for Semantic Tokens > Llama 3.1 8B Instruct for Text backbone > Early fusion (Chameleon) I'm super bullish on Homebrew to Menlo and early fusion, audio and text, multimodal models! (P.S. Play with the demo on Hugging Face)

Multimodal Ichigo Llama 3.1 - Real Time Voice AI 🔥 > WhisperSpeech X Llama 3.1 8B > Trained on 50K hours of speech (7 languages) > Continually trained on 45hrs 10x A1000s > MLS -> WhisperVQ tokens -> Llama 3.1 > Instruction tuned on 1.89M samples > 70% speech, 20% transcription, 10% text > Apache 2.0 licensed ⚡ Architecture: > WhisperSpeech/ VQ for Semantic Tokens > Llama 3.1 8B Instruct for Text backbone > Early fusion (Chameleon) I'm super bullish on Homebrew to Menlo and early fusion, audio and text, multimodal models! (P.S. Play with the demo on Hugging Face)

Vaibhav (VB) Srivastav

82,126 views • 1 year ago

From text to reality: MIT researchers find new ways to use LLMs to help in design & manufacturing. These models can convert text prompts to CAD, generate manufacturing instructions, and search for optimal designs:

From text to reality: MIT researchers find new ways to use LLMs to help in design & manufacturing. These models can convert text prompts to CAD, generate manufacturing instructions, and search for optimal designs:

MIT CSAIL

59,152 views • 2 years ago

Can GPT-4-Vision Play Super Mario 64? I created 'Multimodal Gamer,' a framework enabling multi-modal models (combining text and visual inputs) to play games. Check out my video overview below and let me know your thoughts!

Can GPT-4-Vision Play Super Mario 64? I created 'Multimodal Gamer,' a framework enabling multi-modal models (combining text and visual inputs) to play games. Check out my video overview below and let me know your thoughts!

Josh

22,233 views • 2 years ago

Explore state-of-the-art multimodal prompting in our new short course Large Multimodal Model Prompting with Gemini, taught by Erwin Huizenga in collaboration with Google Cloud. One interesting insight from this course: with multimodal models, prompt structure matters significantly. Placing text inputs, such as a patient's medical history, before image inputs, like an X-ray, can enhance the model's ability to contextualize and interpret visual data effectively. In other contexts, such as image captioning, you may get better results by putting the image first. Multimodal models behave differently than text-only LLMs, and effective prompting for models varies depending on the model you’re using. In this course you’ll learn how to effectively prompt Gemini models. Gemini's multimodal capabilities also enable new approaches in AI application development, for example: - The Gemini library handles various video formats (MP4, MOV, MPEG), streamlining applications using these formats. - Large context window (up to 1 million tokens) enables processing of extensive content, like analyzing multiple 50-minute videos simultaneously. - Function calling feature integrates real-time data (e.g., current exchange rates) into model responses. The course demonstrates building multimodal applications with real-world examples including document analyzers that reason across text and graphs simultaneously, video content extractors that find and timestamp specific information from multiple hours of footage, and automated expense report systems processing receipt images while cross-referencing company policies. Sign up here:

Explore state-of-the-art multimodal prompting in our new short course Large Multimodal Model Prompting with Gemini, taught by Erwin Huizenga in collaboration with Google Cloud. One interesting insight from this course: with multimodal models, prompt structure matters significantly. Placing text inputs, such as a patient's medical history, before image inputs, like an X-ray, can enhance the model's ability to contextualize and interpret visual data effectively. In other contexts, such as image captioning, you may get better results by putting the image first. Multimodal models behave differently than text-only LLMs, and effective prompting for models varies depending on the model you’re using. In this course you’ll learn how to effectively prompt Gemini models. Gemini's multimodal capabilities also enable new approaches in AI application development, for example: - The Gemini library handles various video formats (MP4, MOV, MPEG), streamlining applications using these formats. - Large context window (up to 1 million tokens) enables processing of extensive content, like analyzing multiple 50-minute videos simultaneously. - Function calling feature integrates real-time data (e.g., current exchange rates) into model responses. The course demonstrates building multimodal applications with real-world examples including document analyzers that reason across text and graphs simultaneously, video content extractors that find and timestamp specific information from multiple hours of footage, and automated expense report systems processing receipt images while cross-referencing company policies. Sign up here:

Andrew Ng

73,915 views • 1 year ago

Introducing "Building with Llama 4." This short course is created with Meta AI at Meta, and taught by Amit Sangani, Director of Partner Engineering for Meta’s AI team. Meta’s new Llama 4 has added three new models and introduced the Mixture-of-Experts (MoE) architecture to its family of open-weight models, making them more efficient to serve. In this course, you’ll work with two of the three new models introduced in Llama 4. First is Maverick, a 400B parameter model, with 128 experts and 17B active parameters. Second is Scout, a 109B parameter model with 16 experts and 17B active parameters. Maverick and Scout support long context windows of up to a million tokens and 10M tokens, respectively. The latter is enough to support directly inputting even fairly large GitHub repos for analysis! In hands-on lessons, you’ll build apps using Llama 4’s new multimodal capabilities including reasoning across multiple images and image grounding, in which you can identify elements in images. You’ll also use the official Llama API, work with Llama 4’s long-context abilities, and learn about Llama’s newest open-source tools: its prompt optimization tool that automatically improves system prompts and synthetic data kit that generates high-quality datasets for fine-tuning. If you need an open model, Llama is a great option, and the Llama 4 family is an important part of any GenAI developer's toolkit. Through this course, you’ll learn to call Llama 4 via API, use its optimization tools, and build features that span text, images, and large context. Please sign up here:

Introducing "Building with Llama 4." This short course is created with Meta AI at Meta, and taught by Amit Sangani, Director of Partner Engineering for Meta’s AI team. Meta’s new Llama 4 has added three new models and introduced the Mixture-of-Experts (MoE) architecture to its family of open-weight models, making them more efficient to serve. In this course, you’ll work with two of the three new models introduced in Llama 4. First is Maverick, a 400B parameter model, with 128 experts and 17B active parameters. Second is Scout, a 109B parameter model with 16 experts and 17B active parameters. Maverick and Scout support long context windows of up to a million tokens and 10M tokens, respectively. The latter is enough to support directly inputting even fairly large GitHub repos for analysis! In hands-on lessons, you’ll build apps using Llama 4’s new multimodal capabilities including reasoning across multiple images and image grounding, in which you can identify elements in images. You’ll also use the official Llama API, work with Llama 4’s long-context abilities, and learn about Llama’s newest open-source tools: its prompt optimization tool that automatically improves system prompts and synthetic data kit that generates high-quality datasets for fine-tuning. If you need an open model, Llama is a great option, and the Llama 4 family is an important part of any GenAI developer's toolkit. Through this course, you’ll learn to call Llama 4 via API, use its optimization tools, and build features that span text, images, and large context. Please sign up here:

Andrew Ng

67,587 views • 1 year ago

3D AI is leveling up! Rodin 3D AI can create stunning, high-quality 3D models from just text or image inputs. And with its latest update, it can even generate 8K HDRI textures to bring your models to life. Check out the link in the comments!

3D AI is leveling up! Rodin 3D AI can create stunning, high-quality 3D models from just text or image inputs. And with its latest update, it can even generate 8K HDRI textures to bring your models to life. Check out the link in the comments!

el.cine

46,032 views • 1 year ago

LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models paper page: github: Recent advancements in text-to-image generation with diffusion models have yielded remarkable results synthesizing highly realistic and diverse images. However, these models still encounter difficulties when generating images from prompts that demand spatial or common sense reasoning. We propose to equip diffusion models with enhanced reasoning capabilities by using off-the-shelf pretrained large language models (LLMs) in a novel two-stage generation process. First, we adapt an LLM to be a text-guided layout generator through in-context learning. When provided with an image prompt, an LLM outputs a scene layout in the form of bounding boxes along with corresponding individual descriptions. Second, we steer a diffusion model with a novel controller to generate images conditioned on the layout. Both stages utilize frozen pretrained models without any LLM or diffusion model parameter optimization. We validate the superiority of our design by demonstrating its ability to outperform the base diffusion model in accurately generating images according to prompts that necessitate both language and spatial reasoning. Additionally, our method naturally allows dialog-based scene specification and is able to handle prompts in a language that is not well-supported by the underlying diffusion model.

LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models paper page: github: Recent advancements in text-to-image generation with diffusion models have yielded remarkable results synthesizing highly realistic and diverse images. However, these models still encounter difficulties when generating images from prompts that demand spatial or common sense reasoning. We propose to equip diffusion models with enhanced reasoning capabilities by using off-the-shelf pretrained large language models (LLMs) in a novel two-stage generation process. First, we adapt an LLM to be a text-guided layout generator through in-context learning. When provided with an image prompt, an LLM outputs a scene layout in the form of bounding boxes along with corresponding individual descriptions. Second, we steer a diffusion model with a novel controller to generate images conditioned on the layout. Both stages utilize frozen pretrained models without any LLM or diffusion model parameter optimization. We validate the superiority of our design by demonstrating its ability to outperform the base diffusion model in accurately generating images according to prompts that necessitate both language and spatial reasoning. Additionally, our method naturally allows dialog-based scene specification and is able to handle prompts in a language that is not well-supported by the underlying diffusion model.

AK

83,657 views • 2 years ago

Exciting day today as Meta launches its new Llama 3.1 models and they’re available in Bedrock immediately. Llama 3.1 is a substantial step forward and the models (highlighted by the powerful 405B model) are impressive and powerful. Customers are gonna love these high performance models. Giddy up!

Exciting day today as Meta launches its new Llama 3.1 models and they’re available in Bedrock immediately. Llama 3.1 is a substantial step forward and the models (highlighted by the powerful 405B model) are impressive and powerful. Customers are gonna love these high performance models. Giddy up!

Andy Jassy

45,000 views • 1 year ago

Introducing OpenAI o3 and o4-mini—our smartest and most capable models to date. For the first time, our reasoning models can agentically use and combine every tool within ChatGPT, including web search, Python, image analysis, file interpretation, and image generation.

Introducing OpenAI o3 and o4-mini—our smartest and most capable models to date. For the first time, our reasoning models can agentically use and combine every tool within ChatGPT, including web search, Python, image analysis, file interpretation, and image generation.

OpenAI

3,720,026 views • 1 year ago

How do we build multimodal systems that work effectively across the globe? 🌍 Today we release the Aya Vision Technical Report, the detailed recipe behind Aya Vision models, unifying state-of-the-art multilingual capabilities in multimodal and text tasks across 23 languages!

How do we build multimodal systems that work effectively across the globe? 🌍 Today we release the Aya Vision Technical Report, the detailed recipe behind Aya Vision models, unifying state-of-the-art multilingual capabilities in multimodal and text tasks across 23 languages!

Cohere Labs

15,571 views • 1 year ago

Llama 3.2 can do selective image editing, and it’s really impressive!

Llama 3.2 can do selective image editing, and it’s really impressive!

AshutoshShrivastava

247,292 views • 1 year ago