Idefics3-Llama is out! 💥 It's a multimodal model based on Llama 3.1 that accepts an arbitrary number of images interleaved with text, with a huge context window (10k tokens!) 😍 Link to demo and model in the next one 😏

merve

87,177 subscribers

28,014 views • 1 year ago • via X (Twitter)

10 Comments

merve • 1 year ago

Link to model: Try the demo right away: Use the model with @huggingface transformers 🤗

merve • 1 year ago

I will release fine-tuning scripts and quantized versions tomorrow, don't fret 😄

Andi Marafioti • 1 year ago

Wow merve, how cool is this 😍

Silviu Paun • 1 year ago

@mervenoyann The model seems great! More of a general question: when finetuning, any efficient strategies you can recommend that will preserve the original capabilities of the model?

merve • 1 year ago

Currently I'm trying to fine-tune, but there's a small bug we're trying to fix 🥲 I feel like if you want to preserve the original model, a low-rank adapter would work better than full fine-tuning.

Mihai Chirculescu • 1 year ago

Do you have training scripts for LoRA fine-tuning it?

merve • 1 year ago

I will release them sometime tomorrow 😊 along with quantized checkpoints

Mihai Chirculescu • 1 year ago

Does it accept only one image per user input (which will be resized to 384x384)?

merve • 1 year ago

No, I think you can provide multiple images, but you have to provide as many image tokens explicitly (@HugoLaurencon knows better). In this demo this isn't the case, though.

Furkan Gözükara • 1 year ago

Wow, this looks amazing for image captioning, it gave a really good caption. Did you investigate which prompt to use for this task?

Related Videos

Multimodal Ichigo Llama 3.1 - Real Time Voice AI 🔥 > WhisperSpeech X Llama 3.1 8B > Trained on 50K hours of speech (7 languages) > Continually trained on 45hrs 10x A1000s > MLS -> WhisperVQ tokens -> Llama 3.1 > Instruction tuned on 1.89M samples > 70% speech, 20% transcription, 10% text > Apache 2.0 licensed ⚡ Architecture: > WhisperSpeech/ VQ for Semantic Tokens > Llama 3.1 8B Instruct for Text backbone > Early fusion (Chameleon) I'm super bullish on Homebrew to Menlo and early fusion, audio and text, multimodal models! (P.S. Play with the demo on Hugging Face)

Vaibhav (VB) Srivastav

82,126 views • 1 year ago

Introducing "Building with Llama 4." This short course is created with Meta AI at Meta, and taught by Amit Sangani, Director of Partner Engineering for Meta’s AI team. Meta’s new Llama 4 has added three new models and introduced the Mixture-of-Experts (MoE) architecture to its family of open-weight models, making them more efficient to serve. In this course, you’ll work with two of the three new models introduced in Llama 4. First is Maverick, a 400B parameter model, with 128 experts and 17B active parameters. Second is Scout, a 109B parameter model with 16 experts and 17B active parameters. Maverick and Scout support long context windows of up to a million tokens and 10M tokens, respectively. The latter is enough to support directly inputting even fairly large GitHub repos for analysis! In hands-on lessons, you’ll build apps using Llama 4’s new multimodal capabilities including reasoning across multiple images and image grounding, in which you can identify elements in images. You’ll also use the official Llama API, work with Llama 4’s long-context abilities, and learn about Llama’s newest open-source tools: its prompt optimization tool that automatically improves system prompts and synthetic data kit that generates high-quality datasets for fine-tuning. If you need an open model, Llama is a great option, and the Llama 4 family is an important part of any GenAI developer's toolkit. Through this course, you’ll learn to call Llama 4 via API, use its optimization tools, and build features that span text, images, and large context. Please sign up here:

Andrew Ng

67,710 views • 1 year ago

You can now try Llama 3.1 405B for free (link below)! This is the largest open-source model out there, and for the first time, an open model is competitive with closed models. This time around, Meta did something new: Llama 3.1 has a license that allows developers to use it to enhance other models. For the first time, you can distill Llama 3.1 405B's capabilities into a smaller, more practical model for your use case. First, here is the link where you can play with Llama 3.1 for free: The model is hosted in Tune Studio, an end-to-end platform for developing applications using Large Language Models. They are sponsoring this post. Take a look at the attached video. It will show you how you can fine-tune a simple model using Llama 3.1 without leaving the platform: 1. You can create an empty dataset 2. Use the playground to generate and record interactions with Llama 3.1 3. Modify the dataset directly using the playground 4. Export the data and fine-tune a smaller model Fast and easy! As long as you have a web browser, you can start experimenting with fine-tuning and Llama 3.1. That's all it takes!

Santiago

55,609 views • 1 year ago

Gemini 3.1 Flash-Lite is the model to build always-on multimodal AI agents. I just built a memory agent with Google ADK that runs 24/7. Feed it text, images, audio, video, or PDFs. It reads, connects, and consolidates everything while you sleep. 100% open-source code.

Shubham Saboo

24,930 views • 3 months ago

This is the fastest I've seen Llama 3.3 running anywhere! Llama 3.3 70B running at 652 t/s is lightning fast. And if you want Llama 3.1, here are the speeds I was able to get: • Llama 3.1 8B: 1006 t/s • Llama 3.1 70B: 709 t/s • Llama 3.1 405B: 206 t/s (You can access all of these models for free! See the link below.) This speed is incredible, but the interesting part is what's happening behind the scenes: These models aren't running on a GPU! In this video, I'm using the SambaNova cloud to access these models. They built a custom chip (SN40L) optimized for AI workflows. A single SN40L chip can hold hundreds of models (trillions of parameters) in memory! The speed alone is a huge deal, but the big advantage is for agentic workflows running multiple specialized models. A GPU can only host a single model and switch (unload and load) to a different model if necessary. An SN40L, on the other hand, can host every model at once, making it much faster. Here is the video where you can see how fast these chips are:

Santiago

97,505 views • 1 year ago

🤯🤯 You can now create a chatbot on ANY GitHub repo using the Llama 3.1 405B model with Hugging Face assistants -- FOR FREE! 💰 This is insane! 🚀 Link: clem 🤗 Julien Chaumond

Satvik Paramkusham

162,848 views • 1 year ago

🆕 How to run (and finetune) open source AI models with a simple API! In 5 mins, I go over how to: ◆ Generate text with DeepSeek R1 & Llama 3 ◆ Generate code with Qwen on LlamaCoder ◆ Generate images with Flux on BlinkShot ◆ Finetune a model on your own data & run it

Hassan

30,236 views • 1 year ago

Multimodal AI is here 🤯 GPT-4 can now turn your images into a text file in a snap with the new code interpreter model. Witness the OCR magic in action 🔥

Shubham Saboo

727,655 views • 3 years ago

Our Llama-3.1-Nemotron-70B-Instruct model is a leading model on the 🏆 Arena Hard benchmark (85) from Arena. Arena Hard uses a data pipeline to build high-quality benchmarks from live data in Chatbot Arena, and is known for its predictive ability of Chatbot Arena Elo score as well as separability between helpful and less helpful models. Use our customized model Llama-3.1-Nemotron-70B to improve the helpfulness of LLM generated responses in your applications. 📥 Try on our API catalog: 📥 On GitHub: 📥 Or on Hugging Face:

NVIDIA AI Developer

140,756 views • 1 year ago

I just added the new Llama 3.2 1B and 3B models to LitGPT, the open-source LLM library I help develop (focused on efficiency and code readability). LitGPT allows you to fine-tune and use these models on the cloud or a laptop. So, if you are looking for something to play with this weekend:
# 1) Finetune the model
litgpt finetune_lora meta-llama/Llama-3.2-1B \
  --data JSON \
  --data.json_path my_custom_dataset.json \
  --train.epochs 1 \
  --out_dir out/llama-3.2-finetuned \
  --precision bf16-true
# 2) Chat with the model
litgpt chat out/llama-3.2-finetuned/final
# 3) Serve the model via an API endpoint
litgpt serve out/llama-3.2-finetuned/final

Sebastian Raschka

65,529 views • 1 year ago

With today’s launch of our Llama 3.1 collection of models we’re making history with the largest and most capable open source AI model ever released. 128K context length, multilingual support, and new safety tools. Download 405B and our improved 8B & 70B here.

Ahmad Al-Dahle

866,426 views • 1 year ago

🚀 Introducing Emu3.5 — a large-scale multimodal world model that natively predicts the next vision-language state. 🔥 Trained on over 10T interleaved vision-language tokens and enhanced with reinforcement learning, Emu3.5 achieves powerful multimodal reasoning and generation. ⚡ Powered by our new Discrete Diffusion Adaptation (DiDA) for 20× faster inference. 🔥 Emu3.5 outperforms Nano Banana across image generation, editing, interleaved tasks and more. 🌍 Explore Emu3.5: Github: #Emu3 #MultimodalAI #WorldModel #NextTokenPrediction

BAAI

51,880 views • 8 months ago

The video is a Llama v1 7B model implemented in MLX and running on an M2 Ultra. More here: * Train a Transformer LM or fine-tune with LoRA * Text generation with Mistral * Image generation with Stable Diffusion * Speech recognition with Whisper

Awni Hannun

66,565 views • 2 years ago

In December, we launched Gemini 1.0 Pro. Today, we're introducing Gemini 1.5 Pro! 🚀 This next-gen model uses a Mixture-of-Experts (MoE) approach for more efficient training & higher-quality responses. Gemini 1.5 Pro, our mid-sized model, will soon come standard with a 128K-token context window, but starting today, developers + customers can sign up for the limited Private Preview to try out 1.5 Pro with a groundbreaking and experimental 1 million token context window! The 1M tokens feature unlocks huge possibilities for devs - upload hundreds of pages of text, entire code repos, and long videos and let Gemini reason across them. It's still experimental and early and we’d love your feedback - learn more here.

Sundar Pichai

852,554 views • 2 years ago

Let's go hands-on with #GeminiAI. Our newest AI model can reason across different types of inputs and outputs — like images and text. See Gemini's multimodal reasoning capabilities in action ↓

Google

1,005,825 views • 2 years ago

jina-embeddings-v5-omni is here! Our first universal embedding model for text, images, audio, and video. Available in two sizes: small (1.57B, 1024-dim, 32K context) and nano (0.95B, 768-dim, 8K context). Both support Matryoshka truncation down to 32 dimensions. v5-omni is back-compatible: if you already use jina-embeddings-v5-text-small/nano, the existing text indexes work with v5-omni out of the box. Without reindexing the text, just index your multimodal content with v5-omni and start searching images, audio, and video.

Jina AI

133,435 views • 1 month ago

New short course Multimodal RAG: Chat with Videos, developed with Intel and taught by vasudevlal! In this course, you’ll work with LLaVA (Large Language and Vision Assistant), a Large Vision Language Model (LVLM) that can process both images and text. For example, given an image of a person doing a handstand on a skateboard at the beach, LLaVA doesn't just caption the scene, it’s able to predict possible outcomes, like the person losing balance or falling off. By understanding not just what's in a video frame, but what might happen next, your application can provide more insightful answers to questions about video. You'll build a full multimodal RAG pipeline that can chat about video content: - Use the BridgeTower model to create joint text-image embeddings in a 512-dimensional multimodal semantic space. - Learn video processing techniques to extract keyframes, generate transcripts using Whisper, and create captions. - Use the LanceDB vector database to store and retrieve high-dimensional multimodal embeddings. - Integrate the LLaVA model, combining CLIP's (Contrastive Language Image Pretraining) vision transformer with Llama, for advanced visual-textual reasoning. Your final system will ingest video data, generate embeddings for frames and text, perform similarity searches for relevant content, and use the retrieved multimodal context to inform LVLM-based response generation. The result is a system capable of answering nuanced questions about video content, effectively chatting about the video it has processed. Please sign up here!

Andrew Ng

107,548 views • 1 year ago

We've officially released and open-sourced HunyuanImage 2.1, our latest text-to-image model. The new model delivers on our commitment to balancing performance and quality. With native 2K image generation, HunyuanImage 2.1 is an advanced open-source text-to-image model.🎨 ✨ New in 2.1: 🔹Advanced Semantics: Supports ultra-long and complex prompts of up to 1000 tokens, and precisely controls the generation of multiple subjects in a single image. 🔹Precise Chinese and English Text Rendering with seamless image–text integration: The model naturally integrates text into images, making it suitable for a wide range of applications such as product covers, illustrations, and poster design to meet the needs of various fields. 🔹Rich Styles and High Aesthetic: Capable of generating images in various styles, including photorealistic portraits, comics, and vinyl figures, it delivers outstanding visual appeal and artistic quality. 🔹High-Quality Generation: Efficiently produces ultra-high-definition (2K) images in the same time other models take to generate a 1K image. HunyuanImage 2.1 uses two text encoders: a multimodal large language model (MLLM) to improve the model's image and text alignment capabilities, and a multi-language character-aware encoder to improve text rendering capabilities. The model is a single- and double-stream diffusion transformer with 17B parameters. We've also open-sourced the weights of the accelerated version with meanflow, which reduces inference steps from 100 to just 8, and PromptEnhancer, the first industrial-grade rewriting model that enhances your prompts for more nuanced and expressive image generation. Now, creators turn complex ideas, like posters with slogans or multi-panel comics, into visuals faster than ever. We’re just getting started. Stay tuned for our native multimodal image generation model coming soon. 🌐Website: 🔗Github: 🤗Hugging Face: ✨Hugging Face Demo:

Tencent Hy

89,257 views • 9 months ago

JUST IN: Google releases Gemini 1.5, a powerful MoE model. It's a huge breakthrough. The model has the longest context window ever seen: 1 million tokens. It can process 1 hour of video, 11 hours of audio, 30,000 lines of code, or 700,000 words in a single prompt. When tested on text, code, image, audio and video evaluations, 1.5 Pro outperforms 1.0 Pro on 87% of the benchmarks used for developing LLMs. You can sign up in AI Studio to try it out.

Lior Alexander

83,409 views • 2 years ago

WHY IS NO ONE TALKING ABOUT THIS?? Gemma 3n model was one of the best surprises for me. The fact that you can run it on edge devices even with just 2GB of RAM is impressive. A few weeks back, I was on holiday and used the Gemini Live feature a lot. But I kept running into issues whenever I was in a place where the network wasn’t reliable. Gemma 3n : > Multimodal: Supports text and image inputs; video/audio in development. > Context Window: Up to 128K tokens (32K for 1B model). > Multilingual: Trained on 140+ languages. > Privacy: Offline, on-device processing for data security. > Model Sizes: E2B (5B parameters, 2GB RAM) and E4B (10–12B parameters, 3GB RAM). Video credit: Google YT

AshutoshShrivastava

210,490 views • 1 year ago