New 🤗 transformers release includes a very powerful Multimodel Large Language Model (MLLM) by Microsoft called KOSMOS-2! 🤩 The highlight of KOSMOS-2 is grounding, the model is incredibly accurate! 🌎 Play with the demo here 👉 But how does this model work? Let's take a look! 👀🧶

merve

88,571 subscribers

143,830 views • 2 years ago • via X (Twitter)

11 Comments

merve • 2 years ago

Grounding helps machine learning models relate to real-world examples. Including grounding makes models more performant in terms of accuracy and robustness during inference. It also helps reduce so-called "hallucinations" in language models.

merve • 2 years ago

In KOSMOS-2, the model is grounded to perform the following tasks and is evaluated on them 👇
- multimodal grounding & phrase grounding, e.g. localizing an object through a natural language query
- multimodal referring, e.g. describing object characteristics & location
- perception-language tasks
- language understanding and generation
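
To make the grounding and referring tasks a bit more concrete: going by the KOSMOS-2 model card, a bounding box is written into the text as a pair of location tokens over a 32×32 grid of image patches (top-left patch, then bottom-right patch). The sketch below illustrates that encoding under this assumption. The helper names `bbox_to_location_tokens`, `phrase_grounding_prompt` and `referring_prompt` are hypothetical and not part of transformers, and the prompt strings follow the model card's examples rather than anything stated in this thread.

```python
import math

# Assumption: KOSMOS-2 discretizes the image into a 32 x 32 grid of patches.
PATCHES_PER_SIDE = 32


def bbox_to_location_tokens(bbox, patches_per_side=PATCHES_PER_SIDE):
    """Map a normalized (x1, y1, x2, y2) box to (top-left, bottom-right) tokens."""
    x1, y1, x2, y2 = bbox
    if not (0.0 <= x1 < x2 <= 1.0 and 0.0 <= y1 < y2 <= 1.0):
        raise ValueError("bbox must be normalized with x1 < x2 and y1 < y2")

    # Grid cell containing the top-left corner.
    ul_x = math.floor(x1 * patches_per_side)
    ul_y = math.floor(y1 * patches_per_side)
    # Grid cell containing the bottom-right corner.
    lr_x = math.ceil(x2 * patches_per_side - 1)
    lr_y = math.ceil(y2 * patches_per_side - 1)

    # Flatten (row, column) into a single patch index, row-major.
    ul_idx = ul_y * patches_per_side + ul_x
    lr_idx = lr_y * patches_per_side + lr_x
    return f"<patch_index_{ul_idx:04d}>", f"<patch_index_{lr_idx:04d}>"


def phrase_grounding_prompt(phrase):
    """Prompt asking the model to localize `phrase` in the image."""
    return f"<grounding><phrase> {phrase}</phrase>"


def referring_prompt(bbox, phrase="It", continuation=" is"):
    """Prompt asking the model to describe the region given by `bbox`."""
    ul, lr = bbox_to_location_tokens(bbox)
    return f"<grounding><phrase> {phrase}</phrase><object>{ul}{lr}</object>{continuation}"
```

Phrase grounding goes from text to a box (the model is expected to emit the location tokens), while referring goes from a box to text (the location tokens are supplied in the prompt).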

merve • 2 years ago

The dataset used for grounding, called GRIT, is also available on the Hugging Face Hub 👉 Thanks to the transformers integration, you can use KOSMOS-2 with a few lines of code 🤩 See below! 👇
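
The code from the original post did not survive on this page, so here is a minimal sketch of what using KOSMOS-2 through transformers could look like. It is not the author's snippet. It assumes the `microsoft/kosmos-2-patch14-224` checkpoint on the Hub and a transformers version that includes the KOSMOS-2 integration. `ground_caption` and `to_pixel_boxes` are hypothetical helper names, and the `(phrase, text span, boxes)` entity layout is an assumption based on what the processor's `post_process_generation` returns.

```python
def ground_caption(image, prompt="<grounding>An image of",
                   checkpoint="microsoft/kosmos-2-patch14-224"):
    """Generate a grounded caption for a PIL image (downloads the checkpoint)."""
    # Imported lazily so the pure-Python helper below works without transformers.
    from transformers import AutoProcessor, Kosmos2ForConditionalGeneration

    processor = AutoProcessor.from_pretrained(checkpoint)
    model = Kosmos2ForConditionalGeneration.from_pretrained(checkpoint)

    inputs = processor(text=prompt, images=image, return_tensors="pt")
    generated_ids = model.generate(
        pixel_values=inputs["pixel_values"],
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        image_embeds=None,
        image_embeds_position_mask=inputs["image_embeds_position_mask"],
        use_cache=True,
        max_new_tokens=64,
    )
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

    # Splits the raw output into a clean caption and a list of grounded entities.
    caption, entities = processor.post_process_generation(generated_text)
    return caption, entities


def to_pixel_boxes(entities, width, height):
    """Convert normalized entity boxes to pixel coordinates.

    Assumes each entity is (phrase, (start, end), [(x1, y1, x2, y2), ...])
    with box coordinates normalized to [0, 1].
    """
    boxes = []
    for phrase, _span, bboxes in entities:
        for x1, y1, x2, y2 in bboxes:
            boxes.append((phrase, (round(x1 * width), round(y1 * height),
                                   round(x2 * width), round(y2 * height))))
    return boxes
```

With a PIL image loaded as `image`, calling `caption, entities = ground_caption(image)` and then `to_pixel_boxes(entities, *image.size)` should give boxes that can be drawn directly onto the picture.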

merve • 2 years ago

also big kudos to @ydshieh for implementing this in transformers ✨

Rainmaker • 2 years ago

Can Machine Learning beat the market? Check out this post on my free Substack where I share code and commentary for an XGBoost model and a Random Forest model that both deliver powerful performances.

merve • 2 years ago

multimodal* 🥲

Luis C • 2 years ago

@Microsoft Wow, it's very fast on an A40

iamrobotbear (bk) • 2 years ago

@Microsoft License?

Vlad • 2 years ago

@ClementDelangue @Microsoft 🤔 I could use this as an improvement: I'm using the BLIP model and it's not bad, but this looks like it could give more accurate results.

SkalskiP • 2 years ago

@Microsoft Yup KOSMOS-2 is awesome!

Risichad 🦾 • 2 years ago

@Microsoft AI can understand a video at 3 fps then!!!
