SAMURAI vs. Meta AI's SAM 2! Traditional visual object tracking struggles in crowded, fast-moving, or self-occluded scenes, and so does SAM 2. Meet SAMURAI: a completely open-source adaptation of the Segment Anything Model for zero-shot visual tracking! Here's why it's a game-changer:
🚫 No need for retraining or fine-tuning
🎯 Boosts success...

363,264 views • 1 year ago • via X (Twitter)

9 Comments

Akshay 🚀 • 1 year ago

GitHub repo: _____ Interested in ML/AI Engineering? Sign up for our newsletter for in-depth lessons and get a FREE eBook with 150+ core DS/ML lessons:

BensenHsu • 1 year ago

The paper focuses on adapting the Segment Anything Model 2 (SAM 2) for visual object tracking, which is a challenging task for the original model. SAM 2 has shown strong performance in object segmentation, but it faces difficulties in handling crowded scenes with fast-moving or self-occluding objects. The improvements in tracking accuracy are attributed to the incorporation of motion information and the enhanced memory selection mechanism. These advancements help SAMURAI better handle challenging scenarios, such as crowded scenes and occlusions, where the original SAM 2 model struggles. full paper:
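
To make the idea in this comment concrete, here is a minimal sketch of how motion information can help pick the right object among look-alike candidates. It is not SAMURAI's actual code: the names `Box`, `predict_next`, `select_candidate`, and the weight `alpha` are hypothetical, and the constant-velocity guess is a simple stand-in for a proper motion model. The assumption is that the segmentation model proposes several candidate boxes with confidence scores, and the tracker blends each confidence with how well the candidate agrees with the predicted position.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned bounding box in pixel coordinates."""
    x1: float
    y1: float
    x2: float
    y2: float


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes (0.0 when they do not overlap)."""
    inter_w = max(0.0, min(a.x2, b.x2) - max(a.x1, b.x1))
    inter_h = max(0.0, min(a.y2, b.y2) - max(a.y1, b.y1))
    inter = inter_w * inter_h
    area_a = (a.x2 - a.x1) * (a.y2 - a.y1)
    area_b = (b.x2 - b.x1) * (b.y2 - b.y1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def predict_next(prev: Box, curr: Box) -> Box:
    """Constant-velocity guess: repeat the last frame-to-frame displacement."""
    return Box(
        2 * curr.x1 - prev.x1,
        2 * curr.y1 - prev.y1,
        2 * curr.x2 - prev.x2,
        2 * curr.y2 - prev.y2,
    )


def select_candidate(candidates, predicted: Box, alpha: float = 0.5) -> int:
    """Return the index of the candidate with the best blended score.

    candidates: list of (Box, confidence) pairs, confidence in [0, 1].
    alpha: hypothetical weight on motion agreement; 0 ignores motion entirely.
    """
    def score(item):
        box, confidence = item
        return alpha * iou(box, predicted) + (1 - alpha) * confidence

    return max(range(len(candidates)), key=lambda i: score(candidates[i]))


# The target moved 5 px to the right between the last two frames.
predicted = predict_next(Box(0, 0, 10, 10), Box(5, 0, 15, 10))

candidates = [
    (Box(10, 0, 20, 10), 0.6),   # where the target should be, lower confidence
    (Box(40, 40, 50, 50), 0.8),  # a similar-looking distractor elsewhere
]

chosen_with_motion = select_candidate(candidates, predicted, alpha=0.5)
chosen_without_motion = select_candidate(candidates, predicted, alpha=0.0)
```

With the motion term switched on, the on-track candidate wins even though the distractor has the higher confidence; with `alpha=0.0` the distractor is chosen instead, which is the kind of failure in crowded scenes that the comment describes.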

TechPat • 1 year ago

Very cool! Isn't SAM 2 open source too?

Eswar RB • 1 year ago

Perhaps a combination of different colour models could give promising results. This seems to work only on RGB: when smoke covers the subject, SAMURAI fails to capture it.

Rohan gupta • 1 year ago

Accuracy is so crazy

kaiban • 1 year ago

Awesome simulation

Akshay 🚀 • 1 year ago

Great choice of the video to test it! Loved it!

FlameJack • 1 year ago

Now this is what AI should be used for, not generative AI that burns resources for no reason other than a lack of care to learn to make things the human way, which is what gives things meaning.

Brandon Tyler • 1 year ago

I'm curious: do you know how AI tracking works in technologies like Hudl and Veo for basketball?

Related Videos

Everyone is sleeping on Meta's SAM 3 release. But it's actually a big deal. Here's why:

Companies spend millions paying humans to label images and videos frame by frame. A single autonomous driving dataset? Months of work, hundreds of annotators, millions in cost. Without labeled data, you can't train custom models. Without custom models, you're stuck with generic solutions. This is why most companies never move past pilots. SAM 3 breaks this cycle.

First, let's look at the evolution: SAM 1 segmented objects when you clicked on them. Revolutionary, but one object at a time. SAM 2 added video tracking with memory. Game-changing, but you still manually prompted every object. SAM 3 changes everything with text prompts. Type "yellow school bus" and it finds ALL of them in your image or video. Not just one. Every instance across thousands of frames.

Now here's where people get confused: "Can't I just use GPT-5 or Gemini for this?" No, and here's why that's a terrible approach. Large multimodal LLMs are great for reasoning, but they're slow and expensive for production visual tasks. You're paying API costs per image, waiting seconds for responses, getting inconsistent results. SAM 3 runs in 30 milliseconds on a single GPU for 100+ objects. That's 100x faster, and you own the infrastructure.

More importantly, SAM 3 gives you precise pixel-level masks, not descriptions. Try asking an LLM to segment every defective part on a manufacturing line in real time. It won't work. SAM 3 does this effortlessly.

The real breakthrough is their data engine. Meta built an AI-human hybrid system that's 5x faster for complex annotations. They trained SAM 3 on 4 million unique visual concepts - 50x more than existing benchmarks like LVIS. It handles everything:
- Text-based concept search
- Interactive refinement with clicks
- Video tracking across frames
- Zero-shot detection of new concepts

The model is open source. Weights, code, and benchmarks are on GitHub. If you're building computer vision applications, this is the foundation model to evaluate. The annotation time savings alone will pay for integration costs within weeks. Find the relevant links in the next tweet!

Akshay 🚀

46,404 views • 7 months ago