SAMURAI vs. Meta AI's SAM 2!

Traditional visual object tracking struggles in crowded, fast-moving, or self-occluded scenes, and so does SAM 2. Meet SAMURAI: a completely open-source adaptation of the Segment Anything Model for zero-shot visual tracking!

Here's why it's a game-changer:
🚫 No need for retraining or fine-tuning
🎯 Boosts success rate and precision
🤖 Motion-aware memory selection
💪 Zero-shot performance on diverse datasets

But that's not all:
🔬 Refines mask selection
🔮 Predicts object motion effectively
📈 Gains: 7.1% AUC on LaSOT, 3.5% AO on GOT-10k
🏆 Competes with fully supervised methods without extra training

Link to the GitHub repo in the next tweet!
_____
Find me → Akshay 🚀 ✔️
For more insights & tutorials on AI and Machine Learning.

Akshay 🚀

276,312 subscribers

363,264 views • 1 year ago • via X (Twitter)

Science & Technology

9 Comments

Akshay 🚀 • 1 year ago

GitHub repo: _____ Interested in ML/AI Engineering? Sign up for our newsletter for in-depth lessons and get a FREE eBook with 150+ core DS/ML lessons:

BensenHsu • 1 year ago

The paper focuses on adapting the Segment Anything Model 2 (SAM 2) for visual object tracking, which is a challenging task for the original model. SAM 2 has shown strong performance in object segmentation, but it faces difficulties in handling crowded scenes with fast-moving or self-occluding objects. The improvements in tracking accuracy are attributed to the incorporation of motion information and the enhanced memory selection mechanism. These advancements help SAMURAI better handle challenging scenarios, such as crowded scenes and occlusions, where the original SAM 2 model struggles. full paper:

TechPat • 1 year ago

Very cool! Isn’t SAM 2 open source too?

Eswar RB • 1 year ago

Perhaps a combination of different colour models could yield promising results. This seems to work only on RGB: when smoke covers the subject, SAMURAI fails to capture it.

Rohan gupta • 1 year ago

Accuracy is so crazy

kaiban • 1 year ago

Awesome simulation

Akshay 🚀 • 1 year ago

Great choice of the video to test it! Loved it!

FlameJack • 1 year ago

Now this is what AI should be used for, not generative AI that uses resources for no reason other than a lack of care to learn to make things the human way, which gives things meaning.

Brandon Tyler • 1 year ago

I’m curious whether you understand how AI tracking works in technologies like Hudl and Veo for basketball?

Related Videos

SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware check out this SAM2 vs SAMURAI comparison! - paper: - code: - license: Apache-2.0

SkalskiP

124,358 views • 1 year ago

✨ CVPR 2025 highlight: A Distractor-Aware Memory for Visual Object Tracking with SAM2 the authors propose a new distractor-aware memory model for SAM2 and an introspection-based update strategy that jointly addresses the segmentation accuracy as well as tracking robustness 🏡 (1/n)🧵👇

GeekyRakshit (e/mad)

32,669 views • 1 year ago

SAMURAI gives SAM 2 motion-aware memory! And the results look mind-blowing. This is zero-shot 🤯 Links ⬇️

Dreaming Tulpa 🥓👑

116,471 views • 1 year ago

Microsoft has launched a powerful new data analysis tool! Introducing Data Formulator, a 100% open-source LLM-powered, no-code tool that transforms data in a snap and creates stunning visualizations. Key features include: 🤖 AI-powered data transformation 🖱️ Interactive drag-and-drop UI for visualizations 💬 Seamless blend of UI & natural language inputs But that’s not all: You can even create charts beyond your initial dataset. Data Formulator automatically identifies extra computation needs, generates fields for you, and outputs the final visualization. Find the GitHub repo in the next tweet! _____ Find me → Akshay 🚀 ✔️ For more insights and tutorials on AI and Machine Learning.

Akshay 🚀

280,449 views • 1 year ago

Remarkably lifelike motion and fluidity. BeyondMimic is a framework for training humanoid whole-body control from large mocap datasets. First, an open-source motion-tracking pipeline to reproduce diverse, highly dynamic human skills on real hardware, then distilling them into a guided state-action diffusion model for zero-shot, task-specific control. Project page:

The Humanoid Hub

58,081 views • 10 months ago

Track Anything: Segment Anything Meets Videos Track-Anything is a flexible and interactive tool for video object tracking and segmentation suitable for: - Video object tracking and segmentation with shot changes. - Visualized development and data annotation for video object tracking and segmentation. - Object-centric downstream video tasks, such as video inpainting and editing. abs: github:

AK

578,577 views • 3 years ago

Turn any GitHub repository into LLM-ready text! Simply replace "hub" with "ingest" in a GitHub URL and receive a prompt-friendly text ingest for LLMs. Gitingest is 100% open-source and provides: - Directory structure - A brief summary of the project - The entire content as LLM-ready text Plus, it comes with a nice python package and you can run the UI locally! Stay tuned, I'm working on something really cool with this!✨ Link to the GitHub repo in the next tweet! _____ Find me → Akshay 🚀 ✔️ For more insights and tutorials on ML and AI Engineering!

Akshay 🚀

191,335 views • 1 year ago
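
The URL trick described in the Gitingest post above can be sketched in a few lines. This is only an illustration: it assumes the post means swapping "hub" for "ingest" in the host name (`github.com` becomes `gitingest.com`), the function name `to_gitingest_url` is hypothetical, and the repository path in the usage note is a placeholder rather than a real project.

```python
def to_gitingest_url(github_url: str) -> str:
    """Rewrite a GitHub URL by replacing "hub" with "ingest" in the host name.

    Assumption: only the first occurrence of the host is rewritten, so a
    repository whose own name contains "github.com" keeps its path intact.
    """
    marker = "github.com"
    if marker not in github_url:
        raise ValueError(f"not a GitHub URL: {github_url!r}")
    return github_url.replace(marker, "gitingest.com", 1)
```

For example, `to_gitingest_url("https://github.com/example-user/example-repo")` returns the same path on the `gitingest.com` host.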

Everyone is sleeping on Meta's SAM 3 release. But it's actually a big deal. Here's why:

Companies spend millions paying humans to label images and videos frame by frame. A single autonomous driving dataset? Months of work, hundreds of annotators, millions in cost. Without labeled data, you can't train custom models. Without custom models, you're stuck with generic solutions. This is why most companies never move past pilots. SAM 3 breaks this cycle.

First let's look at the evolution:
SAM 1 segmented objects when you clicked on them. Revolutionary, but one object at a time.
SAM 2 added video tracking with memory. Game-changing, but you still manually prompted every object.
SAM 3 changes everything with text prompts. Type "yellow school bus" and it finds ALL of them in your image or video. Not just one. Every instance across thousands of frames.

Now here's where people get confused: "Can't I just use GPT-5 or Gemini for this?" No, and here's why that's a terrible approach. Large multimodal LLMs are great for reasoning, but they're slow and expensive for production visual tasks. You're paying API costs per image, waiting seconds for responses, getting inconsistent results. SAM 3 runs in 30 milliseconds on a single GPU for 100+ objects. That's 100x faster, and you own the infrastructure.

More importantly, SAM 3 gives you precise pixel-level masks, not descriptions. Try asking an LLM to segment every defective part on a manufacturing line in real-time. It won't work. SAM 3 does this effortlessly.

The real breakthrough is their data engine. Meta built an AI-human hybrid system that's 5x faster for complex annotations. They trained SAM 3 on 4 million unique visual concepts - 50x more than existing benchmarks like LVIS. Because SAM 3 is trained on 4 million unique visual concepts, it handles everything:
- Text-based concept search
- Interactive refinement with clicks
- Video tracking across frames
- Zero-shot detection of new concepts

The model is open source. Weights, code, and benchmarks are on GitHub. If you're building computer vision applications, this is the foundation model to evaluate. The annotation time savings alone will pay for integration costs within weeks. Find the relevant links in the next tweet!

Akshay 🚀

46,404 views • 7 months ago

Just read a neat AI paper called SAMURAI -- it takes SAM 2 (Meta's "segment anything" model) and makes it way better at tracking objects in videos. Basic problem is SAM 2 gets confused when things move fast or there's a crowd of similar objects (big problem for VFX and video intelligence alike). The fix? They basically gave it a sense of motion and smarter memory selection using our old friend the Kalman filter -- helps it remember where things were going instead of just where they are. No retraining needed and runs in realtime. Elegant solution that matches (and even beats!) a lot of purpose built tracking systems. Nice example of how sometimes you don't need fancy new architectures -- just smart ways to use what the model already knows, plus some tried & tested classical methods. Every time someone's like "we solved tracking!" I check the methods section and... yep, there's our boy Kalman, still crushing it after all these years. Sometimes the old school stuff just works.

Bilawal Sidhu

47,246 views • 1 year ago
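
The Kalman-filter idea in the post above (remember where the object is heading, then prefer the mask that agrees with that motion) can be illustrated with a toy sketch. This is not the SAMURAI code: the class `AxisKalman`, the function `select_candidate`, the noise values `q` and `r`, and the weight `alpha` are all assumptions chosen for illustration. The sketch runs one constant-velocity filter per coordinate of the box centre, then scores each candidate box by mixing its overlap with the predicted box and its own confidence.

```python
from dataclasses import dataclass


@dataclass
class AxisKalman:
    """Constant-velocity Kalman filter for one coordinate, one step per frame."""
    pos: float
    vel: float = 0.0
    p00: float = 1.0  # entries of the 2x2 state covariance P
    p01: float = 0.0
    p10: float = 0.0
    p11: float = 1.0
    q: float = 0.01   # process noise (assumed value)
    r: float = 1.0    # measurement noise (assumed value)

    def predict(self) -> float:
        """Advance the state by one frame: x = F x, P = F P F^T + Q."""
        self.pos += self.vel
        a = self.p00 + self.p10
        b = self.p01 + self.p11
        self.p00 = a + b + self.q
        self.p01 = b
        self.p10 = self.p10 + self.p11
        self.p11 = self.p11 + self.q
        return self.pos

    def update(self, z: float) -> None:
        """Correct the state with a measured position z."""
        s = self.p00 + self.r              # innovation variance
        k0, k1 = self.p00 / s, self.p10 / s  # Kalman gain
        y = z - self.pos                   # innovation
        self.pos += k0 * y
        self.vel += k1 * y
        p00, p01 = self.p00, self.p01
        self.p00 = (1 - k0) * p00
        self.p01 = (1 - k0) * p01
        self.p10 -= k1 * p00
        self.p11 -= k1 * p01


def iou(a, b) -> float:
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def select_candidate(predicted_box, candidates, alpha: float = 0.5) -> int:
    """Return the index of the (box, confidence) pair with the best mixed score."""
    scores = [alpha * iou(predicted_box, box) + (1 - alpha) * conf
              for box, conf in candidates]
    return max(range(len(scores)), key=scores.__getitem__)
```

With `alpha=0.0` the motion term is ignored and the most confident candidate always wins, which is the failure mode the post describes for crowds of similar objects; raising `alpha` lets the predicted trajectory break the tie.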

The Segment Anything Model (SAM) by Meta AI is a step toward the first foundation model for image segmentation. SAM is capable of one-click segmentation of any object from photos or videos + zero-shot transfer to other segmentation tasks. Try the demo ➡️

AI at Meta

186,324 views • 3 years ago

The latest Visual Studio Code release brings auto model selection (preview) - a way to have the best model picked for you based on current capacity and performance. Currently being rolled out to all GitHub Copilot users in Visual Studio Code starting with individuals. Learn more:

Visual Studio Code

67,645 views • 9 months ago

'LATENT' learns tennis skills for humanoid robots from human motion data. The robot can sustain multi-shot rallies, handle ball speeds of 15+ m/s, and showed a 90.9% success rate for the forehand. No onboard cameras or vision models, relies on external MoCap for high-precision, low-latency ball tracking. Paper:

The Humanoid Hub

63,824 views • 3 months ago

Today we're releasing WildDet3D—an open model for monocular 3D object detection in the wild. It works with text, clicks, or 2D boxes, and on zero-shot evals it nearly doubles the best prior scores. 🧵

Ai2

85,809 views • 2 months ago

Segment Anything Model 2 (SAM 2) is a foundation model from Meta FAIR for promptable visual segmentation in images & videos. Available now for anyone to build on for free, open source under an Apache license. Try the demo ➡️

AI at Meta

97,733 views • 1 year ago

Star Wars cost billions to make. But today, you can create videos like this for just a few dollars. Filmmaking is going to be disrupted by AI sooner than we thought. If I were to take a shot at it, here's what I'd use: - o3-mini-high for writing the script - Imagen 3 or Flux Pro for images - Kling 1.6 for the video There are also excellent unified platforms like ChatLLM, where you can access all these resources in one place. For developers like us, there's more under the same offering: CodeLLM as a coding assistant, and RouteLLM automatically selects the best model for your chat queries. The company claims that they see thousands of humans becoming AI creators every day using their platform! Maybe it's a new role in the making! I have shared a link to ChatLLM in the next tweet! ______ Find me → Akshay 🚀 ✔️ For more insights and tutorials on AI and Machine Learning.

Akshay 🚀

52,457 views • 1 year ago

Meet Any2Track, a new two-stage RL framework designed to address a core challenge for humanoid robots: reliably tracking diverse, dynamic motions in the real world. The framework consists of two key components: AnyTracker, a general motion tracker, and AnyAdapter, a module that allows the robot to adapt to real-world disturbances like uneven terrain or external forces. This approach gives robots the stability needed for practical use. The system has been successfully deployed on a Unitree G1 humanoid robot with zero-shot sim2real transfer, performing exceptionally well in various motion tracking tasks under multiple real-world disturbances. Project: Codebase:

RoboHub🤖

61,378 views • 9 months ago

‼️All-New Shot Tracking! Looking for testers!‼️ We’re about to roll out an all-new on-course shot tracking experience. • Auto shot tracking on Apple Watch • Low-touch, intuitive tracking on iPhone • Easy in-round edits • Scorecards And more! Comment 👇🏼 for early access

Shot Pattern

12,284 views • 2 months ago

Build your own local ChatGPT-like interface! Powered by Llama 3.2 Vision, it runs 100% locally and supports: ↳Text chat ↳Chat with images Tech stack: ↳ Chainlit for the UI ↳ Ollama for serving Llama 3.2 vision Everything is just 50 lines of code! I have shared link to the code in next tweet! _____ Find me → Akshay 🚀 ✔️ For more insights and tutorials on ML and AI Engineering!

Akshay 🚀

49,952 views • 1 year ago

Check out this drone: Joshua Bird built this drone ($20) in his dorm! [open-source motion capture system⬇️] Built at low cost, a motion capture system for tracking and flying drones autonomously, with millimeter-level precision at room-scale. The student used $1 PS3 Eye cameras with 150fps capability. The challenge? PID tuning! It took him 4 days of crashes to get the drone to hover, but it's still wobbly. Used a 3x nested PID loop for precise control. This project led to his dissertation on visual SLAM! Full details - Algorithms for camera positioning & obstacle triangulation, in his YouTube video: Code & 3D files on GitHub: More projects at

Ilir Aliu

101,649 views • 1 year ago
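
The "3x nested PID loop" mentioned in the drone post above can be sketched generically. The post does not say which quantities each loop controls or what gains were used, so everything here is an assumption: the `PID` class, the `cascaded_step` helper, and the gain values in the usage note are hypothetical. The only idea taken from the post is the nesting itself, where each loop's output becomes the setpoint of the next, faster loop.

```python
class PID:
    """Textbook PID controller with a running integral and finite-difference derivative."""

    def __init__(self, kp: float, ki: float, kd: float):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None  # no derivative term on the very first step

    def step(self, error: float, dt: float) -> float:
        self.integral += error * dt
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def cascaded_step(loops, setpoint: float, measurements, dt: float) -> float:
    """Run nested loops outermost first; each output is the next loop's setpoint."""
    target = setpoint
    for pid, measured in zip(loops, measurements):
        target = pid.step(target - measured, dt)
    return target
```

A three-level cascade might be built as `loops = [PID(2, 0, 0), PID(3, 0, 0), PID(0.5, 0, 0)]` and stepped once per control tick with one measurement per level, for instance position, velocity, and attitude, though that particular split is a guess rather than something the post states.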