Tracking Anything with Decoupled Video Segmentation paper page: Training data for video segmentation are expensive to annotate. This impedes extensions of end-to-end algorithms to new video segmentation tasks, especially in large-vocabulary settings. To 'track anything' without training on video data for every individual task, we develop a decoupled video segmentation approach (DEVA), composed of task-specific image-level segmentation and class/task-agnostic bi-directional temporal propagation. Due to this design, we only need an image-level model for the target task (which is cheaper to train) and a universal temporal propagation model which is trained once and generalizes across tasks. To effectively combine these two modules, we use bi-directional propagation for (semi-)online fusion of segmentation hypotheses from different frames to generate a coherent segmentation. We show that this decoupled formulation compares favorably to end-to-end approaches in several data-scarce tasks including large-vocabulary video panoptic segmentation, open-world video segmentation, referring video segmentation, and unsupervised video object segmentation.
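
To make the decoupled idea in the abstract more concrete, here is a toy sketch of one fusion step: masks carried forward by a temporal propagation module are reconciled with fresh masks from an image-level model. This is not the paper's implementation. The greedy IoU matching rule, the 0.5 threshold, and the names `iou`, `merge_segments`, and `box` are all assumptions made for illustration; the actual method uses a learned propagation model and its own fusion procedure.

```python
from typing import Dict, List, Set, Tuple

Pixel = Tuple[int, int]
Mask = Set[Pixel]  # a mask is the set of (row, col) pixels it covers


def iou(a: Mask, b: Mask) -> float:
    """Intersection-over-union of two pixel-set masks."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def merge_segments(
    propagated: Dict[int, Mask],
    detections: List[Mask],
    iou_threshold: float = 0.5,  # hypothetical value, not from the paper
) -> Dict[int, Mask]:
    """Fuse propagated track masks with image-level detections.

    Assumed rule (illustrative only): each detection is greedily matched
    to the unmatched track it overlaps most. A match at or above the
    threshold refreshes that track's mask, an unmatched detection starts
    a new track, and an unmatched track keeps its propagated mask.
    """
    merged = dict(propagated)
    next_id = max(propagated, default=0) + 1
    matched: Set[int] = set()
    for det in detections:
        best_id, best_iou = None, 0.0
        for track_id, mask in propagated.items():
            if track_id in matched:
                continue
            score = iou(mask, det)
            if score > best_iou:
                best_id, best_iou = track_id, score
        if best_id is not None and best_iou >= iou_threshold:
            merged[best_id] = det
            matched.add(best_id)
        else:
            merged[next_id] = det
            next_id += 1
    return merged


def box(r0: int, c0: int, r1: int, c1: int) -> Mask:
    """Helper: rectangular mask covering rows r0..r1-1 and cols c0..c1-1."""
    return {(r, c) for r in range(r0, r1) for c in range(c0, c1)}


if __name__ == "__main__":
    tracks = {1: box(0, 0, 4, 4)}                          # propagated from earlier frames
    new_masks = [box(0, 1, 4, 5), box(10, 10, 12, 12)]     # from the image-level model
    result = merge_segments(tracks, new_masks)
    print(sorted(result))  # track 1 is refreshed, a new track 2 appears
```

The point of the sketch is the division of labor: the image-level model never needs to know about time, and the fusion step never needs to know about the task, which is why only the image-level model has to be retrained for a new task.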

AK

469,377 subscribers

305,560 views • 2 years ago • via X (Twitter)

Science & Technology Education

10 Comments

Enrique Moreno • 2 years ago

Next phase is to track the traffic in India. If you can do that, you have perfected the technology.

kache • 2 years ago

project page

B0tak 👺 Zaddy • 2 years ago

A lot of word salad to me. Should of listened more at school.

Christopher Moonlight Productions • 2 years ago

Can it be an extension of Automatic 1111? This is rad.

Alessandro Lamberti • 2 years ago

Is the code available? Seems amazing!

Egido Val • 2 years ago

wow.

T • 2 years ago

poor beings

WHNBH • 2 years ago

@SaveToNotion #tweet #ai

Max Ivy • 2 years ago

We should consider the computational overhead of using bi-directional propagation in real-time applications. How should it scale with longer videos or higher resolutions?

Not Financial Advice • 2 years ago

What do the numbers represent,,,, .71,,, .57, etc?

Related Videos

Track Anything: Segment Anything Meets Videos Track-Anything is a flexible and interactive tool for video object tracking and segmentation suitable for: - Video object tracking and segmentation with shot changes. - Visualized development and data annotation for video object tracking and segmentation. - Object-centric downstream video tasks, such as video inpainting and editing. abs: github:

AK

578,577 views • 3 years ago

The Segment Anything Model (SAM) by Meta AI is a step toward the first foundation model for image segmentation. SAM is capable of one-click segmentation of any object from photos or videos + zero-shot transfer to other segmentation tasks. Try the demo ➡️

AI at Meta

186,324 views • 3 years ago

Bussing plates after a healthy meal. Our robot only eats plastic fruits and veggies for now. Video segmentation accomplished using Track-Anything, an architecture combining Meta's SAM to generate a zero-shot segmentation prompt and Xmem to enable long-horizon, temporally consistent video segmentation masks.

Watney Robotics

14,856 views • 2 years ago

SuperGSeg: Open-Vocabulary 3D Segmentation with Structured Super-Gaussians Contributions: • We propose SuperGSeg: a 3D segmentation method with neural Gaussians, designed to learn hierarchical instance segmentation features from 2D foundation models. • We introduce the concept of Super-Gaussian, a novel representation that integrates hierarchical instance segmentation features, enabling the embedding of high-dimensional language features. This approach addresses previously unfeasible challenges in representing complex scenes with rich semantic details. • Extensive experiments on the LERF-OVS and ScanNet datasets demonstrate the effectiveness of the proposed method, achieving significant improvements in open-vocabulary 3D object-level and scene-level semantic segmentation. It shows particular strength in capturing fine-grained scene details and dense pixel semantic segmentation tasks for the first time.

MrNeRF

13,594 views • 1 year ago

Agricultural Field Boundary Delineation (Instance Segmentation) with GeoAI Learn an end-to-end GeoAI workflow for agricultural field boundary delineation using instance segmentation and the Fields of the World dataset. This workflow can be adapted to detect other object types (not just field boundaries) as long as training data is available. Video tutorial: Notebook: Web app: #geoai #geospatial #deeplearing

Qiusheng Wu

19,922 views • 3 months ago

🚀 The GeoAI QGIS Plugin is here 🔥 You can run Moondream vision-language models, object detection, image segmentation (SAM 3), and even train your own geospatial segmentation model end-to-end. Website: GitHub: Short demo: Full video tutorial: #QGIS #GeoAI #SAM3 #Geospatial #DeepLearning #ComputerVision #OpenSource #Python

Qiusheng Wu

11,682 views • 6 months ago

Super excited to introduce SAM2 Studio! 🚀🤖 I've been getting a lot of questions lately on supporting AI inference tailored for patient data and sensitive workflows. We optimized SAM2 to run completely on-device in real time for all of your medical segmentation workflows - including surgical video segmentation, radiographs, pathology slides and more!

Cyril Zakka, MD

19,814 views • 1 year ago

🚀 The Segment Anything Model (SAM) has been upgraded to SAM2, featuring an efficient image encoder for segmenting images and videos. But does SAM2 outperform SAM1 in medical image and video segmentation? We're thrilled to present our paper "Segment Anything in Medical Images and Videos: Benchmark and Deployment"! We comprehensively benchmark SAM2 across 11 medical image modalities and videos. 📄 Paper: 💻 Code: **Highlights:** 1. SAM2 doesn’t always outperform SAM1 in 2D medical images, but excels in video segmentation, making it more accurate and efficient for 3D images, such as CT and MR scans. 2. MedSAM still outperforms SAM2 on most 2D modalities, but SAM2 surpasses MedSAM for 3D image segmentation in a slice-by-slice approach. 3. Segmentation performance varies with model size; sometimes the smallest model outperforms larger ones. 4. Fine-tuning SAM2 significantly boosts its performance for medical image segmentation. While SAM2 may struggle with challenging objects that have unclear boundaries or low contrast, it excels in generating good initial segmentation masks for common medical images and videos. However, the official interface doesn’t support medical data formats and has limitations on video length. To address this, we've developed a 3D Slicer Plugin and Gradio API for efficient 3D medical image and video segmentation. We invite you to try them out and provide feedback! 🔧 Deployment: - 3D Slicer Plugin: - Gradio API: (Note: Due to GPU limitations, the online API is available for only 12 hours and may be slow. We highly recommend deploying the Gradio API with your own computing resources: A big shoutout to Jun Ma (JunMa) who recently joined our UHN AI hub (UHN AI Hub) as Machine Learning Lead, and kudos to all co-authors: Sumin Kim, Feifei Li, Mohammed Baharoon (Mohammed Baharoon), Reza Asakereh, and Hongwei Lyu! This is true teamwork! Looking forward to collaborating with the community to advance 3D medical image and video segmentation foundation models! 
University Health Network U of T Department of Computer Science Department of Laboratory Medicine & Pathobiology Temerty Centre for AI in Medicine (T-CAIREM) Vector Institute #MedTech #AIinHealthcare #DeepLearning #MedicalImaging #SAM2 #MedSAM #AIResearch

Bo Wang

178,455 views • 1 year ago

Here’s a sneak peek using Rerun and Gradio for data annotation. It uses Video Depth Anything and Segment Anything 2 under the hood to generate segmentation masks and depth maps/point clouds. More to share next week.

Pablo Vela

36,719 views • 1 year ago

To Adapt or Not to Adapt? Real-Time Adaptation for Semantic Segmentation paper page: The goal of Online Domain Adaptation for semantic segmentation is to handle unforeseeable domain changes that occur during deployment, like sudden weather events. However, the high computational costs associated with brute-force adaptation make this paradigm unfeasible for real-world applications. In this paper we propose HAMLET, a Hardware-Aware Modular Least Expensive Training framework for real-time domain adaptation. Our approach includes a hardware-aware back-propagation orchestration agent (HAMT) and a dedicated domain-shift detector that enables active control over when and how the model is adapted (LT). Thanks to these advancements, our approach is capable of performing semantic segmentation while simultaneously adapting at more than 29FPS on a single consumer-grade GPU. Our framework's encouraging accuracy and speed trade-off is demonstrated on OnDA and SHIFT benchmarks through experimental results.

AK

18,871 views • 2 years ago

Segment Anything is fun and powerful but still requires full mask annotation for supervised training. Our new work MaskFreeVIS appearing at #CVPR2023 shows we can get accurate video instance segmentation WITHOUT any video or image mask annotations. More at

Fisher Yu

41,103 views • 3 years ago

VideoMaMa from KAIST: A mask-to-matte model, converts coarse video segmentation into pixel-accurate alpha mattes; looks quite good. Based on SVD+DINOv3.

Wildminder

19,652 views • 5 months ago

FlowRVS - segmentation as a continuous deformation, mapping video latents directly to masks via an ODE. Built on Wan’s T2V. - complex semantic understanding with temporal consistency. - no flickering

Wildminder

26,559 views • 3 months ago

Meet SAM 3, a unified model that enables detection, segmentation, and tracking of objects across images and videos. SAM 3 introduces some of our most highly requested features like text and exemplar prompts to segment all objects of a target category. Learnings from SAM 3 will help power new features in Instagram Edits and Vibes, bringing advanced segmentation capabilities directly to creators. 🔗 Learn more:

AI at Meta

189,813 views • 7 months ago

Accepted by #CVPR2023! X-Decoder is the FIRST generalist decoder that supports all segmentation tasks (ins/sem/pano/ref) in OPEN VOCABULARY, both inter- AND intra-image VL tasks, and even helps instruct image inpainting/editing! New demo below and more at

Jianwei Yang

51,930 views • 3 years ago

Stable Diffusion generates beautiful images, but can it be used for open-world recognition? Try Demo! Our #CVPR2023 paper shows that the pre-trained diffusion model indeed is a good image parser, allows for open-vocabulary segmentation and detection.

Xiaolong Wang

241,225 views • 3 years ago

🚀 A brand-new feature is coming to the SAMGeo Python package: interactive remote-sensing image segmentation with text prompts powered by SAM3. This update makes geospatial segmentation even more intuitive: just describe what you want to detect, and let GeoAI do the rest. ✨ GitHub PR: #GeoAI #SAM3 #OpenSource #Python

Qiusheng Wu

32,869 views • 6 months ago

Three new open-source models just landed in ComfyUI natively: → Gemma 4 (Google DeepMind) - multimodal LLM handling text, image, audio, and video input with built-in step-by-step reasoning mode → VOID (Netflix) - video object removal that also erases shadows, reflections, and physical interactions caused by the removed subject → BiRefNet - high-res background & object segmentation, one of the most-used segmentation models in the ecosystem Workflows and blog linked below 👇

ComfyUI

218,300 views • 1 month ago

SAM 2 from Meta FAIR is the first unified model for real-time, promptable object segmentation in images & videos. Using the model in our web-based demo you can segment, track and apply effects to objects in video in just a few clicks. Try SAM 2 ➡️

AI at Meta

88,918 views • 1 year ago