CVPR 2025 papers pt. 2 - SAMWISE

SAMWISE adds language understanding and temporal reasoning to SAM2; you can segment and track objects in videos just by describing them.
20,528 views • 1 year ago •via X (Twitter)
9 Comments


SAM2 supports visual prompts like points and boxes but has no native support for text prompts. I have often shown how combining SAM2 with VLMs enables language-guided image segmentation. SAMWISE allows direct text-driven video object segmentation.

SAM2 can make mistakes that, without human correction, persist in subsequent frames. SAMWISE can automatically correct its own mistakes.

SAMWISE uses a frozen Segment Anything 2 (SAM2) model and a frozen text encoder. It adds a special module called the Cross-Modal Temporal Adapter (CMT), which helps the model combine information from both the video and the text and follow changes over time.
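To make the adapter idea concrete, here is a toy sketch in plain Python. It is not the SAMWISE implementation: `toy_adapter`, `cross_attend`, and the `gate` parameter are hypothetical names, and the real module uses learned weights that are omitted here. It only shows the general pattern of leaving the encoder features untouched and adding a residual update built from attention over other frames (time) and over text tokens (language).

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attend(queries, keys_values):
    """Replace each query with a softmax-weighted average of keys_values."""
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys_values])
        out.append([sum(w * kv[d] for w, kv in zip(weights, keys_values))
                    for d in range(len(q))])
    return out

def toy_adapter(frame_feats, text_feats, gate=0.5):
    """Hypothetical adapter: residual update on top of frozen features.

    frame_feats: list of frames, each a list of patch vectors
                 (stand-in for frozen image encoder output)
    text_feats:  list of token vectors (stand-in for frozen text encoder output)
    gate:        assumed scalar controlling how much the adapter changes the features
    """
    # Temporal context: patches from every frame in the clip.
    all_patches = [p for frame in frame_feats for p in frame]
    adapted = []
    for frame in frame_feats:
        temporal = cross_attend(frame, all_patches)  # look across time
        textual = cross_attend(frame, text_feats)    # look at the prompt
        adapted.append([
            [p_d + gate * (t_d + x_d) for p_d, t_d, x_d in zip(p, t, x)]
            for p, t, x in zip(frame, temporal, textual)
        ])
    return adapted

# Two frames with two 2-d patches each, and a one-token "prompt".
frames = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [0.5, 0.5]]]
text = [[1.0, 0.0]]
adapted = toy_adapter(frames, text)
```

With `gate=0.0` the function returns the frozen features unchanged, which is the point of the residual design: the adapter can only add to what the frozen encoders already produce.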

The Conditional Memory Encoder (CME) helps the model notice when a new object fits your prompt better, so SAMWISE can automatically switch tracking, even if the correct object appears later or is hidden for a while.
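The switching behavior can be illustrated with a small decision rule. This is only a sketch under assumptions, not how the CME is actually implemented: `Candidate`, `text_score`, `margin`, and `min_score` are hypothetical, and the real module works on features and memory rather than on per-object scores. The idea it shows is: keep the current target by default, and switch only when another object matches the prompt clearly better.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    object_id: int
    text_score: float  # assumed prompt-alignment score in [0, 1]

def select_tracked_object(tracked_id, candidates, margin=0.1, min_score=0.5):
    """Return the object id to track after seeing this frame's candidates."""
    if not candidates:
        # Nothing visible (e.g. occlusion): keep the current target.
        return tracked_id
    best = max(candidates, key=lambda c: c.text_score)
    if tracked_id is None:
        # No target yet: start with the best match in this frame.
        return best.object_id
    current = next((c for c in candidates if c.object_id == tracked_id), None)
    current_score = current.text_score if current else 0.0
    # Switch only if the new object is a good match AND clearly better.
    if best.text_score >= min_score and best.text_score > current_score + margin:
        return best.object_id
    return tracked_id

# The better match (id 2) only shows up in the third frame.
frames = [
    [Candidate(1, 0.6)],
    [],  # everything hidden for a frame
    [Candidate(1, 0.6), Candidate(2, 0.9)],
]
tracked = None
history = []
for candidates in frames:
    tracked = select_tracked_object(tracked, candidates)
    history.append(tracked)
```

The `margin` is there so the tracker does not flip back and forth between two objects with nearly identical scores.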

full poster explaining text understanding, temporal modeling, tracking bias, and much more


Can it do this in real-time?

Oh we are DEFINITELY taking a look at this! Wow!!
