CVPR 2025 papers pt. 2 - SAMWISE

SAMWISE adds language understanding and temporal reasoning to SAM2; you can segment and track objects in videos just by describing them.

more papers: ↓

20,528 views • 1 year ago • via X (Twitter)

9 comments

SkalskiP @ CVPR2025 🇺🇸 • 1 year ago

- paper: - code: - video:

SkalskiP @ CVPR2025 🇺🇸 • 1 year ago

SAM2 supports visual prompts like points and boxes but has no native support for text prompts. I have often shown how combining SAM2 with VLMs enables language-guided image segmentation. SAMWISE allows direct text-driven video object segmentation.

SkalskiP @ CVPR2025 🇺🇸 • 1 year ago

SAM2 can make mistakes that, without human correction, persist in subsequent frames. SAMWISE can auto-correct its own mistakes.

SkalskiP @ CVPR2025 🇺🇸 • 1 year ago

SAMWISE uses a frozen Segment Anything 2 (SAM2) model and a frozen text encoder. It adds a special module called the Cross-Modal Temporal Adapter (CMT), which helps the model combine information from both the video and the text and follow changes over time.
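
To make the "frozen backbone plus small adapter" idea concrete, here is a toy sketch in plain Python. It is not the SAMWISE implementation: `toy_adapter`, `cross_attend`, and the `gate` parameter are hypothetical names, each frame is reduced to a single feature vector, and nothing here is learned. It only illustrates the general pattern of leaving the frozen features intact and adding text context and temporal context on top as a residual.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attend(queries, keys_values):
    # Each query becomes a softmax-weighted mix of the key/value vectors
    # (scaled dot-product attention without learned projections).
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys_values])
        out.append([sum(w * kv[i] for w, kv in zip(weights, keys_values))
                    for i in range(len(q))])
    return out

def toy_adapter(frame_feats, text_feats, gate=0.5):
    # frame_feats: one feature vector per frame, as if produced by a frozen
    #              visual backbone (assumption: one vector per frame).
    # text_feats:  one feature vector per token, as if produced by a frozen
    #              text encoder.
    # gate:        hypothetical scalar controlling how much the adapter adds.
    text_ctx = cross_attend(frame_feats, text_feats)       # video attends to text
    temporal_ctx = cross_attend(frame_feats, frame_feats)  # frames attend to the clip
    # Residual update: the frozen features are kept, adapter output is added.
    return [[f + gate * (t + c) for f, t, c in zip(fv, tv, cv)]
            for fv, tv, cv in zip(frame_feats, text_ctx, temporal_ctx)]
```

With `gate=0.0` the function returns the frozen features unchanged, which is the usual sanity check for a residual adapter: the backbone's behaviour is the starting point, and the adapter only adds to it.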

SkalskiP @ CVPR2025 🇺🇸 • 1 year ago

Conditional Memory Encoder (CME) helps the model notice when a new object fits your prompt better, so SAMWISE can automatically switch tracking, even if the correct object appears later or is hidden for a while.
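
The switching behaviour described here can be illustrated with a hand-written rule. This is only a sketch of the idea, not how the CME works internally (the real module is learned): `track_with_switching`, the `margin` threshold, and the per-object text-match scores are all hypothetical. The tracker keeps a remembered score for its current target and switches only when a visible candidate beats that score by a clear margin.

```python
def track_with_switching(frames, margin=0.1):
    # frames: list of dicts mapping object id -> text-match score in [0, 1]
    #         for the objects visible in that frame (hypothetical scores).
    # margin: hypothetical threshold a candidate must exceed to take over.
    tracked, remembered = None, 0.0
    history = []
    for scores in frames:
        if tracked in scores:
            # Refresh the memory while the current target is visible.
            remembered = scores[tracked]
        if scores:
            best = max(scores, key=scores.get)
            # Switch if nothing is tracked yet, or if a candidate matches
            # the prompt clearly better than the remembered target.
            if tracked is None or scores[best] > remembered + margin:
                tracked, remembered = best, scores[best]
        # If the target is hidden and no candidate is clearly better,
        # the remembered target is kept.
        history.append(tracked)
    return history
```

For example, `track_with_switching([{"cat_a": 0.6}, {"cat_a": 0.55, "cat_b": 0.9}, {}, {"cat_b": 0.85}])` starts on `cat_a`, switches to `cat_b` when it appears with a much better score, and keeps `cat_b` through the empty frame.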

SkalskiP @ CVPR2025 🇺🇸 • 1 year ago

full poster explaining text understanding, temporal modeling, tracking bias, and much more

Nigam Arora • 1 year ago

In 2025, how much more money can you make in the stock market by following the most accurate analysis?

Team Reagent • 1 year ago

Can it do this in real-time?

Team Reagent • 1 year ago

Oh we are DEFINITELY taking a look at this! Wow!!
