Introducing the Describe Anything Model (DAM), a powerful Multimodal LLM that generates detailed descriptions for user-specified regions in images or videos using points, boxes, scribbles, or masks. Open-source code, models, demo, data, and benchmark at:

34,975 views • 1 year ago • via X (Twitter)

15 comments

Yin Cui • 1 year ago

The Detailed Localized Captioning (DLC) task is to generate rich, context-aware descriptions of specific regions, focusing on fine details such as texture, color, shape, and distinctive features. This differs from image-level captioning, which broadly summarizes the whole scene.

Yin Cui • 1 year ago

DLC extends naturally to videos by describing how a specified region's appearance and context change over time. Models must track the target across frames, capturing evolving attributes, interactions, and subtle transformations.

Yin Cui • 1 year ago

Compared to prior works, the descriptions from our Describe Anything Model (DAM) are more detailed and accurate.

Yin Cui • 1 year ago

Our architecture uses a "Focal Prompt" to provide both the full image and a zoomed-in view of the target region, producing detailed, accurate captions that reflect both the bigger picture and the smallest nuances.
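As a rough illustration of the focal prompt idea, the sketch below derives a zoomed-in crop box from a binary mask: it takes the mask's bounding box and expands it by a context margin. The names `focal_crop_box` and `crop`, the list-of-rows mask format, and the `context` value are assumptions made for this example; the thread does not say how DAM computes its crop.

```python
def focal_crop_box(mask, context=0.5):
    """Return (top, left, bottom, right) of a crop around the masked region.

    mask: list of rows holding 0/1 values (hypothetical input format).
    context: fraction of the box size added as margin on each side (assumed).
    """
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, v in enumerate(row) if v]
    if not rows:
        raise ValueError("mask is empty")

    # Tight bounding box around the mask (bottom/right are exclusive).
    top, bottom = min(rows), max(rows) + 1
    left, right = min(cols), max(cols) + 1

    # Expand by the context margin, then clamp to the image bounds.
    pad_y = round((bottom - top) * context)
    pad_x = round((right - left) * context)
    height, width = len(mask), len(mask[0])
    return (max(0, top - pad_y), max(0, left - pad_x),
            min(height, bottom + pad_y), min(width, right + pad_x))


def crop(grid, box):
    """Cut the same box out of an image or a mask stored as a list of rows."""
    top, left, bottom, right = box
    return [row[left:right] for row in grid[top:bottom]]
```

In this sketch the model would receive the full image with its mask and the cropped image with the cropped mask, so the focal view keeps some surrounding context rather than only the masked pixels.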

Yin Cui • 1 year ago

We introduce a localized vision backbone that integrates global and focal features. Images and masks are aligned spatially, and gated cross-attention layers fuse detailed local cues with global context. New parameters are initialized to zero, preserving pre-trained capabilities.
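To make the zero-initialization point concrete, here is a minimal sketch of a gated cross-attention step in plain Python. It assumes focal tokens act as queries over global tokens, uses a single head with no learned projections, and applies a scalar tanh gate. All of these are simplifications chosen for illustration, not details confirmed by the thread.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Single-head dot-product attention; keys and values are both `context`."""
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in context]
        weights = softmax(scores)
        attended.append([sum(w * v[i] for w, v in zip(weights, context))
                         for i in range(dim)])
    return attended


def gated_cross_attention(focal, global_tokens, gate=0.0):
    """Residual update focal + tanh(gate) * attention(focal, global_tokens).

    With gate initialized to 0 (assumed scalar gate), tanh(0) == 0 and the
    layer returns its input unchanged, so the pre-trained path is preserved.
    """
    attended = cross_attention(focal, global_tokens)
    g = math.tanh(gate)
    return [[x + g * a for x, a in zip(x_row, a_row)]
            for x_row, a_row in zip(focal, attended)]
```

At initialization the new layer is an identity on the focal features; as training moves the gate away from zero, global context is mixed in gradually.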

Yin Cui • 1 year ago

Because existing datasets lack detailed localized descriptions, we devised a two-stage Semi-supervised-learning-based Data Pipeline, DLC-SDP: 1. We use a VLM to expand short class labels from segmentation datasets into rich descriptions. 2. We apply self-training, a form of semi-supervised learning, on unlabeled images, using our model to generate and refine new captions. This scalable approach builds large, high-quality training data without relying on extensive human annotation.
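The two stages can be sketched as plain functions. Everything here is hypothetical scaffolding: `expand_label` stands in for the VLM call, `fit` stands in for training a captioner, and filtering pseudo-captions by a confidence threshold is an assumption, since the thread only says the model generates and refines new captions.

```python
def stage_one(labeled_regions, expand_label):
    """Stage 1: turn (region, short_label) pairs into (region, description).

    expand_label(region, label) -> str is a hypothetical VLM wrapper.
    """
    return [(region, expand_label(region, label))
            for region, label in labeled_regions]


def stage_two(train_set, unlabeled_regions, fit, min_confidence=0.8):
    """Stage 2: one round of self-training on unlabeled regions.

    fit(train_set) returns a captioner: region -> (caption, confidence).
    Only captions at or above min_confidence (assumed filter) are kept.
    """
    captioner = fit(train_set)
    pseudo_labeled = []
    for region in unlabeled_regions:
        caption, confidence = captioner(region)
        if confidence >= min_confidence:
            pseudo_labeled.append((region, caption))
    return train_set + pseudo_labeled
```

Calling `stage_two` repeatedly on its own output would give an iterative self-training loop, with each round's captioner trained on the enlarged set.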

Yin Cui • 1 year ago

To evaluate the task, we build DLC-Bench, a benchmark that uses an LLM-based judge to evaluate region-based descriptions by assessing correct details and the absence of errors. This offers a more precise metric for measuring Detailed Localized Captioning (DLC) performance.
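One way to picture a judge-based score is shown below. This is an illustrative scheme, not DLC-Bench's actual protocol: the `judge` callable is a hypothetical stand-in for an LLM call, and averaging a "correct details" score with an "absence of errors" score is an assumption.

```python
def score_description(description, expected_details, forbidden_details, judge):
    """Score one region description with a judge.

    judge(description, detail) -> bool says whether the description states
    the detail. In practice this would be an LLM call (hypothetical here).
    """
    hits = sum(judge(description, d) for d in expected_details)
    errors = sum(judge(description, d) for d in forbidden_details)

    # Fraction of expected details that the description covers.
    positive = hits / len(expected_details) if expected_details else 0.0
    # Fraction of known-wrong details that the description avoids.
    negative = 1.0 - errors / len(forbidden_details) if forbidden_details else 1.0

    return {"positive": positive,
            "negative": negative,
            "average": (positive + negative) / 2}
```

For a quick dry run, a substring check such as `lambda text, detail: detail in text` can play the judge, although a real judge would need to handle paraphrase.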

Yin Cui • 1 year ago

The table summarizes the advantages of our proposed DAM, DLC-SDP, and DLC-Bench compared to prior practices.

Yin Cui • 1 year ago

Our model outperforms previous API-only models, open-source models, and region-specific VLMs on the detailed localized captioning (DLC) task.

Yin Cui • 1 year ago

Try DAM-3B on our Hugging Face interactive demo:

Yin Cui • 1 year ago

Contributors: our amazing intern @LongTonyLian and Yifan Ding, @GeYunhao, @Sifei30488L, @hanna_mao, @Boyiliee, @drmapavone, @liu_mingyu, @trevordarrell, @YalaTweets. Excited to see how the community uses DAM to push the boundaries of localized image/video understanding.

Omar Alama عمر الأعمى • 1 year ago

Playing around with the demo, and it seems very impressive! I can already see people using it in the semantic scene graph space. I wonder what's under the hood for efficiency. Can DAM compute features once and allow querying of different parts without recomputing, kind of like SAM?

Yin Cui • 1 year ago

Thanks! That's a great question! We already pre-compute and cache image features for segmentation masks, as in SAM. For our LLM backbone, it's possible to save image region tokens in the KV cache, or even pre-compute all the text responses using the same text prompt. But doing this takes a lot of time (each one needs an inference pass of a 3B LLM), so it's not practical for an interactive demo.
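The SAM-style caching mentioned here can be sketched as follows: the expensive image encoder runs once per image, and each new prompt only pays for the lightweight decoder. The class and function names and the `encode`/`decode` callables are hypothetical; this is not the demo's actual code.

```python
class CachedEncoder:
    """Cache image features so repeated queries skip the encoder."""

    def __init__(self, encode):
        self._encode = encode   # expensive encoder (hypothetical callable)
        self._cache = {}        # image_id -> features
        self.encode_calls = 0   # counter, kept only to show the saving

    def features(self, image_id, image):
        # Run the encoder only the first time an image is seen.
        if image_id not in self._cache:
            self._cache[image_id] = self._encode(image)
            self.encode_calls += 1
        return self._cache[image_id]


def segment(encoder, decode, image_id, image, prompt):
    """Answer a point/box prompt using cached features and a cheap decoder."""
    return decode(encoder.features(image_id, image), prompt)
```

The same pattern does not transfer directly to the captioning step described in the reply, because each region description still requires a full LLM inference.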

Omar Alama عمر الأعمى • 1 year ago

Got it. Pretty cool still!

