Introducing the Describe Anything Model (DAM), a powerful Multimodal LLM that generates detailed descriptions for user-specified regions in images or videos using points, boxes, scribbles, or masks. Open-source code, models, demo, data, and benchmark at:

34,975 views • 1 year ago • via X (Twitter)

15 comments

Yin Cui • 1 year ago

The Detailed Localized Captioning (DLC) task generates rich, context-aware descriptions of specific regions, focusing on fine details like texture, color, shape, and distinctive features. This is unlike standard captioning, which broadly summarizes the whole scene.

Yin Cui • 1 year ago

DLC extends naturally to videos by describing how a specified region's appearance and context change over time. Models must track the target across frames, capturing evolving attributes, interactions, and subtle transformations.

Yin Cui • 1 year ago

Compared to prior works, the descriptions from our Describe Anything Model (DAM) are more detailed and accurate.

Yin Cui • 1 year ago

Our architecture uses a "Focal Prompt" to provide both the full image and a zoomed-in view of the target region, producing detailed, accurate captions that reflect both the bigger picture and the smallest nuances.
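
As a rough illustration of the "Focal Prompt" idea, here is a minimal sketch that computes the two views: the full image and a zoomed-in crop around the target region. The function names and the `expand` factor are assumptions made for this example, not values taken from the post or from the DAM code.

```python
def focal_crop_box(bbox, image_size, expand=3.0):
    """Return a crop box around `bbox`, enlarged by `expand` and clamped to the image.

    bbox is (left, top, right, bottom) in pixels; image_size is (width, height).
    The expansion factor is an illustrative assumption.
    """
    x0, y0, x1, y1 = bbox
    width, height = image_size
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    half_w = (x1 - x0) * expand / 2
    half_h = (y1 - y0) * expand / 2
    left = max(0, int(cx - half_w))
    top = max(0, int(cy - half_h))
    right = min(width, int(cx + half_w))
    bottom = min(height, int(cy + half_h))
    return left, top, right, bottom


def build_focal_prompt(bbox, image_size, expand=3.0):
    """Pair the full-image view with the zoomed-in view of the target region."""
    width, height = image_size
    return {
        "full_view": (0, 0, width, height),
        "focal_view": focal_crop_box(bbox, image_size, expand),
    }
```

Keeping some margin around the region is what lets the zoomed-in view retain nearby context while still resolving fine detail.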

Yin Cui • 1 year ago

We introduce a localized vision backbone that integrates global and focal features. Images and masks are aligned spatially, and gated cross-attention layers fuse detailed local cues with global context. New parameters are initialized to zero, preserving pre-trained capabilities.
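
To see why zero initialization preserves pre-trained behavior, here is a minimal pure-Python sketch of a gated cross-attention step. It assumes a scalar gate passed through `tanh`, local features acting as queries over global features, and identity projections; these are simplifications for illustration, and DAM's actual layers may differ.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Each query vector attends over all context vectors (no learned projections)."""
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, c)) / math.sqrt(dim) for c in context]
        weights = softmax(scores)
        attended.append(
            [sum(w * c[i] for w, c in zip(weights, context)) for i in range(dim)]
        )
    return attended


def gated_cross_attention(local_feats, global_feats, gate):
    """Add gated global context to local features as a residual.

    With gate == 0.0 (the zero initialization), tanh(gate) is 0 and the
    output equals local_feats exactly.
    """
    attended = cross_attention(local_feats, global_feats)
    scale = math.tanh(gate)
    return [
        [x + scale * a for x, a in zip(row, att_row)]
        for row, att_row in zip(local_feats, attended)
    ]
```

At initialization the new layer is a no-op, so the pre-trained backbone's outputs are unchanged; the gate then opens gradually during training.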

Yin Cui • 1 year ago

Because existing datasets lack detailed localized descriptions, we devised a two-stage Semi-supervised-learning-based Data Pipeline, DLC-SDP:
1. We use a VLM to expand short class labels from segmentation datasets into rich descriptions.
2. We apply self-training, a form of semi-supervised learning, on unlabeled images, using our model to generate and refine new captions.
This scalable approach builds large, high-quality training data without relying on extensive human annotation.
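
The two stages above can be sketched roughly as follows. All function names and the callables passed in (`describe_region`, `caption_model`, `refine`, `keep`) are hypothetical placeholders for this example; in particular, the `keep` filter is an assumption about how low-quality pseudo-labels might be discarded, not something stated in the post.

```python
def expand_labels(labeled_regions, describe_region):
    """Stage 1: expand short class labels into rich region descriptions.

    `describe_region(image, mask, label)` stands in for a VLM call.
    """
    return [
        {
            "image": r["image"],
            "mask": r["mask"],
            "caption": describe_region(r["image"], r["mask"], r["label"]),
        }
        for r in labeled_regions
    ]


def self_train(unlabeled_regions, caption_model, refine, keep):
    """Stage 2: caption unlabeled regions with the current model, then refine and filter."""
    pseudo_labeled = []
    for r in unlabeled_regions:
        draft = caption_model(r["image"], r["mask"])
        caption = refine(draft)
        if keep(caption):  # assumed quality filter on pseudo-labels
            pseudo_labeled.append(
                {"image": r["image"], "mask": r["mask"], "caption": caption}
            )
    return pseudo_labeled


def build_training_set(labeled, unlabeled, describe_region, caption_model, refine, keep):
    """Combine stage-1 and stage-2 outputs into one training set."""
    return expand_labels(labeled, describe_region) + self_train(
        unlabeled, caption_model, refine, keep
    )
```

Passing the models in as callables keeps the sketch independent of any particular VLM or captioning checkpoint.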

Yin Cui • 1 year ago

To evaluate the task, we build DLC-Bench, a benchmark that uses an LLM-based judge to evaluate region-based descriptions by assessing correct details and the absence of errors. This offers a more precise metric for measuring Detailed Localized Captioning (DLC) performance.
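
A judge-based score of this kind could look something like the sketch below, which rewards expected details that are present and errors that are absent. The scoring formula, the function name, and the `judge` callable are assumptions for illustration; the actual DLC-Bench protocol may be defined differently.

```python
def score_description(description, expected_details, known_errors, judge):
    """Score one region description with a judge callable.

    `judge(description, statement)` should return True when the description
    contains the statement; in practice this would be an LLM call.
    The equal weighting of the two parts is an illustrative assumption.
    """
    hits = sum(1 for d in expected_details if judge(description, d))
    clean = sum(1 for e in known_errors if not judge(description, e))
    positive = hits / len(expected_details) if expected_details else 0.0
    negative = clean / len(known_errors) if known_errors else 1.0
    return {
        "positive": positive,  # share of expected details mentioned
        "negative": negative,  # share of known errors avoided
        "overall": (positive + negative) / 2,
    }


def keyword_judge(description, statement):
    """Toy stand-in for an LLM judge: plain substring matching."""
    return statement.lower() in description.lower()
```

Scoring correct details and avoided errors separately makes it visible whether a model is losing points by being vague or by hallucinating.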

Yin Cui • 1 year ago

The table summarizes the advantages of our proposed DAM, DLC-SDP, and DLC-Bench compared to prior practices.

Yin Cui • 1 year ago

Our model outperforms previous API-only models, open-source models, and region-specific VLMs on the detailed localized captioning (DLC) task.

Yin Cui • 1 year ago

Try DAM-3B on our Hugging Face interactive demo:

Yin Cui • 1 year ago

Contributors: our amazing intern @LongTonyLian and Yifan Ding, @GeYunhao, @Sifei30488L, @hanna_mao, @Boyiliee, @drmapavone, @liu_mingyu, @trevordarrell, @YalaTweets. Excited to see how the community uses DAM to push the boundaries of localized image/video understanding.

Omar Alama عمر الأعمى • 1 year ago

Playing around with the demo and it seems very impressive! Can already see people using it in the semantic scene graph space. Wonder what's under the hood for efficiency. Can DAM compute features once and allow querying of different parts without recomputing? Kind of like SAM?

Yin Cui • 1 year ago

Thanks! That’s a great question! We have already pre-computed and cached image features for segmentation masks, as in SAM. For our LLM backbone, it’s possible to save image region tokens in the KV cache, or even to pre-compute all the text responses using the same text prompt. But doing this takes a lot of time (each one needs an inference pass of a 3B LLM), so it’s not a good fit for an interactive demo.
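
The "compute once, query many regions" pattern discussed here can be sketched as a small cache keyed by image. The class and the two callables are hypothetical stand-ins for this example; this is not how the DAM demo is implemented, only an illustration of the caching idea.

```python
class RegionCaptioner:
    """Cache per-image features so repeated region queries skip the encoder."""

    def __init__(self, encode_image, describe_region):
        # Both callables are placeholders for real models.
        self._encode_image = encode_image
        self._describe_region = describe_region
        self._features = {}
        self.encoder_calls = 0

    def features_for(self, image_id, image):
        """Run the expensive encoder only the first time an image is seen."""
        if image_id not in self._features:
            self._features[image_id] = self._encode_image(image)
            self.encoder_calls += 1
        return self._features[image_id]

    def describe(self, image_id, image, region):
        """Describe one region, reusing cached image features."""
        return self._describe_region(self.features_for(image_id, image), region)
```

Note that this only saves the image-encoding cost. As the reply points out, each new region still needs its own pass through the language model.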

Omar Alama عمر الأعمى • 1 year ago

Got it. Pretty cool still!

