Introducing the Describe Anything Model (DAM), a powerful Multimodal LLM that generates detailed descriptions for user-specified regions in images or videos using points, boxes, scribbles, or masks. Open-source code, models, demo, data, and benchmark at:

34,975 views • 1 year ago • via X (Twitter)

Comments: 15

Yin Cui • 1 year ago

The Detailed Localized Captioning (DLC) task generates rich, context-aware descriptions of specific regions, focusing on fine details like texture, color, shape, and distinctive features, unlike standard captioning, which broadly summarizes the whole scene.

Yin Cui • 1 year ago

DLC extends naturally to videos by describing how a specified region's appearance and context change over time. Models must track the target across frames, capturing evolving attributes, interactions, and subtle transformations.

Yin Cui • 1 year ago

Compared to prior works, the descriptions from our Describe Anything Model (DAM) are more detailed and accurate.

Yin Cui • 1 year ago

Our architecture uses a "Focal Prompt" to provide both the full image and a zoomed-in view of the target region, producing detailed, accurate captions that reflect both the bigger picture and the smallest nuances.
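To make the focal prompt idea concrete, here is a minimal sketch of how the zoomed-in view could be derived from a region mask. It uses plain nested lists instead of real image tensors, and the function names and the `context` expansion factor are assumptions for illustration, not DAM's actual preprocessing.

```python
def mask_bbox(mask):
    """Return (top, left, bottom, right) of the nonzero region; bottom/right are exclusive."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    if not rows:
        raise ValueError("mask is empty")
    return rows[0], cols[0], rows[-1] + 1, cols[-1] + 1


def focal_box(mask, context=1.0):
    """Grow the mask's bounding box by `context` times its size per axis
    (split evenly between both sides), clipped to the image bounds.
    The default of 1.0 is an assumed value, not one taken from the thread."""
    height, width = len(mask), len(mask[0])
    top, left, bottom, right = mask_bbox(mask)
    pad_y = int((bottom - top) * context / 2)
    pad_x = int((right - left) * context / 2)
    return (max(0, top - pad_y), max(0, left - pad_x),
            min(height, bottom + pad_y), min(width, right + pad_x))


def crop(grid, box):
    """Slice a 2D nested list to the given (top, left, bottom, right) box."""
    top, left, bottom, right = box
    return [row[left:right] for row in grid[top:bottom]]


def focal_prompt(image, mask, context=1.0):
    """Bundle the full view and the zoomed-in view of the target region."""
    box = focal_box(mask, context)
    return {
        "full_image": image,
        "full_mask": mask,
        "focal_image": crop(image, box),
        "focal_mask": crop(mask, box),
    }
```

Keeping some surrounding context in the crop, rather than cropping tightly to the mask, is one plausible way to let the zoomed-in view stay interpretable on its own.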

Yin Cui • 1 year ago

We introduce a localized vision backbone that integrates global and focal features. Images and masks are aligned spatially, and gated cross-attention layers fuse detailed local cues with global context. New parameters are initialized to zero, preserving pre-trained capabilities.
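The zero-initialization point can be illustrated with a small sketch. It assumes a tanh-gated residual (a common pattern for gated cross-attention adapters), assumes that focal tokens query the global tokens, and omits the learned projection matrices entirely; none of these details are confirmed by the thread.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Single-head scaled dot-product attention with identity projections."""
    scale = math.sqrt(len(queries[0]))
    attended = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in context])
        attended.append([sum(w * v[i] for w, v in zip(weights, context))
                         for i in range(len(q))])
    return attended


def gated_cross_attention(focal_tokens, global_tokens, gate=0.0):
    """Residual update: focal + tanh(gate) * attention(focal -> global).
    With gate = 0.0 the layer is an identity on the focal tokens."""
    attended = cross_attention(focal_tokens, global_tokens)
    g = math.tanh(gate)
    return [[x + g * a for x, a in zip(token, att)]
            for token, att in zip(focal_tokens, attended)]
```

With the gate at zero the new layer passes the focal tokens through unchanged, so the pre-trained backbone behaves exactly as before until training moves the gate away from zero.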

Yin Cui • 1 year ago

Because existing datasets lack detailed localized descriptions, we devised DLC-SDP, a two-stage data pipeline based on semi-supervised learning:
1. We use a VLM to expand short class labels from segmentation datasets into rich descriptions.
2. We apply self-training, a form of semi-supervised learning, on unlabeled images, using our model to generate and refine new captions.
This scalable approach builds large, high-quality training data without relying on extensive human annotation.
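The two stages can be sketched as plain functions. Everything here is hypothetical: `describe_with_vlm`, `caption_model`, `refine`, and `keep` are placeholder callables standing in for real models, and the `keep` filter in particular is an assumed quality check that the thread does not describe.

```python
def stage1_expand_labels(labeled_regions, describe_with_vlm):
    """Stage 1: turn (image, mask, class_label) triples into
    (image, mask, caption) triples by asking a VLM to elaborate the label."""
    return [(image, mask, describe_with_vlm(image, mask, label))
            for image, mask, label in labeled_regions]


def stage2_self_train(unlabeled_regions, caption_model, refine, keep):
    """Stage 2: pseudo-label unlabeled (image, mask) pairs with the current
    model, refine each caption, and keep only those that pass the filter."""
    pseudo_labeled = []
    for image, mask in unlabeled_regions:
        caption = refine(caption_model(image, mask))
        if keep(caption):
            pseudo_labeled.append((image, mask, caption))
    return pseudo_labeled
```

In a setup like this, the outputs of both stages would be merged into one training set and the model retrained on it, which is the usual shape of a self-training loop.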

Yin Cui • 1 year ago

To evaluate the task, we build DLC-Bench, a benchmark that uses an LLM-based judge to evaluate region-based descriptions by assessing correct details and the absence of errors. This offers a more precise metric for measuring Detailed Localized Captioning (DLC) performance.
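As a rough illustration of judge-based scoring, the sketch below rewards a description for mentioning expected details and penalizes it for asserting known-wrong ones. The `judge` callable, the statement lists, and the averaging formula are all assumptions; the actual DLC-Bench protocol may score differently.

```python
def score_description(description, positives, negatives, judge):
    """Score one region description.

    positives: details a correct description is expected to mention.
    negatives: statements that would be errors if the description made them.
    judge(description, statement) -> bool: True if the description asserts
    the statement. In practice this would be an LLM call; here it is any callable.
    """
    hits = sum(1 for p in positives if judge(description, p))
    errors = sum(1 for n in negatives if judge(description, n))
    positive = hits / len(positives) if positives else 0.0
    negative = 1.0 - errors / len(negatives) if negatives else 1.0
    return {"positive": positive, "negative": negative,
            "average": (positive + negative) / 2}


def keyword_judge(description, statement):
    """Toy stand-in for an LLM judge: case-insensitive substring match."""
    return statement.lower() in description.lower()
```

Scoring correct details and errors separately makes it visible when a model gains detail only by hallucinating more.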

Yin Cui • 1 year ago

The table summarizes the advantages of our proposed DAM, DLC-SDP, and DLC-Bench compared to prior practices.

Yin Cui • 1 year ago

Our model outperforms previous API-only models, open-source models, and region-specific VLMs on the detailed localized captioning (DLC) task.

Yin Cui • 1 year ago

Try DAM-3B on our Hugging Face interactive demo:

Yin Cui • 1 year ago

Contributors: Our amazing intern @LongTonyLian and Yifan Ding, @GeYunhao, @Sifei30488L, @hanna_mao, @Boyiliee, @drmapavone, @liu_mingyu, @trevordarrell, @YalaTweets. Excited to see how the community uses DAM to push the boundaries of localized image/video understanding.

Omar Alama عمر الأعمى • 1 year ago

Playing around with the demo and it seems very impressive! Can already see people using it in the semantic scene graph space. Wonder what's under the hood for efficiency. Can DAM compute features once and allow querying of different parts without recomputing? Kind of like SAM?

Yin Cui • 1 year ago

Thanks! That's a great question! We already pre-compute and cache image features for the segmentation masks, as in SAM. For our LLM backbone, it's possible to save image region tokens in the KV cache, or even to pre-compute all the text responses for the same text prompt. But that takes a lot of time (each response needs an inference pass of a 3B LLM), so it's not a good fit for an interactive demo.
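The compute-once, query-many pattern discussed here can be sketched as a small cache around an image encoder. The class and callables below are hypothetical stand-ins, not the demo's code: only the image-level features are cached, while the per-region step still runs on every query.

```python
class RegionCaptioner:
    """Cache per-image features so repeated region queries reuse them."""

    def __init__(self, encode_image, describe_region):
        self._encode_image = encode_image        # expensive, once per image
        self._describe_region = describe_region  # runs for every region query
        self._features = {}
        self.encoder_calls = 0

    def features(self, image_id, image):
        """Return cached features for image_id, encoding on first use only."""
        if image_id not in self._features:
            self._features[image_id] = self._encode_image(image)
            self.encoder_calls += 1
        return self._features[image_id]

    def describe(self, image_id, image, region):
        """Describe one region using the cached image features."""
        return self._describe_region(self.features(image_id, image), region)
```

This only amortizes the image encoder. The text generation for each region is not cached here, which matches the point that every response still costs one LLM inference.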

Omar Alama عمر الأعمى • 1 year ago

Got it. Still pretty cool!

Rainmaker • 2 years ago

In this free Substack post I share code for several machine learning models and engage in hyperparameter tuning that yields a model that delivers superior returns in the Gold market.