Introducing the Describe Anything Model (DAM), a powerful Multimodal LLM that generates detailed descriptions for user-specified regions in images or videos using points, boxes, scribbles, or masks. Open-source code, models, demo, data, and benchmark at:

Yin Cui

6,919 subscribers

34,975 views • 1 year ago • via X (Twitter)

Education • News & Politics • Science & Technology

15 Comments

Yin Cui • 1 year ago

The Detailed Localized Captioning (DLC) task is to generate rich, context-aware descriptions of specific regions, focusing on fine details such as texture, color, shape, and distinctive features. This differs from standard captioning, which broadly summarizes the whole scene.

Yin Cui • 1 year ago

DLC extends naturally to videos by describing how a specified region's appearance and context change over time. Models must track the target across frames, capturing evolving attributes, interactions, and subtle transformations.

Yin Cui • 1 year ago

Compared to prior works, the descriptions from our Describe Anything Model (DAM) are more detailed and accurate.

Yin Cui • 1 year ago

Our architecture uses a "Focal Prompt" to provide both the full image and a zoomed-in view of the target region, producing detailed, accurate captions that reflect both the bigger picture and the smallest nuances.
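A minimal sketch of how a focal prompt could be assembled, assuming the region is given as a binary mask and the zoomed-in view is the mask's bounding box padded with some surrounding context. The helper names (`mask_bbox`, `focal_box`, `focal_prompt`) and the `context` ratio are hypothetical and not taken from the released code.

```python
from typing import Dict, List, Tuple

Grid = List[List[int]]          # image or binary mask as rows of ints
Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), end-exclusive


def mask_bbox(mask: Grid) -> Box:
    """Tight box around the nonzero pixels of a binary mask."""
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, v in enumerate(row) if v]
    if not ys:
        raise ValueError("mask is empty")
    return min(xs), min(ys), max(xs) + 1, max(ys) + 1


def focal_box(mask: Grid, context: float = 0.5) -> Box:
    """Pad the tight box by `context` times its size per side, clipped to the image."""
    h, w = len(mask), len(mask[0])
    x0, y0, x1, y1 = mask_bbox(mask)
    pad_x = int(round((x1 - x0) * context))
    pad_y = int(round((y1 - y0) * context))
    return max(0, x0 - pad_x), max(0, y0 - pad_y), min(w, x1 + pad_x), min(h, y1 + pad_y)


def crop(grid: Grid, box: Box) -> Grid:
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in grid[y0:y1]]


def focal_prompt(image: Grid, mask: Grid, context: float = 0.5) -> Dict[str, object]:
    """Bundle the full view and the zoomed-in view, each paired with its mask."""
    box = focal_box(mask, context)
    return {
        "full_image": image,
        "full_mask": mask,
        "focal_image": crop(image, box),
        "focal_mask": crop(mask, box),
        "box": box,
    }
```

Keeping some padding around the target means the zoomed-in view still shows the immediate surroundings, while the full image carries the wider scene.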

Yin Cui • 1 year ago

We introduce a localized vision backbone that integrates global and focal features. Images and masks are aligned spatially, and gated cross-attention layers fuse detailed local cues with global context. New parameters are initialized to zero, preserving pre-trained capabilities.
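A minimal sketch of why zero initialization preserves pre-trained behavior, assuming a residual gate of the form `x + tanh(gate) * CrossAttn(x, context)`. The tanh form, the single attention head, and the choice of focal tokens as queries over global tokens are assumptions made for illustration; the post only states that the layers are gated and that new parameters start at zero.

```python
import math
from typing import List

Vec = List[float]


def softmax(xs: Vec) -> Vec:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query: Vec, keys: List[Vec], values: List[Vec]) -> Vec:
    """Single-head scaled dot-product attention of one query over key/value vectors."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def gated_fuse(focal_tokens: List[Vec], global_tokens: List[Vec], gate: float) -> List[Vec]:
    """Add gated global context to each focal token (assumed direction of attention)."""
    g = math.tanh(gate)  # gate == 0.0 gives g == 0.0, so the layer is an identity
    fused = []
    for tok in focal_tokens:
        ctx = cross_attend(tok, global_tokens, global_tokens)
        fused.append([t + g * c for t, c in zip(tok, ctx)])
    return fused


focal = [[1.0, 0.0], [0.0, 1.0]]
glob = [[0.5, 0.5], [2.0, -1.0], [0.0, 3.0]]
print(gated_fuse(focal, glob, gate=0.0) == focal)  # identity at initialization
```

With the gate at zero the fused tokens equal the inputs exactly, so the backbone starts from its pre-trained behavior and the gate can open gradually during training.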

Yin Cui • 1 year ago

Because existing datasets lack detailed localized descriptions, we devised a two-stage Semi-supervised learning-based Data Pipeline, DLC-SDP:
1. We use a VLM to expand short class labels from segmentation datasets into rich descriptions.
2. We apply self-training, a form of semi-supervised learning, on unlabeled images, using our model to generate and refine new captions.
This scalable approach builds large, high-quality training data without relying on extensive human annotation.
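The two stages can be sketched roughly as follows. This is an illustration only: `Region`, `Sample`, the prompt wording, the `vlm` and `captioner` callables, and the score-threshold filter in stage 2 are all hypothetical stand-ins, since the post does not say how generated captions are refined or filtered.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class Region:
    image_id: str
    mask_id: str
    label: str = ""  # short class label; empty for unlabeled regions


@dataclass
class Sample:
    region: Region
    caption: str


def stage1_expand_labels(
    labeled: Iterable[Region],
    vlm: Callable[[Region, str], str],
) -> List[Sample]:
    """Stage 1: ask a VLM to turn each short class label into a detailed description."""
    samples = []
    for region in labeled:
        prompt = f"Describe the masked {region.label} in detail."
        samples.append(Sample(region, vlm(region, prompt)))
    return samples


def stage2_self_train(
    unlabeled: Iterable[Region],
    captioner: Callable[[Region], Tuple[str, float]],
    min_score: float = 0.8,
) -> List[Sample]:
    """Stage 2: caption unlabeled regions with the current model and keep the
    captions whose score passes a threshold (the filter is an assumption)."""
    kept = []
    for region in unlabeled:
        caption, score = captioner(region)
        if score >= min_score:
            kept.append(Sample(region, caption))
    return kept
```

The output of both stages would then be merged into one training set, and stage 2 can be repeated as the captioner improves.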

Yin Cui • 1 year ago

To evaluate the task, we build DLC-Bench, a benchmark that uses an LLM-based judge to evaluate region-based descriptions by assessing correct details and the absence of errors. This offers a more precise metric for measuring Detailed Localized Captioning (DLC) performance.
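One way such a judge-based score could be aggregated is sketched below, assuming each benchmark region comes with a list of details a good description should mention and a list of statements it should not make. The equal weighting of the two parts and the `keyword_judge` stand-in (a substring check in place of an LLM call) are assumptions, not the benchmark's actual protocol.

```python
from typing import Callable, Dict, List

# (description, statement) -> True if the description asserts the statement
Judge = Callable[[str, str], bool]


def _rate(hits: int, total: int) -> float:
    return hits / total if total else 0.0


def score_description(
    description: str,
    positives: List[str],
    negatives: List[str],
    judge: Judge,
) -> Dict[str, float]:
    """Reward correct details that are present and errors that are absent."""
    pos = _rate(sum(judge(description, p) for p in positives), len(positives))
    neg = _rate(sum(not judge(description, n) for n in negatives), len(negatives))
    return {"positive": pos, "negative": neg, "overall": (pos + neg) / 2}


def keyword_judge(description: str, statement: str) -> bool:
    """Toy stand-in for an LLM judge: case-insensitive substring match."""
    return statement.lower() in description.lower()


result = score_description(
    "A red ceramic mug with a chipped handle.",
    positives=["red", "ceramic"],
    negatives=["blue", "chipped"],
    judge=keyword_judge,
)
print(result)
```

Scoring the two parts separately makes it visible whether a model loses points by omitting details or by stating things that are wrong.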

Yin Cui • 1 year ago

The table summarizes the advantages of our proposed DAM, DLC-SDP, and DLC-Bench compared to prior practices.

Yin Cui • 1 year ago

Our model outperforms previous API-only models, open-source models, and region-specific VLMs on the detailed localized captioning (DLC) task.

Yin Cui • 1 year ago

Try DAM-3B on our Hugging Face interactive demo:

Yin Cui • 1 year ago

Contributors: Our amazing intern @LongTonyLian and Yifan Ding, @GeYunhao, @Sifei30488L, @hanna_mao, @Boyiliee, @drmapavone, @liu_mingyu, @trevordarrell, @YalaTweets. Excited to see how the community uses DAM to push the boundaries of localized image/video understanding.

Omar Alama عمر الأعمى • 1 year ago

Playing around with the demo, and it seems very impressive! I can already see people using it in the semantic scene graph space. I wonder what's under the hood for efficiency. Can DAM compute features once and allow querying of different parts without recomputing, kind of like SAM?

Yin Cui • 1 year ago

Thanks! That's a great question! We already pre-compute and cache image features for segmentation masks, as in SAM. For our LLM backbone, it's possible to save image region tokens in the KV cache, or even to pre-compute all the text responses for a fixed text prompt. But doing this takes a lot of time (each region needs an inference pass of a 3B LLM), so it's not a good fit for an interactive demo.
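The trade-off in this reply can be sketched as a cache around the part that depends only on the image, with the per-region step still run on every query. `CachedRegionCaptioner` and its two callables are hypothetical; in the demo the per-region step would be the 3B LLM pass, which is the part that cannot be cheaply shared across regions.

```python
from typing import Any, Callable, Dict, Hashable


class CachedRegionCaptioner:
    """Encode each image once, then answer any number of region queries."""

    def __init__(
        self,
        encode_image: Callable[[Any], Any],
        describe_region: Callable[[Any, Any], str],
    ) -> None:
        self._encode = encode_image
        self._describe = describe_region
        self._features: Dict[Hashable, Any] = {}
        self.encoder_calls = 0  # counts cache misses, for inspection

    def features(self, image_id: Hashable, image: Any) -> Any:
        # Image-level features are computed on the first request only.
        if image_id not in self._features:
            self._features[image_id] = self._encode(image)
            self.encoder_calls += 1
        return self._features[image_id]

    def caption(self, image_id: Hashable, image: Any, region: Any) -> str:
        # The region-specific step runs on every call; it is not cached here.
        return self._describe(self.features(image_id, image), region)


captioner = CachedRegionCaptioner(
    encode_image=lambda img: sum(img),                      # toy encoder
    describe_region=lambda feats, region: f"{feats}:{region}",  # toy decoder
)
captioner.caption("img-1", [1, 2, 3], "box-a")
captioner.caption("img-1", [1, 2, 3], "box-b")
print(captioner.encoder_calls)
```

Caching the final text per region would also be possible, but only for regions known in advance, which does not fit free-form clicks in an interactive demo.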

Omar Alama عمر الأعمى • 1 year ago

Got it. Pretty cool still!

Related Videos

🧑‍🍳 Experiment with a multimodal LLM in WebAR. The AI identifies ingredients, sets up bounding boxes, and prepares them for AR. I'm also using open-source models to generate recipes and corresponding images. #threejs #8thwall #generativeai

Stijn Spanhove

61,019 views • 2 years ago

can AI write engaging news that people can trust? introducing ✨Data2Story: a data journalist agent. give it raw data, it generates a verifiable, multimodal article. 🔍verifiable: every claim is evidence-grounded, traces back to data, code, or a cited source. 🔮multimodal: the article is a generative UI: images, videos, audio, interactive charts. not just readable, but trustworthy and playable. 🧵1/N

Kevin Lin

25,544 views • 11 days ago

Introducing LogoCreator! An open source logo generator that creates professional logos in seconds using Flux Pro 1.1 on Together AI. 100% free and open source. Demo + code:

Hassan

349,250 views • 1 year ago

Current 3D generative models are slow and low quality. We present GRM, a large-scale model that reconstructs 3D Gaussians in 0.1s and generates high-quality 3D assets from text or single images in a few seconds. Demo: 1/4

Gordon Wetzstein

19,189 views • 2 years ago

InstantSplat++ is now open source. It is a lightweight library that connects foundation models (VGGT, MASt3R, MAP-Anything, etc.) with the Gaussian splatting family. Given uncalibrated images, it optimizes a 3D scene in a few seconds. Try the demo and code here:

Zhiwen(Aaron) Fan

31,835 views • 4 months ago

⚡️ Introducing Bolt3D ⚡️ Bolt3D generates interactive 3D scenes in less than 7 seconds on a single GPU from one or more images. It features a latent diffusion model that *directly* generates 3D Gaussians of seen and unseen regions, without any test time optimization. 🧵👇 (1/9)

Stan Szymanowicz

125,848 views • 1 year ago

Segment Anything Model 2 (SAM 2) is a foundation model from Meta FAIR for promptable visual segmentation in images & videos. Available now for anyone to build on for free, open source under an Apache license. Try the demo ➡️

AI at Meta

97,733 views • 1 year ago

LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models paper page: github: Recent advancements in text-to-image generation with diffusion models have yielded remarkable results synthesizing highly realistic and diverse images. However, these models still encounter difficulties when generating images from prompts that demand spatial or common sense reasoning. We propose to equip diffusion models with enhanced reasoning capabilities by using off-the-shelf pretrained large language models (LLMs) in a novel two-stage generation process. First, we adapt an LLM to be a text-guided layout generator through in-context learning. When provided with an image prompt, an LLM outputs a scene layout in the form of bounding boxes along with corresponding individual descriptions. Second, we steer a diffusion model with a novel controller to generate images conditioned on the layout. Both stages utilize frozen pretrained models without any LLM or diffusion model parameter optimization. We validate the superiority of our design by demonstrating its ability to outperform the base diffusion model in accurately generating images according to prompts that necessitate both language and spatial reasoning. Additionally, our method naturally allows dialog-based scene specification and is able to handle prompts in a language that is not well-supported by the underlying diffusion model.

AK

83,657 views • 2 years ago

Meet #DBRX: a general-purpose LLM that sets a new standard for efficient open source models. Use the DBRX model in your RAG apps or use the DBRX design to build your own custom LLMs and improve the quality of your GenAI applications.

Databricks

327,704 views • 2 years ago

Meet Molmo: a family of open, state-of-the-art multimodal AI models. Our best model outperforms proprietary systems, using 1000x less data. Molmo doesn't just understand multimodal data—it acts on it, enabling rich interactions in both the physical and virtual worlds. Try it for yourself:

Ai2

515,379 views • 1 year ago

I built an open source data analyst agent! Upload any CSV, ask a question, and get it answered with statistics or a nice chart. Launching in 24 hours. 100% free & open source. Under the hood, here's how it works: 1. User uploads a CSV and asks a question. 2. The app uses Together Code Interpreter to spin up a VM and uploads the CSV onto it. 3. I have an LLM (Qwen 3 Coder) write code to interpret the CSV using `pandas`, then solve the question the user asked. 4. Together Code Interpreter runs the code in a secure environment & returns a result. These results can be text (some kind of stats analysis) or a chart.

Hassan

27,086 views • 9 months ago

Multimodal Reasoning AI Agents are here with Gemini 2.0 Flash Thinking I built a multimodal AI agent that can reason and understand images using gemini flash reasoning LLM. 100% Opensource Code with step-by-step tutorial.

Shubham Saboo

36,596 views • 1 year ago

Introducing Open Deep Research! A fully open-source Deep Research tool that: • writes comprehensive reports • does multi-hop search and reasoning • generates cover images & pod-casts! We’re releasing everything: evaluation dataset, code and blog.🔥 Example output report👇

Together AI

66,541 views • 1 year ago

Looking for a way to automate note-taking for your Zoom calls? I built a Zoom Meeting Summarizer using open source models and software. 📁 File Handling: Reads transcript files in WebVTT format 📋 Meeting Notes Extraction: Utilizes Phi-3 with ollama and to extract detailed meeting notes from transcripts. 🧹 Data Cleaning: Cleans and validates the extracted JSON using a second LLM. 🗂️ JSON and Markdown Output: Saves the extracted information in both JSON and Markdown formats. Code in comments.

metamike

16,168 views • 2 years ago

Microsoft has launched a powerful new data analysis tool! Introducing Data Formulator, a 100% open-source LLM-powered, no-code tool that transforms data in a snap and creates stunning visualizations. Key features include: 🤖 AI-powered data transformation 🖱️ Interactive drag-and-drop UI for visualizations 💬 Seamless blend of UI & natural language inputs But that’s not all: You can even create charts beyond your initial dataset. Data Formulator automatically identifies extra computation needs, generates fields for you, and outputs the final visualization. Find the GitHub repo in the next tweet! _____ Find me → Akshay 🚀 ✔️ For more insights and tutorials on AI and Machine Learning.

Akshay 🚀

280,449 views • 1 year ago

This is @OpenCode—an open source Claude Code alternative that: * has a built-in web UI * can use any LLM model, including free ones! In this demo, I'm running OpenCode inside a Daytona cloud sandbox.

James Murdza

69,669 views • 5 months ago

Introducing `AutoRL` 📈 The world's simplest way to train a task-specific LLM with RL. *Just write a SENTENCE describing the model you want.* A chain of AI systems will generate data + rubrics and train a model for you. Powered by ART, it's open source. Link in thread:

Matt Shumer

150,107 views • 11 months ago

🌟 Create anything in 3D! 🌟 Introducing CAT3D: a new method that generates high-fidelity 3D scenes from any number of real or generated images in one minute, powered by multi-view diffusion models. w/ lovely coauthors Aleksander Holynski, Ben Poole and an amazing team!

Ruiqi Gao

152,867 views • 2 years ago

Introducing Ψ₀ ( — an open foundation model for universal humanoid loco-manipulation. 🏆 Outperforms GR00T N1.6 by 40%+ overall success rate 📉 Uses only ~10% of the pre-training data 📦 Fully open-source: model, data, code, and deployment pipeline 1/10

Yue Wang

19,190 views • 3 months ago