🚀 Introducing T* and LV-Haystack, our latest leap forward in VLMs for long video understanding!
🧩 Lightweight plugin: T* boosts LLaVA-OV-72B (56% → 62%) and GPT-4o (50% → 53%)!
⚡ Fast inference: latency drops from 34.9s to 10.4s and compute from 691 to 170 TFLOPs vs. SOTA.
📚 Large-scale dataset: 400 hours of video + 15,000 samples.
49,615 views • 1 year ago • via X (Twitter)
10 Comments

Explore more: 📄 paper: 🤗 dataset: 🌐 website: 🤖 demo: 🛠️ github:

What’s T* ✨? A temporal search framework that locates the key frames relevant to a question. It can be plugged into any VLM! T* turns temporal search ⏱️ into spatial search 📍 using lightweight object detectors plus VLM visual grounding. Strong performance even without training the VLM! 2/
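To make the idea of key-frame search concrete, here is a minimal sketch of a generic coarse-to-fine temporal search. It is not the T* algorithm and does not reproduce its temporal-to-spatial conversion; it only illustrates how a handful of frames can be located without scoring every frame. The names `temporal_search`, `score_frame`, and `toy_scorer` are hypothetical, and `score_frame` stands in for whatever relevance signal a detector or VLM grounding step would provide.

```python
from typing import Callable, Dict, List, Tuple


def temporal_search(
    num_frames: int,
    score_frame: Callable[[int], float],  # hypothetical relevance scorer
    stride: int = 64,
    keep: int = 2,
    top_k: int = 4,
) -> Tuple[List[int], int]:
    """Coarse-to-fine search: score a sparse grid, then refine around the best frames."""
    scores: Dict[int, float] = {}
    candidates = list(range(0, num_frames, stride))  # coarse uniform grid
    while True:
        for i in candidates:
            if i not in scores:  # never score the same frame twice
                scores[i] = score_frame(i)
        if stride == 1:
            break
        # Keep the highest-scoring frames seen so far (ties broken by index).
        best = sorted(scores, key=lambda i: (-scores[i], i))[:keep]
        stride = max(1, stride // 2)
        # Probe the neighbours of each kept frame at the finer stride.
        candidates = [
            j for b in best for j in (b - stride, b + stride) if 0 <= j < num_frames
        ]
    ranked = sorted(scores, key=lambda i: (-scores[i], i))
    return sorted(ranked[:top_k]), len(scores)


def toy_scorer(i: int) -> float:
    """Toy stand-in for a detector: relevance peaks at frame 700."""
    return max(0.0, 1.0 - abs(i - 700) / 100)


key_frames, frames_scored = temporal_search(1000, toy_scorer)
print(key_frames, frames_scored)
```

In this toy setup the search reaches the peak frame while scoring only a small fraction of the 1,000 frames. A real scorer would be noisy and multi-modal, so `keep` and the starting `stride` would need tuning.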

What’s LV-Haystack? A large-scale video understanding dataset:
🎞️ 400 hours of video
❓ 15,000 QA pairs
🔑 30,000 key frame labels from 45,000,000 frames
We explore disentangled evaluation of temporal search & video understanding with 6 fine-grained search metrics. 3/

T* and LV-Haystack are the result of a joint effort of @StanfordHAI @StanfordAILab @StanfordSVL @NorthwesternEng @LTIatCMU. Huge shoutout to our incredible team for making this possible! We’d love your feedback! Reply or email us with questions, ideas, or use cases✨ 4/

h/t to all collaborators: @jinhuiye @wzihanw @Haosen_sun @keshigeyan @DuranteZane @CristbalEyzagu2 @anabellaisaro and our amazing mentors: @ManlingLi_ @jiajunwu_cs @drfeifei @eadeli @jcniebles @ybisk! This is just the beginning—excited for the future of video understanding and what’s next! ✨5/


@StanfordAILab Exciting advancements in video understanding.

@StanfordAILab Exciting advancements in VLMs. Looking forward to seeing the impact they will have on video understanding. 🔍

Great question! The VLMs we are using cannot accept audio input for now, and we think this line of research would be exciting to explore in the near future :)

It's essential to examine how this new integration will enhance semantic retrieval in lengthy multimedia datasets. Looks promising for advanced analytics.
