🚀 Introducing T* and LV-Haystack, our latest leap forward in VLMs for long video understanding!
🧩 Lightweight plugin: T* boosts LLaVA-OV-72B (56→62%) and GPT-4o (50→53%)!
⚡ Fast inference: 34.9s → 10.4s latency, 691 → 170 TFLOPs vs. SOTA.
📚 Large-scale dataset: 400 hours of video + 15,000 samples.
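As a rough sanity check, the headline figures in the post can be turned into ratios. This is only arithmetic on the numbers quoted above. The variable names are illustrative, and the post does not say which benchmark or hardware the figures come from.

```python
# Headline figures as quoted in the post; variable names are illustrative.
latency_before_s, latency_after_s = 34.9, 10.4
tflops_before, tflops_after = 691, 170
accuracy = {"LLaVA-OV-72B": (56, 62), "GPT-4o": (50, 53)}  # percent, without and with T*

# Ratios implied by the quoted before/after pairs.
latency_speedup = latency_before_s / latency_after_s
compute_reduction = tflops_before / tflops_after
gains = {model: after - before for model, (before, after) in accuracy.items()}

print(f"latency speedup: {latency_speedup:.2f}x")
print(f"compute reduction: {compute_reduction:.2f}x")
for model, gain in gains.items():
    print(f"{model}: +{gain} percentage points")
```

Read this way, the post claims roughly a 3.4x latency reduction and a 4.1x compute reduction, with accuracy gains of 6 and 3 percentage points.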

Zihan "Zenus" Wang

23,015 subscribers

49,615 views • 1 year ago • via X (Twitter)

Science & Technology, Education

10 Comments

Zihan Wang - on RAGEN • 1 year ago

Explore more: 📄 paper: 🤗 dataset: 🌐 website: 🤖 demo: 🛠️ github:

Zihan Wang - on RAGEN • 1 year ago

What’s T* ✨? A temporal search framework that locates the key frames relevant to a question. It can be plugged into any VLM! T* turns temporal search ⏱️ into spatial search 📍 with lightweight object detectors + VLM visual grounding. Strong performance even w/o training VLMs! 2/
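To make the idea of key-frame search concrete, here is a toy coarse-to-fine sampler. It is not the authors' T* algorithm: the function name, parameters, and the smooth synthetic scorer are all assumptions made for illustration. A real system would replace `score_frame` with an object detector's confidence for the queried object, and real detector scores are far less well-behaved than this one.

```python
from typing import Callable, List, Tuple


def coarse_to_fine_search(
    num_frames: int,
    score_frame: Callable[[int], float],  # hypothetical relevance scorer
    rounds: int = 3,
    samples_per_round: int = 8,
    top_k: int = 1,
) -> Tuple[List[int], int]:
    """Sample frames sparsely, then resample around the best-scoring frame."""
    scores = {}
    lo, hi = 0, num_frames - 1
    for _ in range(rounds):
        step = max(1, (hi - lo) // samples_per_round)
        candidates = list(range(lo, hi + 1, step))
        for idx in candidates:
            if idx not in scores:  # never score the same frame twice
                scores[idx] = score_frame(idx)
        best = max(candidates, key=lambda i: scores[i])
        # Narrow the window to the neighbourhood of the current best frame.
        lo, hi = max(0, best - step), min(num_frames - 1, best + step)
    ranked = sorted(scores, key=lambda i: scores[i], reverse=True)
    return ranked[:top_k], len(scores)


# Synthetic example: a 36,000-frame video whose relevant moment is frame 7321.
TARGET = 7321
top_frames, frames_scored = coarse_to_fine_search(
    36_000, lambda i: 1.0 / (1 + abs(i - TARGET))
)
print(top_frames, frames_scored)
```

The point of the sketch is the budget: only a few dozen of the 36,000 frames are scored, which is the kind of saving a search-based approach aims for compared with processing every frame.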

Zihan Wang - on RAGEN • 1 year ago

What’s LV-Haystack? A large-scale video understanding dataset:
🎞️ 400 hours of video
❓ 15,000 QA pairs
🔑 30,000 key frame labels from 45,000,000 frames
We explore disentangled evaluation of temporal search & video understanding with 6 fine-grained search metrics. 3/
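The dataset figures above imply a few averages that show how sparse the key frames are. The calculation below uses only the rounded numbers from the post, so the results are approximate, and the variable names are illustrative.

```python
# Dataset figures as quoted in the post (rounded headline numbers).
hours_of_video = 400
qa_pairs = 15_000
key_frame_labels = 30_000
total_frames = 45_000_000

# Averages implied by those figures.
avg_fps = total_frames / (hours_of_video * 3600)
labels_per_question = key_frame_labels / qa_pairs
frames_per_label = total_frames / key_frame_labels

print(f"average frame rate: {avg_fps:.2f} fps")
print(f"key frame labels per QA pair: {labels_per_question:.1f}")
print(f"frames per key frame label: {frames_per_label:.0f}")
```

On these numbers there is about one labeled key frame per 1,500 frames, which is where the "haystack" in the name comes from.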

Zihan Wang - on RAGEN • 1 year ago

T* and LV-Haystack are the result of a joint effort of @StanfordHAI @StanfordAILab @StanfordSVL @NorthwesternEng @LTIatCMU. Huge shoutout to our incredible team for making this possible! We’d love your feedback! Reply or email us with questions, ideas, or use cases✨ 4/

Zihan Wang - on RAGEN • 1 year ago

h/t to all collaborators: @jinhuiye @wzihanw @Haosen_sun @keshigeyan @DuranteZane @CristbalEyzagu2 @anabellaisaro and our amazing mentors: @ManlingLi_ @jiajunwu_cs @drfeifei @eadeli @jcniebles @ybisk! This is just the beginning—excited for the future of video understanding and what’s next! ✨5/

Electe • 1 year ago

@StanfordAILab @StanfordAILab, exciting advancements in video understanding.

@profitleap • 1 year ago

@StanfordAILab Exciting advancements in VLMs. Looking forward to seeing the impact they will have on video understanding. 🔍

Zihan Wang - on RAGEN • 1 year ago

Great question! The VLMs we are using cannot accept audio input for now, and we think this line of research may be exciting to explore in the near future:)

Hexa Circuit • 1 year ago

It's essential to examine how this new integration will enhance semantic retrieval in lengthy multimedia datasets. Looks promising for advanced analytics.

Related Videos

🚀 Introducing MiniCPM-V 4.5 8B: pushing the boundary of multimodal AI! ～ SOTA VL Capability: Surpasses GPT-4o, Gemini 2.0 Pro, Qwen2.5-VL 72B on OpenCompass! ～ "Eagle Eye" Video: 96x visual token compression for high refresh rate and long video understanding ～ Controllable Hybrid Fast/Deep Thinking ～ Strong OCR & Doc Parsing: Surpasses GPT-4o & Gemini 2.5 on OmniDocBench Get ready for the future of multimodal AI 👉 Huggingface｜ Github｜ Gradio｜ #AI #MiniCM #GPT #Gemini #OpenBMB #ArtificialIntelligence #MachineLearning

OpenBMB

25,068 views • 10 months ago

Introducing FoundationMotion. A large-scale, video-derived motion annotation dataset & auto-labeling pipeline + advanced models for motion understanding. Fully open-source: code, datasets, and models, free to use and build on. Understanding motion is core to physical reasoning, yet today’s leading models still struggle with simple spatial actions like “turn right” or “move up” or “flip the toast” - mainly due to the lack of large, fine-grained motion datasets. We present FoundationMotion, a fully automated pipeline that: • detects & tracks objects in videos • extracts trajectories • uses LLMs + frames to generate rich motion captions & QA pairs → creating large-scale, high-quality motion datasets at scale. After fine-tuning the open-source models Qwen and NVILA on our annotations, these models now outperform the closed-source Gemini-3-Flash and GPT-5.1 on spatial understanding tasks across autonomous driving, robotics, and everyday scenarios. 📜Paper: 🌐Webpage: 💻 Code: 🕸️Model: 📊 Dataset: 👉 Interactive Demo: Let’s move research forward together. FoundationMotion is also referred to as Wolf V2 🐺, the second chapter in the Wolf series:

Boyi Li

66,999 views • 6 months ago

Video understanding isn't just recognizing —it demands reasoning across thousands of frames. Meet Long-RL🚀 Highlights: 🧠 Dataset: LongVideo-Reason — 52K QAs with reasoning. ⚡ System: MR-SP - 2.1× faster RL for long videos. 📈 Scalability: Hour-long videos (3,600 frames) RL on a single node (8×A100s). 🖼️📝🎵 RL training for video, text, audio — works with VILA, Qwen series, and image/video generation models 🎨🎬 📄 Paper: 🎥 Demo: 💻 Code:

Yukang Chen

31,652 views • 11 months ago

PointOdyssey: A Large-Scale Synthetic Dataset for Long-Term Point Tracking paper page: introduce PointOdyssey, a large-scale synthetic dataset, and data generation framework, for the training and evaluation of long-term fine-grained tracking algorithms. Our goal is to advance the state-of-the-art by placing emphasis on long videos with naturalistic motion. Toward the goal of naturalism, we animate deformable characters using real-world motion capture data, we build 3D scenes to match the motion capture environments, and we render camera viewpoints using trajectories mined via structure-from-motion on real videos. We create combinatorial diversity by randomizing character appearance, motion profiles, materials, lighting, 3D assets, and atmospheric effects. Our dataset currently includes 104 videos, averaging 2,000 frames long, with orders of magnitude more correspondence annotations than prior work. We show that existing methods can be trained from scratch in our dataset and outperform the published variants. Finally, we introduce modifications to the PIPs point tracking method, greatly widening its temporal receptive field, which improves its performance on PointOdyssey as well as on two real-world benchmarks.

AK

122,533 views • 2 years ago

LLaVA-3D A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness Recent advancements in Large Multimodal Models (LMMs) have greatly enhanced their proficiency in 2D visual understanding tasks, enabling them to effectively process and understand images and videos. However, the development of LMMs with 3D-awareness for 3D scene understanding has been hindered by the lack of large-scale 3D vision-language datasets and powerful 3D encoders. In this paper, we introduce a simple yet effective framework called LLaVA-3D. Leveraging the strong 2D understanding priors from LLaVA, our LLaVA-3D efficiently adapts LLaVA for 3D scene understanding without compromising 2D understanding capabilities. To achieve this, we employ a simple yet effective representation, 3D Patch, which connects 2D CLIP patch features with their corresponding positions in 3D space. By integrating the 3D Patches into 2D LMMs and employing joint 2D and 3D vision-language instruction tuning, we establish a unified architecture for both 2D image understanding and 3D scene understanding. Experimental results show that LLaVA-3D converges 3.5x faster than existing 3D LMMs when trained on 3D vision-language datasets. Moreover, LLaVA-3D not only achieves state-of-the-art performance across various 3D tasks but also maintains comparable 2D image understanding and vision-language conversation capabilities with LLaVA.

AK

41,713 views • 1 year ago

Excited to share our new work: StreamingVLM! 🚀 We tackle a major challenge for Vision-Language Models (VLMs): understanding infinite video streams in real-time without latency blowing up or running out of memory. Paper: Code:

Guangxuan Xiao

91,987 views • 8 months ago

First start of FSD in The Netherlands 🇳🇱 TLDR: long video, short quiz

Kees Roelandschap

865,940 views • 2 months ago

🚀 We open-sourced LongLive — interactive, real-time long-video generation. 👥Generates video in real time as users enter text prompts. ⚡️20.7 FPS on a single H100,⏱️up to 240s per clip. 🎬Fine-tunes SOTA short-video models (e.g., Wan) into long-video generators. 🌍One step closer to World Models. All code for training & inference, model weights, demo page, and videos released! Paper: Code: Model: Demo Page: Introduction Video:

Yukang Chen

11,835 views • 9 months ago

🚀Introducing LLaVA Lightning: Train a lite, multimodal GPT-4 with just $40 in 3 hours! With our newly introduced datasets and the efficient design of LLaVA, you can now turbocharge your language model with image reasoning capabilities, in an incredibly affordable way.🧵

Haotian Liu

302,319 views • 3 years ago

We’re dropping Gemini Omni: our first step towards a model that can create anything from anything - starting with video. It combines Gemini’s intelligence with our generative media systems - representing a leap forward in world understanding, multimodality, and editing 🧵

Google DeepMind

1,582,787 views • 1 month ago

Introducing Long Zhao, a Senior Research Scientist at Google, who worked to build VideoPrism: A Foundational Visual Encoder for Video Understanding. Read the blog to explore innovations in video understanding tasks and more →

Google AI

129,768 views • 2 years ago

🚀Introducing MiniCPM-V 2.6! 🔥 1、Surpassing GPT-4V in single image, multi-image and video understanding 📸🎥 2、Outperforms GPT-4o mini and Gemini 1.5 on OpenCompass 🏆 3、Real-time video analysis on iPad 📱💨 Try out the best on-device multimodal LLM here！ 👑 GitHub： Huggingface： #MLLM #MiniCPM

OpenBMB

196,281 views • 1 year ago

🤯ByteDance just Open Sourced UI-TARS - 2 SOTA models (7B & 72B) + a PC/MacOS app to control your computer with vLMS And they are not messing around, beating GPT-4o and Claude, SOTA across 10 benchmarks Will you be installing this on your pc?

Alex Volkov

69,738 views • 1 year ago

🚀 Excited to release LongLive 2.0! 🎬 An end-to-end infrastructure for long video generation, with FP4 and parallelism at the core of both training and inference. ⚡45.7 FPS generation speed on 5B model⚡ ✨ LongLive 2.0 supports real-video training, few-step distillation, multi-shot training/inference, sequence-parallel acceleration, NVFP4 KV cache, and async VAE decoding deployment. 🧩 To our knowledge, this is the first open-source 4-bit long video generation infra that covers both training and inference. 🙌 Welcome to check it out, try it, and share feedback! 🔗 Code: 📰 Paper: 🎥 Demo: #LongVideoGeneration #VideoGeneration #Realtime #AIInfra #EfficientAI #FP4 #Parallel #NVIDIA

Yukang Chen

58,301 views • 1 month ago

Introducing Veo 3.1 and Veo 3.1 Fast, our latest state of the art video models with: - richer native audio - better cinematic styles - reference to video - transitions between frames - video extensions

Logan Kilpatrick

255,293 views • 8 months ago

1. Breaking down + understanding a long video I uploaded the entire NBA dunk contest from last night and asked which dunk had the highest score. Gemini 1.5 was incredibly able to find the specific perfect 50 dunk and details from just its long context video understanding!

Rowan Cheung

101,150 views • 2 years ago

Together with the Ego4D consortium, we recently released Ego-Exo4D: A diverse, large-scale multi-modal, multi-view, video dataset and benchmark. Learn more about the work ➡️ Access the dataset ➡️ This work could help to advance AI models' understanding of complex human skills & enable new applications for AR systems, robotics & more.

AI at Meta

78,059 views • 2 years ago

introducing Egocentric-1M. the largest egocentric video dataset in the world, and our next step in building the internet for physical AI.

Eddy Xu

336,734 views • 2 months ago

🚨 BREAKING: GPT-4 image recognition already has a new competitor. Open-sourced and completely free to use. Introducing LLaVA: Large Language and Vision Assistant. I compared the viral parking space photo on GPT-4 Vision to LLaVa, and it worked flawlessly (see video).

Rowan Cheung

681,544 views • 2 years ago

Video breakdown: CVPR 2026 oral paper Shattering the efficiency bottleneck of long video understanding. Watch as we deconstruct the core innovations behind SpecTemp.

FutureLivingLab

13,558 views • 1 month ago