
Jianwei Yang
@jw2yang4ai • 3,979 subscribers
MTS @xAI. Core contributor of Project Florence, Phi-3V, Omniparser; (Co-)Inventor of FocalNet, SEEM, SoM, DeepStack (in Qwen3VL) and Magma.
Videos

🔥 In Magma, we talked a lot about spatial/temporal intelligence beyond verbal intelligence, as advocated by Dr. Fei-Fei Li. So how should we interpret it? Today I am happy to announce a new demo, Magma-Gaming: 👉 Rather than asking LLMs to write game code, we go further and ask the model to PLAY the game. A simple game like moving to a target in a 2D grid still requires precise action grounding and planning capability, yet it challenges the most advanced VLMs and even the GPT-4o-mini model. Magma, born with stronger spatial understanding and reasoning ability, significantly outperforms its counterparts and achieves much higher scores in a zero-shot manner. This result pinpoints the huge potential of building multimodal agentic models endowed with both verbal and spatial intelligence! Also, I believe this simple demo gives you a better hint of what spatial intelligence is and why it is important!
Jianwei Yang • 17,940 views • 1 year ago

📢 Join us tomorrow morning at our CVPR 2025 poster session (#340, ExHall D, 10:30am–12:30pm) to chat about Project Magma 👉 This is a big team effort to build a multimodal agentic model capable of understanding and acting in both digital and physical environments, just like how we interact with the world every day. 🚀 Even more exciting: we demonstrate the scaling potential of agent pretraining on large-scale human instructional videos through our Set-of-Mark (SoM) and Trace-of-Mark (ToM), showcasing strong zero-shot performance in: 1) multimodal image/video understanding, 2) UI navigation, 3) real-world robot manipulation, and even 4) gaming! We've received encouraging feedback over the past few days, and this is only the beginning. A small step forward, with exciting things ahead!
Jianwei Yang • 13,112 views • 11 months ago