Jianwei Yang

@jw2yang4ai • 3,979 subscribers

MTS @xAI. Core contributor of Project Florence, Phi-3V, Omniparser; (Co-)Inventor of FocalNet, SEEM, SoM, DeepStack (in Qwen3VL) and Magma.

Videos

Accepted by #CVPR2023! X-Decoder is the FIRST generalist decoder that supports all segmentation tasks (ins/sem/pano/ref) in OPEN VOCABULARY, both inter- AND intra-image VL tasks, and even helps instruct image inpainting/editing! New demo below and more at

51,930 views • 3 years ago

🔥 In Magma, we talked a lot about spatial/temporal intelligence beyond verbal intelligence, as advocated by Dr. Fei-Fei Li. So how should we interpret it? Today I am happy to announce a new demo, Magma-Gaming: 👉 Rather than asking LLMs to write game code, we go further and ask the model to PLAY the game. A simple game like moving to a target in a 2D grid still requires precise action grounding and planning capability, yet it challenges the most advanced VLMs and even the GPT-4o-mini model. Magma, born with stronger spatial understanding and reasoning ability, significantly outperforms its counterparts and achieves much higher scores in a zero-shot manner. This result pinpoints the huge potential of building multimodal agentic models endowed with both verbal and spatial intelligence! Also, I believe this simple demo gives you a better hint of what spatial intelligence is and why it is important!

17,940 views • 1 year ago

📢 Join us tomorrow morning at our CVPR 2025 poster session (#340, ExHall D, 10:30am–12:30pm) to chat about Project Magma 👉 This is a big team effort to build a multimodal agentic model capable of understanding and acting in both digital and physical environments—just like how we interact with the world every day. 🚀 Even more exciting: we demonstrate the scaling potential of agent pretraining on large-scale human instructional videos through our Set-of-Mark (SoM) and Trace-of-Mark (ToM), showcasing strong zero-shot performance in: 1) multimodal image/video understanding, 2) UI navigation, 3) Real-world robot manipulation and even 4) Gaming! We've received encouraging feedback over the past few days—and this is only the beginning. A small step forward, with exciting things ahead!

13,112 views • 1 year ago
