
Zirui "Colin" Wang
@zwcolin • 1,414 subscribers
Research Intern @MetaAI; CS PhD Student @Berkeley_AI and @BerkeleySky; prev @Princeton_NLP, @HDSIUCSD, @VoioInc multimodal interaction
Videos

👀Humans compare images by looking back and forth. Many open-weight VLMs encode each image independently, and defer comparison to the LM. We introduce SVE: Stateful Visual Encoders for Vision-Language Models, where the visual encoder itself becomes change-aware. 🌐Project: 📰Paper: 💻Code: 1/n
Zirui "Colin" Wang • 50,646 views • 12 days ago

🎮 We release VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents (w/ Junyi Zhang Jiaxin Ge) 🌐 With 17 environments across multiple domains, we show systematically the brittleness of VLMs in visual interaction, and what training leads to. 🧵[1/8]
Zirui "Colin" Wang • 40,334 views • 4 months ago

🤨 Are Multimodal Large Language Models really as 𝐠𝐨𝐨𝐝 at 𝐜𝐡𝐚𝐫𝐭 𝐮𝐧𝐝𝐞𝐫𝐬𝐭𝐚𝐧𝐝𝐢𝐧𝐠 as existing benchmarks such as ChartQA suggest? 🚫 Our ℂ𝕙𝕒𝕣𝕏𝕚𝕧 benchmark suggests NO! 🥇Humans achieve ✨𝟖𝟎+% correctness. 🥈Sonnet 3.5 outperforms GPT-4o by 10+ points, reaching 🌟𝟔𝟎% correctness. 🥉Open-weight models are capped at ⭐𝟑𝟐% correctness. 🪜 Leaderboard: 📜 Preprint: 📊 Charxiv is ✨𝟏𝟎𝟎% handcrafted with rigorous human validation, and it reveals substantial gaps among Multimodal Large Language Models and humans in chart understanding. 🎥👇 80 second video (🎶sound on!). 🧶 1/6
Zirui "Colin" Wang • 48,195 views • 1 year ago