Zirui "Colin" Wang

@zwcolin • 1,439 subscribers

Research Intern @MetaAI; CS PhD Student @Berkeley_AI and @BerkeleySky; prev @Princeton_NLP, @HDSIUCSD, @VoioInc multimodal interaction

Videos

👀Humans compare images by looking back and forth. Many open-weight VLMs encode each image independently, and defer comparison to the LM. We introduce SVE: Stateful Visual Encoders for Vision-Language Models, where the visual encoder itself becomes change-aware. 🌐Project: 📰Paper: 💻Code: 1/n

Zirui "Colin" Wang

51,850 views • 1 month ago

🎮 We release VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents (w/ Junyi Zhang @aomaru_21490) 🌐 With 17 environments across multiple domains, we systematically show the brittleness of VLMs in visual interaction, and what training leads to. 🧵[1/8]

Zirui "Colin" Wang

40,493 views • 6 months ago

🤨 Are Multimodal Large Language Models really as 𝐠𝐨𝐨𝐝 at 𝐜𝐡𝐚𝐫𝐭 𝐮𝐧𝐝𝐞𝐫𝐬𝐭𝐚𝐧𝐝𝐢𝐧𝐠 as existing benchmarks such as ChartQA suggest? 🚫 Our ℂ𝕙𝕒𝕣𝕏𝕚𝕧 benchmark suggests NO! 🥇Humans achieve ✨𝟖𝟎+% correctness. 🥈Sonnet 3.5 outperforms GPT-4o by 10+ points, reaching 🌟𝟔𝟎% correctness. 🥉Open-weight models are capped at ⭐𝟑𝟐% correctness. 🪜 Leaderboard: 📜 Preprint: 📊 CharXiv is ✨𝟏𝟎𝟎% handcrafted with rigorous human validation, and it reveals substantial gaps between Multimodal Large Language Models and humans in chart understanding. 🎥👇 80-second video (🎶sound on!). 🧶 1/6

Zirui "Colin" Wang

48,221 views • 2 years ago
