phd @kaist_ai | ex @NVIDIAAI @GoogleAI @NYU_Courant
Can MLLMs actually track what's happening in a video? Introducing VSTAT 🎯, our new benchmark for visual state tracking. The tasks are simple: count cups, read typed words, count page flips. Humans solve them easily. MLLMs don't. 🧵 [1/11]