Martin Ziqiao Ma

@ziqiao_ma • 4,219 subscribers

technical staff @thinkymachines; less technical stuff @aclmentorship; phd @umich; views are my own

Videos

Do Vision-Language Models (VLMs) represent space, and how? Spatial terms like "left" or "right" may not be enough to match images with spatial descriptions, because we often overlook the different frames of reference (FoR) used by speakers and listeners. See Figure 1 for examples!

Introducing the COnsistent Multilingual Frame Of Reference Test (COMFORT), an evaluation protocol to assess the spatial reasoning capabilities of VLMs. COMFORT includes systematically designed datasets and metrics that evaluate not only model performance but also deeper linguistic competence, specifically the spatial knowledge encoded in the models' internal representations. Find out more in the video teaser!

Almost all VLMs prefer the egocentric relative FoR with reflected transform, similar to English. Yet we reveal significant shortcomings of VLMs. Notably, the models (1) exhibit poor robustness and consistency, (2) lack the flexibility to accommodate multiple FoRs, and (3) fail to adhere to language-specific or culture-specific conventions in cross-lingual tests, as English tends to dominate other languages.

A shortened version will appear at the Pluralistic Alignment Workshop at #NeurIPS2024. It seems that the arXiv moderators put it on hold and are eager to give it a thorough read first 🤣! So here is the Paper/Code/Data:

This collaboration turned out to be amazing, jointly led by Brian Zheyuan Zhang, @Hu_FY_, and Jayjun Lee, with so many contributions and insights from Freda Shi, Parisa Kordjamshidi, and Michigan SLED Lab.

With a growing effort to align vision-language models with human cognitive intuitions, we call for more attention to the ambiguous nature and cross-cultural diversity of spatial reasoning!

Martin Ziqiao Ma

35,542 views • 1 year ago
