Video wird geladen...
Video konnte nicht geladen werden
Do Vision-Language Models represent space, and how? Spatial terms like "left" or "right" may not be enough to match images with spatial descriptions, as we often overlook the different frames of reference (FoR) used by speakers and listeners. See Figure 1 for examples! Introducing the COnsistent Multilingual Frame Of... show more
35,542 Aufrufe • vor 1 Jahr •via X (Twitter)
1 Kommentare

Martin Ziqiao Mavor 1 Jahr
Interesting fact: Reasoning across multiple intrinsic frames of reference is quite challenging, even for GPT-4. I adapted Figure 2.5 from Levinson 2003 (Logical Inadequacies of the Intrinsic Frame of Reference) into a yes/no question format, and GPT-4 struggled with both.
