Humans use pointing to communicate plans intuitively. Compared to language, pointing gives more precise guidance to robot behaviors. Can we teach a robot how to point like humans? Introducing RoboPoint 🤖👉, an open-source VLM instruction-tuned to point. Check out our new work:
64,377 views • 2 years ago • via X (Twitter)
Comments: 10

1/8🧵: Language is not precise enough to guide robot behavior. Even the most powerful VLMs like @chatgpt4o have limited accuracy in real robot execution, especially when the task uses spatial relations to identify objects or refer to object-free locations, e.g. “place the cup next to the plate”.

2/8🧵: Studies in developmental psychology [Tomasello et al. (2007)] show that infants and adults alike share information about their environment via pointing, leading to intuitive plans and precise actions. RoboPoint mimics human pointing by generating actions as points in the RGB image.
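
To make "actions as points in the RGB image" concrete, here is a minimal sketch of turning a model's text answer into pixel targets. It assumes the answer contains a list of normalized (x, y) tuples in [0, 1]; that format and the helper name `parse_points` are assumptions for illustration, not a confirmed description of RoboPoint's output.

```python
import re


def parse_points(answer: str, width: int, height: int) -> list[tuple[int, int]]:
    """Convert normalized (x, y) pairs found in `answer` to pixel coordinates.

    Assumption: points are written like "(0.42, 0.17)" with values in [0, 1].
    Pairs outside that range (or with a minus sign) are skipped.
    """
    number = r"([0-9]*\.?[0-9]+)"
    pairs = re.findall(r"\(\s*" + number + r"\s*,\s*" + number + r"\s*\)", answer)
    points = []
    for xs, ys in pairs:
        x, y = float(xs), float(ys)
        if 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0:
            # Clamp so that a value of exactly 1.0 still maps to a valid pixel index.
            px = min(int(x * width), width - 1)
            py = min(int(y * height), height - 1)
            points.append((px, py))
    return points


if __name__ == "__main__":
    # Hypothetical answer string for a 640x480 image.
    print(parse_points("[(0.5, 0.25), (1.0, 1.0)]", 640, 480))
```

A downstream controller could then pair each pixel with a depth reading to get a 3D target, but that step depends on the robot setup and is not shown here.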

3/8🧵: As a result, RoboPoint generates precise action guidance. Its generic action space applies to multiple downstream tasks including manipulation, navigation and AR. RoboPoint outperforms Qwen-VL, GPT-4V, and PIVOT by 30.5% on average success rate across all tasks.

4/8🧵: RoboPoint is instruction-tuned on a mix of real VQA data and procedurally generated robotic data. Surprisingly, by adding synthetic data with templated language, the resulting model's performance improves on real images with natural language commands.

5/8🧵: RoboPoint outperforms state-of-the-art VLMs with visual prompting including @chatgpt4o, Qwen-VL, LLaVA-NeXT, SpaceVLM by 21.8% in the accuracy of predicting spatial affordance. It generalizes to relation types unseen during training such as “in the middle”.
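
One plausible way to score "accuracy of predicting spatial affordance" is the fraction of predicted points that land inside a ground-truth target region. The sketch below assumes that metric, a boolean mask indexed as `mask[y][x]`, and the hypothetical function name `points_in_mask_accuracy`; the thread itself does not spell out the exact evaluation protocol.

```python
def points_in_mask_accuracy(points, mask):
    """Return the fraction of (x, y) pixel points that fall on True cells of `mask`.

    Assumptions: `mask` is a non-empty list of equal-length rows indexed as
    mask[y][x]; points outside the image count as misses; an empty prediction
    list scores 0.0.
    """
    if not points:
        return 0.0
    height, width = len(mask), len(mask[0])
    hits = sum(
        1
        for x, y in points
        if 0 <= x < width and 0 <= y < height and mask[y][x]
    )
    return hits / len(points)


if __name__ == "__main__":
    # Toy 3x4 mask: the target region is the top-right 2x2 block.
    mask = [
        [False, False, True, True],
        [False, False, True, True],
        [False, False, False, False],
    ]
    predicted = [(2, 0), (3, 1), (0, 0), (5, 5)]
    print(points_in_mask_accuracy(predicted, mask))
```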

6/8🧵: RoboPoint generates consistent predictions with a moving camera. Its predictions are equivariant for view-dependent relations like “to the right” and invariant for view-independent relations like “in between”.

7/8🧵: Exciting concurrent works (SpatialRGPT @anjjei , SpatialVLM @BoyuanChen0 , PIVOT @xf1280 , MOKA @fangchenliu_ ) highlight the trend of enhancing spatial reasoning in robotic VLMs. RoboPoint adds to this by exploring precise affordance prediction through instruction-tuning.

8/8🧵: This great work was led by @TonyWentaoYuan , assisted by me and many other collaborators from @nvidia @allen_ai @uwcse

Lastly, both of us will be at @CVPR #CVPR2024! So come and talk to us👋

An important ability


