Humans use pointing to communicate plans intuitively. Compared to language, pointing gives more precise guidance to robot behaviors. Can we teach a robot how to point like humans? Introducing RoboPoint 🤖👉, an open-source VLM instruction-tuned to point. Check out our new work:

64,377 views • 2 years ago • via X (Twitter)

10 Comments

Jiafei Duan • 2 years ago

1/8🧵: Language is not precise enough to guide robot behavior. Even the most powerful VLMs like @chatgpt4o have limited accuracy in real robot execution, especially when the task uses spatial relations to identify objects or refer to object-free locations, e.g. “place the cup next to the plate”.

Jiafei Duan • 2 years ago

2/8🧵: Studies in developmental psychology [Tomasello et al. (2007)] show that infants and adults alike share information about their environment by pointing, which leads to intuitive plans and precise actions. RoboPoint mimics human pointing by generating actions as points in the RGB image.
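
To make "actions as points in the RGB image" concrete, here is a minimal sketch of how a downstream system might consume such points. It assumes, purely for illustration, that the model's reply is a text list of normalized (x, y) tuples; the reply string and the helper names `parse_points` and `to_pixels` are hypothetical and not taken from the RoboPoint release.

```python
import ast


def parse_points(text):
    """Parse a string such as "[(0.25, 0.5), (0.75, 0.5)]" into (x, y) floats.

    The reply format is an assumption made for this sketch.
    """
    points = ast.literal_eval(text)
    return [(float(x), float(y)) for x, y in points]


def to_pixels(points, width, height):
    """Scale normalized [0, 1] points to pixel coordinates, clamped to the image."""
    pixels = []
    for x, y in points:
        px = min(max(int(round(x * width)), 0), width - 1)
        py = min(max(int(round(y * height)), 0), height - 1)
        pixels.append((px, py))
    return pixels


reply = "[(0.25, 0.5), (0.75, 0.5)]"  # hypothetical model reply
pixels = to_pixels(parse_points(reply), width=640, height=480)
```

With a registered depth image, each pixel could then be lifted to a 3D target for a manipulation or navigation planner; that step is left out here because it depends on the camera setup.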

Jiafei Duan • 2 years ago

3/8🧵: As a result, RoboPoint generates precise action guidance. Its generic action space applies to multiple downstream tasks including manipulation, navigation and AR. RoboPoint outperforms Qwen-VL, GPT-4V, and PIVOT by 30.5% on average success rate across all tasks.

Jiafei Duan • 2 years ago

4/8🧵: RoboPoint is instruction-tuned on a mix of real VQA data and procedurally generated robotic data. Surprisingly, by adding synthetic data with templated language, the resulting model's performance improves on real images with natural language commands.
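
As a rough illustration of what "procedurally generated data with templated language" can look like, the sketch below builds one question/answer pair from an object with a known bounding box. Everything in it is an assumption for illustration: the template wording, the `sample_example` helper, the normalized box format, and the `margin` parameter are hypothetical and do not describe the actual RoboPoint data pipeline.

```python
import random

# Hypothetical instruction templates, one per spatial relation.
TEMPLATES = {
    "left": "Find a few spots to the left of the {name}.",
    "right": "Find a few spots to the right of the {name}.",
}


def sample_example(obj, relation, n_points=3, margin=0.15, rng=None):
    """Build one templated example.

    obj: dict with 'name' and a normalized 'box' = (x0, y0, x1, y1).
    Target points are sampled from a strip beside the box.
    """
    rng = rng or random.Random()
    x0, y0, x1, y1 = obj["box"]
    if relation == "left":
        lo, hi = max(0.0, x0 - margin), x0
    elif relation == "right":
        lo, hi = x1, min(1.0, x1 + margin)
    else:
        raise ValueError(f"unsupported relation: {relation}")
    points = [
        (round(rng.uniform(lo, hi), 3), round(rng.uniform(y0, y1), 3))
        for _ in range(n_points)
    ]
    return {
        "question": TEMPLATES[relation].format(name=obj["name"]),
        "points": points,
        "answer": str(points),
    }


plate = {"name": "plate", "box": (0.4, 0.4, 0.6, 0.6)}
example = sample_example(plate, "left", rng=random.Random(0))
```

Because the scene is generated, the ground-truth points come for free from the known geometry, which is what makes this kind of data cheap to scale.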

Jiafei Duan • 2 years ago

5/8🧵: RoboPoint outperforms state-of-the-art VLMs with visual prompting, including @chatgpt4o, Qwen-VL, LLaVA-NeXT, and SpaceVLM, by 21.8% in spatial affordance prediction accuracy. It generalizes to relation types unseen during training, such as “in the middle”.

Jiafei Duan • 2 years ago

6/8🧵: RoboPoint generates consistent predictions with a moving camera. Its predictions are equivariant for view-dependent relations like “to the right” and invariant for view-independent relations like “in between”.
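
The difference between view-dependent and view-independent relations can be shown with a toy top-down geometry. This is only a sketch of the two definitions, not of the model: the 2D world, the `yaw` camera parameter, and the helper names are assumptions made for illustration.

```python
import math


def rotate(p, angle):
    """Rotate a 2D point about the origin by `angle` radians."""
    x, y = p
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)


def right_of(obj_world, yaw, offset=0.1):
    """Target to the viewer's right of an object.

    'Right' is defined in the camera frame (+x of the viewer), so the
    world-frame answer changes when the camera yaw changes.
    """
    cx, cy = rotate(obj_world, -yaw)         # world -> camera frame
    return rotate((cx + offset, cy), yaw)    # camera -> world frame


def in_between(a_world, b_world):
    """Midpoint of two objects; it does not depend on the viewpoint."""
    return ((a_world[0] + b_world[0]) / 2, (a_world[1] + b_world[1]) / 2)


cup, plate = (0.0, 0.0), (1.0, 0.0)
front = right_of(cup, yaw=0.0)        # camera looking from the front
back = right_of(cup, yaw=math.pi)     # camera looking from the opposite side
middle = in_between(cup, plate)       # same world point from any view
```

Here `front` and `back` land on opposite sides of the cup in the world frame, while `middle` is unchanged, which is the behavior the post describes as equivariant and invariant respectively.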

Jiafei Duan • 2 years ago

7/8🧵: Exciting concurrent works (SpatialRGPT @anjjei, SpatialVLM @BoyuanChen0, PIVOT @xf1280, MOKA @fangchenliu_) highlight the trend of enhancing spatial reasoning in robotic VLMs. RoboPoint adds to this by exploring precise affordance prediction through instruction-tuning.

Jiafei Duan • 2 years ago

8/8🧵: This great work was led by @TonyWentaoYuan, assisted by me and many other collaborators from @nvidia @allen_ai @uwcse

Jiafei Duan • 2 years ago

Lastly, both of us will be at @CVPR #CVPR2024! So come and talk to us 👋

RoboDepot🤖 • 2 years ago

An important ability
