Can vision-language-action (VLA) models generalize to diverse OOD tasks and align with customized objectives? 🤔 🚀 We introduce GRAPE, a plug-and-play algorithm to generalize robot policies via preference alignment. GRAPE offers three benefits that boost the generalizability of VLAs: 👉 1. GRAPE aligns VLAs on a trajectory level and endows...

19,988 views • 1 year ago • via X (Twitter)

7 Comments

Huaxiu Yao • 1 year ago

[2/N] Detailed Method
1️⃣ Trajectory-wise Preference Optimization: GRAPE scales up step-wise VLAs and trains with a trajectory-wise objective, aligning policies globally by learning from both successes and failures.
2️⃣ Customized Preference Synthesis: GRAPE breaks down complex tasks into stages, guided by spatiotemporal constraints from VL models. Flexibly aligns for arbitrary objectives, such as safety, efficiency, or task success.
3️⃣ Iterative Online Alignment: GRAPE refines the alignment process through iterative cycles of 1) online sample collection, 2) synthetic preference ranking, and 3) trajectory-wise preference optimization.
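
For readers who want a concrete picture of a trajectory-wise objective, here is a minimal sketch. It assumes a DPO-style pairwise loss in which per-step action log-probabilities are summed over the whole trajectory; the thread does not spell out GRAPE's exact loss, so the function names, the reference-policy terms, and `beta` are all illustrative assumptions rather than the authors' implementation.

```python
import math


def trajectory_logprob(step_logprobs):
    """Sum per-step action log-probabilities into one trajectory-level value."""
    return sum(step_logprobs)


def trajectory_preference_loss(policy_win, policy_lose, ref_win, ref_lose, beta=0.1):
    """DPO-style loss for one (preferred, rejected) trajectory pair.

    Each argument is a list of per-step log-probabilities, under either the
    policy being trained or a frozen reference policy (both hypothetical here).
    """
    margin = beta * (
        (trajectory_logprob(policy_win) - trajectory_logprob(ref_win))
        - (trajectory_logprob(policy_lose) - trajectory_logprob(ref_lose))
    )
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy matches the reference on both trajectories, the margin is zero and the loss equals log 2; the loss falls below that as the policy shifts probability toward the preferred (successful) trajectory and away from the rejected (failed) one.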

Huaxiu Yao • 1 year ago

[3/N] Empirical Takeaway 1: Stronger generalizability on a wide array of OOD tasks.

1️⃣ Real-world OOD tasks
GRAPE crushes OpenVLA-SFT in generalization:
- Visual (new visual environments) 🌆: +20.7%
- Subject (unseen objects): +27.5%
- Action (unseen actions) 🏃: +10.0%
- Semantic (unseen prompts) 🧠: +5.0%
- Language grounding (objects in unseen spatial positions): +26.7%

2️⃣ Simulation OOD tasks
In Simpler-Env, GRAPE shines:
- Subject (unseen objects): +8.0%
- Physical (unseen object sizes/shapes) 🏗️: +12.3%
- Semantic (unseen prompts) 🧠: +19.0%

Huaxiu Yao • 1 year ago

[4/N] Empirical Takeaway 2: Versatility in aligning with customized objectives.
GRAPE excels at aligning robot policies with diverse natural language goals:
✅ Task completion
✅ Safety
✅ Cost-efficiency

Results:
- 🚧 Safer policies: -44.31% collisions
- ⏳ Efficient policies: -11.15% rollout lengths
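
To illustrate how different objectives could change which rollouts are preferred, here is a rough sketch. The thread says GRAPE ranks trajectories using stage-wise spatiotemporal constraints from VL models; this example swaps in a much simpler, hypothetical weighted score over success, collisions, and rollout length, so the `Rollout` fields, the weights, and both function names are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    name: str
    success: bool
    collisions: int
    length: int  # number of steps in the rollout


def score(r, w_success=1.0, w_collision=0.5, w_length=0.01):
    """Hypothetical scalar preference score: reward success, penalize
    collisions (safety) and long rollouts (efficiency)."""
    return w_success * float(r.success) - w_collision * r.collisions - w_length * r.length


def preference_pair(rollouts, **weights):
    """Rank rollouts by score and return (preferred, rejected)."""
    ranked = sorted(rollouts, key=lambda r: score(r, **weights), reverse=True)
    return ranked[0], ranked[-1]


rollouts = [
    Rollout("a", success=True, collisions=0, length=40),
    Rollout("b", success=True, collisions=2, length=30),
    Rollout("c", success=False, collisions=0, length=60),
]
chosen, rejected = preference_pair(rollouts)
safety_chosen, safety_rejected = preference_pair(rollouts, w_collision=5.0)
```

With the default weights the failed rollout `c` is the rejected one; raising `w_collision` makes the colliding rollout `b` the rejected one instead, which is the kind of objective-dependent ranking the thread describes at a high level.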

Huaxiu Yao • 1 year ago

[5/N] Nice work, @ZijianZhangNLP and Kyle Zheng, and a nice collab with @ZRChen_AISafety, @jang_yoel, @Yi_Li_UW, @chaoqi_w, @dingmyu, @fox_dieter17849

ZijianZhangNLP • 1 year ago

Cool work! Thanks to Prof. Yao and for the nice collab!

zihao zheng • 1 year ago

What training resources are required?

Daniel Butler • 1 year ago

GRAPE seems like a promising leap for VLA models in robotics! The trajectory-level preference alignment and reward modeling are particularly intriguing for enabling safer, more efficient, and task-diverse applications. How scalable is GRAPE to real-world multi-agent environments?
