Humans learn and improve from failures. Similarly, foundation models adapt based on human feedback. Can we leverage this failure understanding to enhance robotics systems that use foundation models? Introducing AHA—a vision-language model for detecting and reasoning over failures in robotic manipulation. Project page: 🧵Thread👇 Aha!
48,739 views • 1 year ago • via X (Twitter)

1/13🧵: Failure detection in robotics is challenging. Traditionally, we've used trained classifiers or heuristics, framing it as a binary classification problem. Recently, there's been a shift toward using LLMs and VLMs as open-world failure detectors. However, just as we improve by understanding failures, our robots should too.

2/13🧵: Behind every successful robot demonstration, there are tons of failed attempts from tele-operation or policy rollouts—but these failures are often not collected. How can we scale up robot failure data for instruction-tuning a VLM to reason about failure?

3/13🧵: To systematically curate failures in simulation, we introduce FailGen—a custom environment wrapper for generating failures for robotic manipulation in any simulator. It takes successful robot demonstrations and perturbs them to generate failures systematically. Could generating instruction-tuning data for robotics in simulation be a new paradigm shift?
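To make the "perturb a successful demonstration" idea concrete, here is a minimal sketch. It is not FailGen's actual API: the `Waypoint` structure, the `perturb_demo` function, and the two failure modes (`translation_offset`, `skip_grasp`) are hypothetical names chosen for illustration, and the demonstration is assumed to be a simple list of keyframes.

```python
import copy
import random
from dataclasses import dataclass


@dataclass
class Waypoint:
    """One keyframe of a demonstration (hypothetical structure)."""
    position: tuple      # (x, y, z) end-effector position in metres
    gripper_open: bool   # True if the gripper is open at this keyframe


def perturb_demo(demo, failure_mode, rng, offset=0.05):
    """Turn a successful demo into a failed one and describe the change.

    Returns (failed_demo, explanation). The input demo is left untouched.
    Both failure modes are illustrative assumptions, not FailGen's taxonomy.
    """
    failed = copy.deepcopy(demo)
    if failure_mode == "translation_offset":
        # Shift one randomly chosen keyframe along x so the gripper misses.
        idx = rng.randrange(len(failed))
        x, y, z = failed[idx].position
        failed[idx].position = (x + offset, y, z)
        explanation = f"The gripper was offset by {offset} m along x at step {idx}."
    elif failure_mode == "skip_grasp":
        # Keep the gripper open at the first keyframe where it should close.
        idx = next(i for i, wp in enumerate(failed) if not wp.gripper_open)
        failed[idx].gripper_open = True
        explanation = f"The gripper did not close at step {idx}."
    else:
        raise ValueError(f"unknown failure mode: {failure_mode}")
    return failed, explanation


demo = [
    Waypoint((0.0, 0.0, 0.2), True),    # approach
    Waypoint((0.0, 0.0, 0.0), False),   # grasp
    Waypoint((0.3, 0.0, 0.2), False),   # lift and move
]
failed, why = perturb_demo(demo, "skip_grasp", random.Random(0))
print(why)
```

Because each perturbation is applied deliberately, the explanation comes for free alongside the failed trajectory, which is what makes this kind of data usable for instruction-tuning.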

4/13🧵: Using FailGen, we procedurally generated the largest failure dataset for robotic manipulation: over 49K data points covering 79 tasks from RLBench (@stepjamUK), each paired with a failure explanation for instruction-tuning AHA.

5/13🧵: We instruction-tuned AHA-13B using a mix of real VQA data and our procedurally generated robotic failure data. Surprisingly, adding synthetic data with templated language improved the model's performance.
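As a rough illustration of what "templated language" and a data mix could look like, here is a small sketch. The template wording, the `make_failure_sample` and `mix_datasets` helpers, and the 25% failure fraction are all assumptions made for this example; the thread does not state the actual templates or mixing ratio.

```python
import random

# Hypothetical answer template; the real templates are not given in the thread.
FAILURE_TEMPLATE = "No, the robot failed to {subtask} because {reason}."


def make_failure_sample(subtask, reason):
    """Build one templated question-answer pair for a failed sub-task."""
    return {
        "source": "failure",
        "question": f"Did the robot successfully {subtask}?",
        "answer": FAILURE_TEMPLATE.format(subtask=subtask, reason=reason),
    }


def mix_datasets(vqa, failure, failure_fraction, size, rng):
    """Sample a shuffled training mix with a fixed share of failure samples.

    Samples are drawn with replacement, so either pool may be small.
    """
    n_fail = round(size * failure_fraction)
    mixed = rng.choices(failure, k=n_fail) + rng.choices(vqa, k=size - n_fail)
    rng.shuffle(mixed)
    return mixed


vqa = [{"source": "vqa", "question": "What color is the mug?", "answer": "Red."}]
failure = [
    make_failure_sample(
        "pick up the cup", "the gripper closed before reaching the cup"
    )
]
mixed = mix_datasets(vqa, failure, failure_fraction=0.25, size=8, rng=random.Random(0))
print(sum(s["source"] == "failure" for s in mixed), "of", len(mixed), "are failure samples")
```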

6/13🧵: AHA-13B generalizes failure reasoning across different embodiments, unseen domains, and novel tasks. It outperforms other state-of-the-art VLMs on three failure datasets: RoboFail (Test), ManiSkill-Fail @Stone_Tao, and REFLECT from @Liu_Zeyi_!

7/13🧵: Can we still use AHA as a general-purpose VLM? Yes! It retains all the general-purpose knowledge of the base VLM that inspired it.

8/13🧵: AHA's performance also scales with more failure data!

9/13🧵: In recent years, there have been many robotics works that leverage LLMs and VLMs for generating reward functions, bounding boxes, sub-task verification, and task planning. Using AHA to provide failure explanations could enhance or accelerate the performance of LLMs/VLMs in these tasks, improving downstream results.

10/13🧵: We demonstrated three downstream robotic applications using AHA. First, we integrated AHA with Eureka's @JasonMa2020 formulation to accelerate the reward function search.

11/13🧵: We can also integrate AHA into TAMP systems that use LLMs for task-plan generation (like Proc3s @nishanthkumar23). AHA helps refine task plans to better align with human intent.

12/13🧵: We can use AHA to replace sub-task verifiers in systems like Manipulate-Anything, increasing task success rates.

13/13🧵: We're excited to see more robotics applications leverage AHA for failure reasoning feedback. Understanding failure could be key to building the 🍓 version of RoboGPT.

Lastly, I would like to thank all my collaborators: @wpumacay7567 @YiruHelenWang @TonyWentaoYuan @shulin_tian, my advisors @RanjayKrishna and Dieter Fox, and most importantly my @NVIDIARobotics mentors @AjayMandlekar and Yijie Guo!

@YiruHelenWang @TonyWentaoYuan @shulin_tian @RanjayKrishna @NVIDIARobotics @AjayMandlekar and my buddy @nishanthkumar23
