Humans learn and improve from failures. Similarly, foundation models adapt based on human feedback. Can we leverage this failure understanding to enhance robotics systems that use foundation models? Introducing AHA—a vision-language model for detecting and reasoning over failures in robotic manipulation. Project page: 🧵Thread👇 Aha!

48,739 views • 1 year ago • via X (Twitter)

15 comments

Jiafei Duan • 1 year ago

1/13🧵: Failure detection in robotics is challenging. Traditionally, we've used trained classifiers or heuristics, framing it as a binary classification problem. Recently, there's been a shift toward using LLMs and VLMs as open-world failure detectors. However, just as we improve by understanding failures, our robots should too.

Jiafei Duan • 1 year ago

2/13🧵: Behind every successful robot demonstration, there are tons of failed attempts from tele-operation or policy rollouts—but these failures are often not collected. How can we scale up robot failure data for instruction-tuning a VLM to reason about failure?

Jiafei Duan • 1 year ago

3/13🧵: To systematically curate failures in simulation, we introduce FailGen—a custom environment wrapper for generating failures for robotic manipulation in any simulator. It takes successful robot demonstrations and perturbs them to generate failures systematically. Could generating instruction-tuning data for robotics in simulation be a new paradigm shift?
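
To make the perturbation idea concrete, here is a minimal sketch of turning one successful demonstration into a failed one with a templated explanation. It is not FailGen's actual API: the `Waypoint` type, the `perturb_demo` function, the two failure modes, and the explanation templates are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, replace
import random


@dataclass(frozen=True)
class Waypoint:
    """One step of a demonstration (hypothetical, simplified representation)."""
    x: float
    y: float
    z: float
    gripper_closed: bool


def perturb_demo(demo, mode, index, rng, offset=0.05):
    """Corrupt one waypoint of a successful demo.

    Returns (failed_demo, explanation). The input demo is left unchanged.
    The failure modes here are illustrative assumptions, not FailGen's taxonomy.
    """
    if not 0 <= index < len(demo):
        raise IndexError(f"waypoint index out of range: {index}")
    failed = list(demo)
    wp = demo[index]
    if mode == "position_offset":
        # Shift the target along x so the gripper arrives misaligned.
        dx = rng.choice([-offset, offset])
        failed[index] = replace(wp, x=wp.x + dx)
        explanation = f"The gripper moved to a misaligned position at step {index}."
    elif mode == "no_grasp":
        # Keep the gripper open where the demo closed it.
        failed[index] = replace(wp, gripper_closed=False)
        explanation = f"The gripper failed to close at step {index}."
    else:
        raise ValueError(f"unknown failure mode: {mode}")
    return failed, explanation


demo = [
    Waypoint(0.0, 0.0, 0.2, False),  # approach
    Waypoint(0.0, 0.0, 0.0, True),   # grasp
    Waypoint(0.3, 0.0, 0.2, True),   # lift and move
]
failed_demo, text = perturb_demo(demo, "no_grasp", 1, random.Random(0))
```

Pairing each perturbed trajectory with its explanation string is what would yield (observation, question, answer) style data for instruction tuning; how the real system renders and labels these is not specified in this thread.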

Jiafei Duan • 1 year ago

4/13🧵: Using FailGen, we procedurally generated the largest failure dataset for robotic manipulation, covering 79 tasks from RLBench (@stepjamUK) with over 49K data points, along with corresponding failure explanations for instruction-tuning AHA.

Jiafei Duan • 1 year ago

5/13🧵: We instruction-tuned AHA-13B using a mix of real VQA data and our procedurally generated robotic failure data. Surprisingly, adding synthetic data with templated language improved the model's performance.

Jiafei Duan • 1 year ago

6/13🧵: AHA-13B generalizes failure reasoning across different embodiments, unseen domains, and novel tasks. It outperforms other state-of-the-art VLMs on three failure datasets: RoboFail (Test), ManiSkill-Fail @Stone_Tao, and REFLECT from @Liu_Zeyi_!

Jiafei Duan • 1 year ago

7/13🧵: Can we still use AHA as a general-purpose VLM? Yes! It retains all the general-purpose knowledge of the base VLM that inspired it.

Jiafei Duan • 1 year ago

8/13🧵: AHA's performance also scales with more failure data!

Jiafei Duan • 1 year ago

9/13🧵: In recent years, there have been many robotics works that leverage LLMs and VLMs for generating reward functions, bounding boxes, sub-task verification, and task planning. Using AHA to provide failure explanations could enhance or accelerate the performance of LLMs/VLMs in these tasks, improving downstream results.

Jiafei Duan • 1 year ago

10/13🧵: We demonstrated three downstream robotic applications using AHA. First, we integrated AHA with Eureka's @JasonMa2020 formulation to accelerate the reward function search.

Jiafei Duan • 1 year ago

11/13🧵: We can also integrate AHA into TAMP systems that use LLMs for task-plan generation (like Proc3s @nishanthkumar23). AHA helps refine task plans to better align with human intent.

Jiafei Duan • 1 year ago

12/13🧵: We can use AHA to replace sub-task verifiers in systems like Manipulate-Anything, increasing task success rates.
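
As a rough illustration of where a failure-reasoning model could sit in such a pipeline, the sketch below runs each sub-task, asks a verifier whether it succeeded, and feeds the verifier's explanation back into the retry. Everything here is an assumption for illustration: `run_with_verifier`, the `execute` and `verify` callables, and the toy stand-ins are hypothetical and are not the Manipulate-Anything or AHA interfaces.

```python
def run_with_verifier(subtasks, execute, verify, max_retries=2):
    """Execute sub-tasks in order, retrying each with the verifier's feedback.

    execute(subtask, feedback) -> observation
    verify(subtask, observation) -> (ok, explanation or None)
    Returns (all_succeeded, log), where log holds one tuple per attempt.
    """
    log = []
    for subtask in subtasks:
        feedback = None
        for attempt in range(max_retries + 1):
            observation = execute(subtask, feedback)
            ok, feedback = verify(subtask, observation)
            log.append((subtask, attempt, ok, feedback))
            if ok:
                break
        else:
            # Retries exhausted without success: stop the whole plan.
            return False, log
    return True, log


# Toy stand-ins (hypothetical): the first attempt always misses, and any
# attempt that receives feedback succeeds.
def toy_execute(subtask, feedback):
    return "done" if feedback is not None else "missed"


def toy_verify(subtask, observation):
    if observation == "done":
        return True, None
    return False, "The gripper missed the object."


success, attempts = run_with_verifier(["grasp cube", "place cube"], toy_execute, toy_verify)
```

In a real system the `verify` role would be played by the VLM given camera frames and the sub-task description, and the explanation would condition the next attempt; the binary-verifier baseline corresponds to returning only `ok` with no explanation.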

Jiafei Duan • 1 year ago

13/13🧵: We're excited to see more robotics applications leverage AHA for failure reasoning feedback. Understanding failure could be key to building the 🍓 version of RoboGPT.

Jiafei Duan • 1 year ago

Lastly, I would like to thank all my collaborators: @wpumacay7567 @YiruHelenWang @TonyWentaoYuan @shulin_tian, my advisors @RanjayKrishna and Dieter Fox, and most importantly my @NVIDIARobotics mentors @AjayMandlekar and Yijie Guo!

Jiafei Duan • 1 year ago

@YiruHelenWang @TonyWentaoYuan @shulin_tian @RanjayKrishna @NVIDIARobotics @AjayMandlekar and my buddy @nishanthkumar23
