Humans learn and improve from failures. Similarly, foundation models adapt based on human feedback. Can we leverage this failure understanding to enhance robotics systems that use foundation models? Introducing AHA—a vision-language model for detecting and reasoning over failures in robotic manipulation. Project page: 🧵Thread👇 Aha!

Jiafei Duan

5,743 subscribers

48,739 views • 1 year ago •via X (Twitter)

15 Comments

Jiafei Duan • 1 year ago

1/13🧵: Failure detection in robotics is challenging. Traditionally, we've used trained classifiers or heuristics, framing it as a binary classification problem. Recently, there's been a shift toward using LLMs and VLMs as open-world failure detectors. However, just as we improve by understanding failures, our robots should too.

Jiafei Duan • 1 year ago

2/13🧵: Behind every successful robot demonstration, there are tons of failed attempts from tele-operation or policy rollouts—but these failures are often not collected. How can we scale up robot failure data for instruction-tuning a VLM to reason about failure?

Jiafei Duan • 1 year ago

3/13🧵: To systematically curate failures in simulation, we introduce FailGen—a custom environment wrapper for generating failures for robotic manipulation in any simulator. It takes successful robot demonstrations and perturbs them to generate failures systematically. Could generating instruction-tuning data for robotics in simulation be a new paradigm shift?
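As a rough illustration of the "take a successful demonstration and perturb it" idea, here is a minimal Python sketch. It is not FailGen's actual API: the `Waypoint` type, the `perturb` function, and the three perturbation names are hypothetical, chosen only to show how a labelled failure could be derived from a clean trajectory.

```python
import copy
import random
from dataclasses import dataclass


@dataclass
class Waypoint:
    """One keyframe of a demonstration (hypothetical, simplified)."""
    position: tuple      # (x, y, z) end-effector position in metres
    gripper_open: bool   # True if the gripper is open at this keyframe


def perturb(demo, mode, index, rng, max_offset=0.05):
    """Return (failed_demo, label) without mutating the original demo.

    The mode names are illustrative assumptions, not the thread's taxonomy.
    """
    failed = copy.deepcopy(demo)
    if mode == "translation_offset":
        # Shift the keyframe by a bounded random amount along each axis.
        wp = failed[index]
        wp.position = tuple(
            c + rng.uniform(-max_offset, max_offset) for c in wp.position
        )
    elif mode == "gripper_stays_open":
        # Force the gripper open so a grasp at this keyframe never happens.
        failed[index].gripper_open = True
    elif mode == "swapped_steps":
        # Execute this keyframe and the next one in the wrong order.
        nxt = (index + 1) % len(failed)
        failed[index], failed[nxt] = failed[nxt], failed[index]
    else:
        raise ValueError(f"unknown perturbation mode: {mode}")
    # The label records what was injected, so an explanation can be written later.
    return failed, {"mode": mode, "waypoint": index}


demo = [
    Waypoint((0.0, 0.0, 0.20), True),    # approach
    Waypoint((0.1, 0.0, 0.05), False),   # grasp
    Waypoint((0.3, 0.0, 0.20), False),   # lift and move
]
failed_demo, label = perturb(demo, "gripper_stays_open", 1, random.Random(0))
```

Because the perturbation is injected deliberately, each failed trajectory comes with a ground-truth label for free, which is what makes this kind of data cheap to scale in simulation.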

Jiafei Duan • 1 year ago

4/13🧵: Using FailGen, we procedurally generated the largest failure dataset for robotic manipulation, covering 79 RLBench tasks (@stepjamUK) with over 49K data points, along with corresponding failure explanations for instruction-tuning AHA.

Jiafei Duan • 1 year ago

5/13🧵: We instruction-tuned AHA-13B using a mix of real VQA data and our procedurally generated robotic failure data. Surprisingly, adding synthetic data with templated language improved the model's performance.
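To make "synthetic data with templated language" concrete, the sketch below turns a failure label into a question/answer pair and mixes it with general VQA samples. Everything here is an assumption for illustration: the template wording, the field names, and the `make_failure_sample` and `mix_datasets` helpers are hypothetical and do not reflect the actual AHA data format or mixing ratio.

```python
import random

# Hypothetical answer templates, keyed by an illustrative failure mode name.
TEMPLATES = {
    "translation_offset": "No, the gripper moved to the wrong position while trying to {subtask}.",
    "gripper_stays_open": "No, the gripper never closed while trying to {subtask}.",
    "swapped_steps": "No, the robot performed the steps out of order while trying to {subtask}.",
}

QUESTION = (
    "For the task '{task}', did the robot succeed at the sub-task "
    "'{subtask}'? If not, explain why."
)


def make_failure_sample(task, subtask, mode):
    """Fill the templates to build one instruction-tuning sample."""
    return {
        "question": QUESTION.format(task=task, subtask=subtask),
        "answer": TEMPLATES[mode].format(subtask=subtask),
        "source": "synthetic_failure",
    }


def mix_datasets(vqa_samples, failure_samples, rng):
    """Concatenate and shuffle real VQA samples with synthetic failure samples."""
    mixed = list(vqa_samples) + list(failure_samples)
    rng.shuffle(mixed)
    return mixed


vqa = [{"question": "What color is the mug?", "answer": "Red.", "source": "vqa"}]
failures = [make_failure_sample("stack blocks", "pick up the red block", "gripper_stays_open")]
training_set = mix_datasets(vqa, failures, random.Random(0))
```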

Jiafei Duan • 1 year ago

6/13🧵: AHA-13B generalizes failure reasoning across different embodiments, unseen domains, and novel tasks. It outperforms other state-of-the-art VLMs on three failure datasets: RoboFail (Test), ManiSkill-Fail (@Stone_Tao), and REFLECT from @Liu_Zeyi_!

Jiafei Duan • 1 year ago

7/13🧵: Can we still use AHA as a general-purpose VLM? Yes! It retains all the general-purpose knowledge of the base VLM that inspired it.

Jiafei Duan • 1 year ago

8/13🧵: AHA's performance also scales as more failure data is added!

Jiafei Duan • 1 year ago

9/13🧵: In recent years, there have been many robotics works that leverage LLMs and VLMs for generating reward functions, bounding boxes, sub-task verification, and task planning. Using AHA to provide failure explanations could enhance or accelerate the performance of LLMs/VLMs in these tasks, improving downstream results.

Jiafei Duan • 1 year ago

10/13🧵: We demonstrated three downstream robotic applications using AHA. First, we integrated AHA with Eureka's @JasonMa2020 formulation to accelerate the reward function search.

Jiafei Duan • 1 year ago

11/13🧵: We can also integrate AHA into TAMP systems that use LLMs for task-plan generation (like Proc3s @nishanthkumar23). AHA helps refine task plans to better align with human intent.

Jiafei Duan • 1 year ago

12/13🧵: We can use AHA to replace sub-task verifiers in systems like Manipulate-Anything, increasing task success rates.
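One way to picture a failure-reasoning model acting as a sub-task verifier is a loop that checks each sub-task and feeds the failure explanation back into the next attempt. The sketch below is a generic illustration under that assumption, not the Manipulate-Anything or AHA interface: `run_with_verification`, `execute`, and `verify` are hypothetical names, and the verifier is any callable that returns a success flag plus an explanation.

```python
from typing import Any, Callable, List, Tuple

# Hypothetical signatures: execute(subtask, feedback) -> observation,
# verify(subtask, observation) -> (succeeded, explanation).
Executor = Callable[[str, str], Any]
Verifier = Callable[[str, Any], Tuple[bool, str]]


def run_with_verification(
    subtasks: List[str],
    execute: Executor,
    verify: Verifier,
    max_retries: int = 2,
):
    """Run sub-tasks in order, retrying with the verifier's explanation as feedback."""
    log = []
    for subtask in subtasks:
        feedback = ""  # no feedback on the first attempt
        for attempt in range(max_retries + 1):
            observation = execute(subtask, feedback)
            ok, explanation = verify(subtask, observation)
            log.append((subtask, attempt, ok, explanation))
            if ok:
                break
            # Pass the failure explanation to the next attempt.
            feedback = explanation
        else:
            # All attempts failed: stop the whole task.
            return False, log
    return True, log
```

The design point is that a verifier returning an explanation, rather than only a yes/no answer, gives the retry something to act on.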

Jiafei Duan • 1 year ago

13/13🧵: We're excited to see more robotics applications leverage AHA for failure reasoning feedback. Understanding failure could be key to building the 🍓 version of RoboGPT.

Jiafei Duan • 1 year ago

Lastly, I would like to thank all my collaborators: @wpumacay7567 @YiruHelenWang @TonyWentaoYuan @shulin_tian, my advisors @RanjayKrishna and Dieter Fox, and most importantly my @NVIDIARobotics mentors @AjayMandlekar and Yijie Guo!

Jiafei Duan • 1 year ago

@YiruHelenWang @TonyWentaoYuan @shulin_tian @RanjayKrishna @NVIDIARobotics @AjayMandlekar and my buddy @nishanthkumar23

Related Videos

Can we build a generalist robotic policy that doesn’t just memorize training data and regurgitate it during test time, but instead remembers past actions as memory and conditions its decisions on them?🤖💡 Introducing SAM2Act—a multi-view robotic transformer-based policy that integrates a visual foundation model with a memory architecture for robotic manipulation. Project page: 🧵👇

Jiafei Duan

87,573 views • 1 year ago

Qwen-VLA feels like one of the first real robotics foundation models. A single system trained across robot manipulation, navigation, egocentric human video, simulation, and vision-language reasoning instead of isolated robot policies.

Robots Digest 🤖

14,601 views • 21 days ago

Chain-of-thought reasoning is a powerful tool to enable language models to work through complex problems. Can we use this with robots? With embodied chain-of-thought, vision-language-action (VLA) models can think through perception and planning! A 🧵👇

Sergey Levine

30,388 views • 1 year ago

CosmicMan A Text-to-Image Foundation Model for Humans We present CosmicMan, a text-to-image foundation model specialized for generating high-fidelity human images. Unlike current general-purpose foundation models that are stuck in the dilemma of inferior quality and

AK

46,778 views • 2 years ago

Yay, finally! Introducing Vision Banana🍌 from Google DeepMind, our unified model that outperforms SoTA specialist models on various vision tasks! By treating 2D/3D vision tasks as image generation, we unlock a new foundation for CV. Project page: (1/5)

Songyou Peng

283,432 views • 1 month ago

What if robots could improve themselves by learning from their own failures in the real-world? Introducing 𝗣𝗟𝗗 (𝗣𝗿𝗼𝗯𝗲, 𝗟𝗲𝗮𝗿𝗻, 𝗗𝗶𝘀𝘁𝗶𝗹𝗹) — a recipe that enables Vision-Language-Action (VLA) models to self-improve for high-precision manipulation tasks. PLD couples real-world residual reinforcement learning with standard supervised fine-tuning — letting robots discover, recover, and distill their own data flywheel. Quick 🧵

Wenli Xiao

184,685 views • 7 months ago

How to harness foundation models for *generalization in the wild* in robot manipulation? Introducing VoxPoser: use LLM+VLM to label affordances and constraints directly in 3D perceptual space for zero-shot robot manipulation in the real world! 🌐 🧵👇

Wenlong Huang

293,876 views • 2 years ago

Non-robustness hints at paradigm failures. Reasoning can improve robustness. Alexander Wei explores reasoning-based defenses that let models ‘think’ before responding, helping counter adversarial attacks and strengthen AI safety.

FAR.AI

508,419 views • 1 year ago

Today, we are introducing RFM-1, our Robotics Foundation Model giving robots human-like reasoning capabilities.

Covariant

118,306 views • 2 years ago

We can teach LLMs to write better robot code through natural language feedback. But can LLMs remember what they were taught and improve their teachability over time? Introducing our latest work, Learning to Learn Faster from Human Feedback with Language Model Predictive Control

Jacky Liang

86,652 views • 2 years ago

Introducing Collaborative Reasoner: a framework to improve collaborative reasoning in language models. Collaborative Reasoner paves the way for developing social agents that can partner with humans and other agents. Read the research paper and download the code.

AI at Meta

58,510 views • 1 year ago

We are back. After one year of quiet building. Introducing GENE-26.5, our first robotic brain that takes a major step toward human-level capability. For years, robotics has struggled to learn from the world’s largest and most valuable data source: Humans. Solving it means rethinking the whole stack from the ground up: - A robotics-native foundation model. - A 1:1 human-like robotic hand. - A noninvasive data collection glove for motion, force, and touch. - A simulator that turns weeks of experiments into minutes. GENE-26.5 is trained across language, vision, proprioception, tactile, and action. We designed a set of tasks to test how far we can go with this new paradigm. Fully autonomous, 1x speed, one model, same weights. (Enjoy with sound on) We are approaching the endgame for robotics. And this is just the beginning.

Genesis AI

2,696,061 views • 1 month ago

Introducing General Intuition and our $133.7M Seed from Khosla Ventures, General Catalyst, and Raine. We build foundation models and general agents for environments that require deep spatial and temporal reasoning.

General Intuition

2,275,179 views • 8 months ago

State-of-the-art robot policies often need hundreds of hours of data. What if we needed none? Introducing TiPToP: a manipulation system that zero-shots open-world tasks from pixels and language using vision foundation models and GPU-parallelized Task and Motion Planning (TAMP).

Nishanth Kumar

77,439 views • 3 months ago

Foundation models are enough to solve robotics! Unfortunately, this is not true. We keep hearing that Vision-Language-Action (VLA) models struggle because of the gap between static training and the dynamic real world. A German startup (Sereact) just released a solution that bridges this gap perfectly. They are introducing a new paradigm called Interactive RL Policy Patching. It's a distributed framework that allows robots to self-learn from human corrections without needing full retraining. When a robot fails, a human operator provides a brief "patch" or demonstration. The system then uses online off-policy reinforcement learning to update the behavior instantly. This is powered by a massive foundation model trained on hundreds of millions of interactions from over 100 deployed robot stations. The best part is the distributed parameter synchronization... When one robot learns a fix, the update is published fleet-wide... meaning the entire swarm gets smarter from a single human intervention. They are already proving this on complex manipulation tasks like shoe unboxing and screw sorting, drastically reducing the data needed to handle edge cases. Real-world environments are unforgiving, and I love seeing systems that can actually adapt on the fly! 📍 More info:

Ilir Aliu

18,210 views • 4 months ago

What happens when vision🤝 robotics meet? Happy to share our new work on Pretraining Robotic Foundational Models!🔥 ARM4R is an Autoregressive Robotic Model that leverages low-level 4D Representations learned from human video data to yield a better robotic model. Berkeley AI Research😊

Roei Herzig

62,711 views • 1 year ago

A reward model that works, zero-shot, across robots, tasks, and scenes? Introducing Robometer: Scaling general-purpose robotic reward models with 1M+ trajectories. Enables zero-shot: online/offline/model-based RL, data retrieval + IL, automatic failure detection, and more! 🧵 (1/12)

Jesse Zhang

99,840 views • 3 months ago

Language following is a tough problem for VLAs: while these models can follow complex language, in practice getting datasets that enable language following is hard. We developed a method to counterfactually and automatically label data to improve language following! 🧵👇

Sergey Levine

44,176 views • 10 months ago

🚨 BREAKING: Microsoft's first robotics foundation model! 🤯 Microsoft just announced Rho-alpha (ρα), their first robotics model derived from the Phi series of vision-language models. Rho-alpha translates natural language commands into control signals for robotic systems performing bimanual manipulation tasks. Commands like "push the green button with the right gripper," "pull out the red wire," "flip the top switch on," or "turn the knob to position 5" get executed directly by dual-arm robots. What makes this different from standard vision-language-action (VLA) models is the additional modalities. Rho-alpha is a VLA+ model that adds tactile sensing to the perceptual mix, with plans to incorporate force feedback. On the learning side, the model is designed to continually improve during deployment by learning from human feedback. The training approach combines trajectories from physical demonstrations and simulated tasks with web-scale visual question answering data. Since teleoperation data is scarce and expensive, Microsoft is using NVIDIA Isaac Sim on Azure to generate physically accurate synthetic datasets via reinforcement learning. These simulated trajectories get combined with commercial and open physical demonstration datasets. The model is currently under evaluation on dual-arm setups and humanoid robots. Microsoft is opening an Early Access Program for organizations interested in evaluating Rho-alpha. Robots that can adapt to dynamic situations and human preferences are more useful in real environments and more trusted by the people operating them. Read more here:

Lukas Ziegler

60,837 views • 4 months ago

Another day, another humanoid robot from China. AGIBOT introduces GO-1, a generalist foundation model that integrates a vision-language model with a latent planner for enhanced long-horizon and dexterous manipulation.

Chubby♨️

23,513 views • 1 year ago