Text-to-image generative models, meet robotics! We present ROSIE: Scaling RObot Learning with Semantically Imagined Experience, where we augment real robotics data with semantically imagined scenarios for downstream manipulation learning. Website: 🧵👇

196,378 views • 3 years ago • via X (Twitter)

16 Comments

Fei Xia • 3 years ago

Collecting real-world robotics data is incredibly resource-intensive. For example, it took our fleet of 13 mobile manipulators 17 months to collect 130k manipulation episodes. Can we extend this data “for free” by augmenting it?

Fei Xia • 3 years ago

We present a system that automatically augments robot training data. All you need to tell it is a source task “place coke can into top drawer” and a target task “place coke can into cluttered top drawer”. The system outputs a few augmentation schemes, including masks and edits.

Fei Xia • 3 years ago

We use an open-vocabulary image segmentation model derived from OWL-ViT for mask generation, and Imagen-Editor for image inpainting. We then train an RT-1 policy on top of the mixed data.
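The mask, inpaint, and mix steps described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `segment_region`, `inpaint`, and `augment_episode` are hypothetical stand-ins, not the actual OWL-ViT, Imagen-Editor, or RT-1 interfaces, and the "models" here are trivial placeholders so the data flow can be run end to end.

```python
import numpy as np

def segment_region(frame, query):
    """Hypothetical stand-in for an open-vocabulary segmenter.

    Returns a boolean (H, W) mask. A real model would localize `query`;
    this placeholder just marks a fixed box.
    """
    h, w, _ = frame.shape
    mask = np.zeros((h, w), dtype=bool)
    mask[h // 4 : h // 2, w // 4 : w // 2] = True
    return mask

def inpaint(frame, mask, prompt):
    """Hypothetical stand-in for a text-guided inpainting model.

    A real editor would synthesize pixels matching `prompt`;
    this placeholder fills the masked region with a constant.
    """
    out = frame.copy()
    out[mask] = 128
    return out

def augment_episode(frames, actions, query, prompt, new_instruction):
    """Edit every frame, keep the recorded actions, swap the instruction."""
    edited = np.stack(
        [inpaint(f, segment_region(f, query), prompt) for f in frames]
    )
    return {"frames": edited, "actions": actions, "instruction": new_instruction}

# Toy episode: 3 frames of 8x8 RGB, 7-dimensional actions (sizes are arbitrary).
frames = np.zeros((3, 8, 8, 3), dtype=np.uint8)
actions = np.ones((3, 7))
augmented = augment_episode(
    frames, actions,
    query="top drawer",
    prompt="cluttered top drawer",
    new_instruction="place coke can into cluttered top drawer",
)
# Mixed data: original episodes plus their augmented copies.
mixed = [
    {"frames": frames, "actions": actions,
     "instruction": "place coke can into top drawer"},
    augmented,
]
```

Note that in this sketch only the pixels and the instruction change; the action labels are reused as recorded.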

Fei Xia • 3 years ago

A few interesting things we found along the way:

Fei Xia • 3 years ago

1) We can complete tasks **only seen through** diffusion models. For example, we augment “putting objects in drawer” tasks into “putting objects in sink”, by reimagining the drawer as a metal sink. The policy trained on the mixed data is able to put objects into the sink!

Fei Xia • 3 years ago

2) Generative data augmentation works with a high-dimensional continuous action space and image observations. Our action space is the end-effector delta pose in 3D space, and the input is image frames. This contrasts with other work on diffusion augmentation, which targets perception.
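To make the "delta pose" action space concrete, here is a minimal sketch that integrates per-step end-effector deltas into a trajectory. It assumes, for simplicity, that each action is only a 3-D translational delta; rotation and gripper components are omitted, and `rollout_positions` is a hypothetical helper, not part of RT-1.

```python
import numpy as np

def rollout_positions(start_xyz, delta_xyz):
    """Accumulate per-step translational deltas into absolute positions.

    start_xyz: (3,) initial end-effector position.
    delta_xyz: (T, 3) per-step position deltas (the continuous actions).
    Returns a (T, 3) array with the position after each step.
    """
    start = np.asarray(start_xyz, dtype=float)
    deltas = np.asarray(delta_xyz, dtype=float)
    return start + np.cumsum(deltas, axis=0)

positions = rollout_positions(
    [0.5, 0.0, 0.3],
    [[0.01, 0.0, 0.0], [0.01, 0.02, 0.0], [0.0, 0.0, -0.05]],
)
```

Because the actions are relative motions rather than image-space labels, editing the pixels of a frame leaves them valid as long as the edit does not change what the arm should do.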

Fei Xia • 3 years ago

Although our work doesn’t guarantee temporal consistency, the high-capability architecture (RT-1) is able to handle the flickering in the frames and still generalize to the real world. For example, here is the training data and real-world rollout for "picking up <color> cloth"

Fei Xia • 3 years ago

3) The augmentation is photorealistic and simulates rich visual nuances. Previously we have explored the knowledge and information encoded in vision-language foundation models in our work Socratic Models and Inner Monologue. This time we investigate the other side of the coin.

Fei Xia • 3 years ago

There is vast knowledge encoded in those diffusion models and, to our surprise, there are even signs of life that they understand some physics by modeling the image formation process: see how the generated cloth has folds within the gripper pinch. This warrants further investigation.

Fei Xia • 3 years ago

We think there are a few directions that could be further explored. It seems that diffusion models can act as a supplementary source of data to simulation. We know that we can do sim-to-real, maybe we can also do diffusion-to-real, or combine both?

Fei Xia • 3 years ago

Text guidance is very important here because we can synthesize data for specific scenarios and for long-tail, low-data regimes. Essentially we get image and text alignment for free, which is fundamentally different from previous augmentation methods.

Fei Xia • 3 years ago

Our current method is somewhat limited by temporal consistency. We aim to incorporate video diffusion models and masked transformers for better data augmentation. It's also possible to incorporate ControlNet to better harness those models.

Fei Xia • 3 years ago

Generative AI + robotics is a vibrant community. Shout out to our concurrent works GenAug and CACTI.

Fei Xia • 3 years ago

This work is led by @TianheYu, in collaboration with an amazing team: @xiao_ted, Austin Stone, @JonathanTompson, Anthony Brohan, Su Wang, @brian_ichter and @hausman_k 🙌🙌

Fei Xia • 3 years ago

Visit our website to learn more, and check out the interactive demo there. While we used Imagen-Editor for our work, the method is compatible with Stable Diffusion as well. We aim to open-source a specific version soon.
