Diffusion models make great images. But can they drive robots? Usually that gets complicated really fast. We figured out how to get a Stable Diffusion model (based on Instruct pix2pix) to drive robotic instruction following. Simple recipe, works on a wide range of tasks. Thread👇

126,523 views • 2 years ago • via X (Twitter)

10 comments

Sergey Levine · 2 years ago

The idea is very simple: Stable Diffusion is finetuned for image editing (Instruct pix2pix), and we finetune it more on robot data to predict intermediate subgoals for performing instructions. Then a goal-conditioned policy controls the robot to match the generated subgoal.
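The two-level loop described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the released SuSIE code: `propose_subgoal` is a hypothetical stand-in for the finetuned image-editing diffusion model, `goal_policy` is a hypothetical stand-in for the goal-conditioned policy, and camera images are replaced by 2D points so the example runs without any dependencies. The instruction strings, step size, and horizon values are made up for the example.

```python
from typing import List, Tuple

State = Tuple[float, float]  # toy stand-in for a camera image

# Hypothetical instruction -> final target mapping (assumption for this toy only).
TARGETS = {"move right": (1.0, 0.0), "move up": (0.0, 1.0)}


def propose_subgoal(state: State, instruction: str) -> State:
    """Stand-in for the diffusion model: returns an intermediate subgoal
    halfway between the current state and the instruction's target."""
    tx, ty = TARGETS[instruction]
    return (state[0] + 0.5 * (tx - state[0]), state[1] + 0.5 * (ty - state[1]))


def _clip(value: float, limit: float) -> float:
    return max(-limit, min(limit, value))


def goal_policy(state: State, goal: State, max_step: float = 0.1) -> State:
    """Stand-in for the low-level policy: a bounded step toward the goal."""
    return (_clip(goal[0] - state[0], max_step), _clip(goal[1] - state[1], max_step))


def run_episode(
    instruction: str,
    start: State,
    num_subgoals: int = 8,
    steps_per_subgoal: int = 5,
) -> List[State]:
    """Alternate between proposing a subgoal and chasing it for a few steps."""
    state = start
    trajectory = [state]
    for _ in range(num_subgoals):
        subgoal = propose_subgoal(state, instruction)  # high level
        for _ in range(steps_per_subgoal):
            action = goal_policy(state, subgoal)  # low level
            state = (state[0] + action[0], state[1] + action[1])
            trajectory.append(state)
    return trajectory


if __name__ == "__main__":
    path = run_episode("move right", (0.0, 0.0))
    print(f"final state: {path[-1]}")
```

The point of the split is visible even in the toy: the low-level function never sees the instruction, only the current state and the next subgoal.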

Sergey Levine · 2 years ago

Why does it work? The diffusion model transfers web-scale knowledge, and generalizes well to novel objects and scenes (that the robot never saw). The robot policy has a much easier problem to solve: it only needs to match short-term goals, often just matching arm position.

Sergey Levine · 2 years ago

Our method, SuSIE, can follow a broad range of instructions, far beyond what would be possible with only the robot data. At the same time it is efficient and easy to use -- the Instruct pix2pix model is used without any changes (only finetuned) and the low-level policy is simple GCBC.
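GCBC here refers to goal-conditioned behavioral cloning: supervised learning of actions from demonstrations, with a goal as an extra input. A minimal sketch of the idea follows, assuming goals are obtained by hindsight relabeling (pairing each step with a state reached later in the same demonstration). Everything in it is a toy assumption: states are 1D numbers rather than images, the policy is a single scalar weight rather than a neural network, and `make_demo`, `relabel`, and `fit_gcbc` are hypothetical names, not functions from the released code.

```python
import random
from typing import List, Tuple


def make_demo(start: float, target: float, length: int = 20) -> List[Tuple[float, float]]:
    """Toy 1D demonstration: (state, action) pairs that move toward `target`."""
    demo, state = [], start
    for _ in range(length):
        action = 0.2 * (target - state)
        demo.append((state, action))
        state += action
    return demo


def relabel(demo, horizon: int = 5, rng=random) -> List[Tuple[float, float, float]]:
    """Hindsight relabeling: use a state reached 1..horizon steps later as the goal."""
    samples = []
    for t, (state, action) in enumerate(demo[:-1]):
        k = rng.randint(t + 1, min(t + horizon, len(demo) - 1))
        goal = demo[k][0]
        samples.append((state, goal, action))
    return samples


def fit_gcbc(samples) -> float:
    """Closed-form least-squares fit of a linear policy a = w * (goal - state)."""
    num = sum(a * (g - s) for s, g, a in samples)
    den = sum((g - s) ** 2 for s, g, a in samples)
    return num / den


if __name__ == "__main__":
    rng = random.Random(0)
    data = []
    for start, target in [(0.0, 1.0), (2.0, -1.0), (-3.0, 0.5)]:
        data += relabel(make_demo(start, target), rng=rng)
    w = fit_gcbc(data)
    print(f"learned gain w = {w:.3f}")
```

Because the goals are short-horizon states from the same trajectory, the policy only has to learn how to close a small gap, which mirrors the "match short-term goals" role described in the thread.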

Sergey Levine · 2 years ago

In experiments, SuSIE actually ends up beating the much much larger RT-2-X model, trained on the giant RT-X robot dataset, despite using more than an order of magnitude less robot data (about 60k). Though RT-2-X puts up a really good fight🙂

Sergey Levine · 2 years ago

We've released the code here: We hope everyone will use it! I personally think this offers probably the easiest way to boost robot capabilities with web-scale data. Use web-scale data for what it's good at: visuo-semantic association. Simple, effective.

Sergey Levine · 2 years ago

For more, see the project website: Paper here: A fun collaboration w/ @kvablack, @mitsuhiko_nm, Pranav Atreya, @HomerWalke, @chelseabfinn, @aviral_kumar2

Oier Mees · 2 years ago

I wanted to highlight that, despite the authors being humble about this, they have significantly outperformed SOTA on the challenging CALVIN zero-shot benchmark 🙌 @kvablack @mitsuhiko_nm @HomerWalke Pranav Atreya @aviral_kumar2 @chelseabfinn @svlevine

Igor Gilitschenski · 2 years ago

I love this! This type of data-generation via data-driven simulation can also alleviate the need for collecting data from some complicated or dangerous edge cases. Instead, image editing techniques can now be used to obtain the desired training data.

rogue node · 2 years ago

sudo make me a sandwich

generatorman · 2 years ago

This is very interesting! It seems like you're essentially using IP2P as a low-FPS video generation model? Also, do you provide any test-time conditioning for how big a step IP2P should take? Or is that baked in by your finetuning data?
