Video wird geladen...
Video konnte nicht geladen werden
How can robots acquire fine-grained manipulation skills? Introducing ACT: Action Chunking with Transformers 🤖 Key idea: Imitation, but predict actions in chunks instead of one at a time. Here are results with only ~15min of demonstrations, running on low-cost arms:
237,038 Aufrufe • vor 3 Jahren •via X (Twitter)
11 Kommentare

In case you missed ALOHA 🏖, the hardware we use for all these experiments, here is the thread!

Fine manipulation is difficult: either from RL, Sim2Real, or Imitation. - Hard exploration and sparse reward - Large Sim2Real gap - Compounding error for BC - No large dataset We introduce three important design choices behind ACT, an efficient imitation learning method:

(1) Predict action sequence Standard BC predicts one action at a time, while a fine manipulation task can have >1000 steps easily. Predicting action in chunks slows down compounding error, and can better model non-stationary human behavior.

(2) Generative model policy The policy is trained as the decoder of a VAE, reconstructing action chunks from latent z, 4 RGB images, and proprioception. Intuitively, z extracts the “style” of the action chunk. This is crucial when learning from human demos.

(3) Transformer We modernize the VAE by using a BERT-like encoder and a DETR-like decoder, training end-to-end from scratch. This transformer architecture benefits more from chunking than ConvNets and non-parametric methods.

With all above, ACT obtains 64%, 96%, 84%, 92% success for 4 tasks shown, with objects randomized along the 15 cm line. It does not just memorize the training data, and is able to react to external disturbances:

It is also robust to a certain level of distractor objects:

Similar to ALOHA, we open source ACT together with 2 simulated environments for reproducibility. You can find it in the project website: We hope ALOHA+ACT would be a helpful resource towards advancing fine-grained manipulation!

Personally, this is a challenging project to work on, spanning from hardware to ML. It would certainly not be possible without my amazing advisor @chelseabfinn and collaboration from @svlevine @Vikashplus!

Here are some really cool related works you should also know about! Chopstick-holding cherry-picking robot from @xkelym, trained with RL in the real world. The motion is very reactive and precise!

Diffusion policy from @chichengcc: also uses a generative model for policy. Great for fitting multi-modal data and made large progress on the RoboMimic benchmark. Also very impressive real-world experiments!


![Dynamic manipulation steps up to the plate! This is a first look at a low-impedance platform designed to study how robots manipulate objects. In this demo, two robots play catch and practice batting, even teaming up with humans. The robots are capable of throwing 70mph [112 kph] and can catch and bat at short distances (23ft [7m]) :](https://image.24vids.com/tw-1991175062083281202/media/G6ITJeNXsAAzXNa.jpg)
