How can robots acquire fine-grained manipulation skills? Introducing ACT: Action Chunking with Transformers 🤖 Key idea: Imitation, but predict actions in chunks instead of one at a time. Here are results with only ~15 min of demonstrations, running on low-cost arms:

237,038 views • 3 years ago • via X (Twitter)

11 Comments

Tony Z. Zhao • 3 years ago

In case you missed ALOHA 🏖, the hardware we use for all these experiments, here is the thread!

Tony Z. Zhao • 3 years ago

Fine manipulation is difficult: either from RL, Sim2Real, or Imitation.
- Hard exploration and sparse reward
- Large Sim2Real gap
- Compounding error for BC
- No large dataset

We introduce three important design choices behind ACT, an efficient imitation learning method:

Tony Z. Zhao • 3 years ago

(1) Predict action sequence
Standard BC predicts one action at a time, while a fine manipulation task can easily have >1000 steps. Predicting actions in chunks slows down compounding error and can better model non-stationary human behavior.
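A minimal sketch of the chunking idea, not the ACT implementation: the policy is queried once per chunk and the predicted actions are executed in sequence, so a 1000-step task needs far fewer policy decisions. The names `rollout_chunked`, `toy_policy`, and `toy_step` are hypothetical, and the toy policy is a hand-written stand-in for a learned model.

```python
from typing import Callable, List, Sequence, Tuple

Observation = float
Action = float


def rollout_chunked(
    policy: Callable[[Observation, int], Sequence[Action]],
    step: Callable[[Observation, Action], Observation],
    obs: Observation,
    horizon: int,
    chunk_size: int,
) -> Tuple[List[Action], int]:
    """Run `horizon` environment steps, querying the policy once per chunk."""
    executed: List[Action] = []
    queries = 0
    while len(executed) < horizon:
        # One policy call yields the next `chunk_size` actions.
        chunk = policy(obs, chunk_size)
        queries += 1
        # Execute the chunk, truncating it if the horizon ends mid-chunk.
        for action in chunk[: horizon - len(executed)]:
            obs = step(obs, action)
            executed.append(action)
    return executed, queries


def toy_policy(obs: Observation, k: int) -> List[Action]:
    # Hypothetical stand-in for a learned model: move toward 0 in k equal steps.
    return [-obs / k] * k


def toy_step(obs: Observation, action: Action) -> Observation:
    # Hypothetical 1-D environment: the action is added to the state.
    return obs + action


# Same 1000-step task, with and without chunking.
actions_1, queries_1 = rollout_chunked(toy_policy, toy_step, 1.0, horizon=1000, chunk_size=1)
actions_100, queries_100 = rollout_chunked(toy_policy, toy_step, 1.0, horizon=1000, chunk_size=100)
print(queries_1, queries_100)
```

With a chunk size of 100, the 1000-step rollout involves 10 policy decisions instead of 1000. That is one way to read the claim that chunking slows compounding error: there are fewer points at which a prediction error can feed into the next prediction.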

Tony Z. Zhao • 3 years ago

(2) Generative model policy
The policy is trained as the decoder of a VAE, reconstructing action chunks from latent z, 4 RGB images, and proprioception. Intuitively, z extracts the “style” of the action chunk. This is crucial when learning from human demos.
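A rough sketch of what a VAE-style training objective for such a policy could look like, under stated assumptions: a reconstruction term between the predicted and demonstrated action chunk, plus a KL term that pulls the latent z toward a standard normal prior. The L1 reconstruction loss, the weight `beta`, and all function names here are assumptions for illustration and are not given in the thread. Action chunks are flattened to plain lists to keep the example dependency-free.

```python
import math
from typing import Sequence


def kl_to_standard_normal(mu: Sequence[float], logvar: Sequence[float]) -> float:
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ) for a diagonal Gaussian."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, logvar))


def l1_reconstruction(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute error between predicted and demonstrated actions (assumed loss)."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


def cvae_loss(
    pred_chunk: Sequence[float],
    demo_chunk: Sequence[float],
    mu: Sequence[float],
    logvar: Sequence[float],
    beta: float = 1.0,  # assumed KL weight, hypothetical default
) -> float:
    """Reconstruction term plus beta-weighted KL regularizer on the latent z."""
    return l1_reconstruction(pred_chunk, demo_chunk) + beta * kl_to_standard_normal(mu, logvar)


# Perfect reconstruction with a posterior equal to the prior gives zero loss.
perfect = cvae_loss([0.1, 0.2], [0.1, 0.2], mu=[0.0], logvar=[0.0])
# A reconstruction error and a shifted posterior mean both increase the loss.
worse = cvae_loss([0.1, 0.2], [0.3, 0.2], mu=[1.0], logvar=[0.0])
print(perfect, worse)
```

In this reading, the encoder that produces `mu` and `logvar` sees the demonstrated chunk, so z can absorb variation between demonstrations (the “style”) that the images and proprioception alone do not explain.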

Tony Z. Zhao • 3 years ago

(3) Transformer
We modernize the VAE by using a BERT-like encoder and a DETR-like decoder, training end-to-end from scratch. This transformer architecture benefits more from chunking than ConvNets and non-parametric methods.

Tony Z. Zhao • 3 years ago

With all of the above, ACT obtains 64%, 96%, 84%, and 92% success on the 4 tasks shown, with objects randomized along a 15 cm line. It does not just memorize the training data and is able to react to external disturbances:

Tony Z. Zhao • 3 years ago

It is also robust to a certain level of distractor objects:

Tony Z. Zhao • 3 years ago

Similar to ALOHA, we open source ACT together with 2 simulated environments for reproducibility. You can find it on the project website. We hope ALOHA+ACT will be a helpful resource towards advancing fine-grained manipulation!

Tony Z. Zhao • 3 years ago

Personally, this was a challenging project to work on, spanning from hardware to ML. It would certainly not have been possible without my amazing advisor @chelseabfinn and collaboration from @svlevine @Vikashplus!

Tony Z. Zhao • 3 years ago

Here are some really cool related works you should also know about! Chopstick-holding cherry-picking robot from @xkelym, trained with RL in the real world. The motion is very reactive and precise!

Tony Z. Zhao • 3 years ago

Diffusion policy from @chichengcc: also uses a generative model for policy. Great for fitting multi-modal data and made large progress on the RoboMimic benchmark. Also very impressive real-world experiments!
