How can robots acquire fine-grained manipulation skills?

Introducing ACT: Action Chunking with Transformers 🤖

Key idea: Imitation, but predict actions in chunks instead of one at a time.

Here are results with only ~15 min of demonstrations, running on low-cost arms:

Tony Zhao

71,483 subscribers

237,038 views • 3 years ago • via X (Twitter)

11 Comments

Tony Z. Zhao • 3 years ago

In case you missed ALOHA 🏖, the hardware we use for all these experiments, here is the thread!

Tony Z. Zhao • 3 years ago

Fine manipulation is difficult, whether approached through RL, Sim2Real, or imitation:
- Hard exploration and sparse reward
- Large Sim2Real gap
- Compounding error for behavioral cloning (BC)
- No large dataset

We introduce three important design choices behind ACT, an efficient imitation learning method:

Tony Z. Zhao • 3 years ago

(1) Predict action sequences
Standard BC predicts one action at a time, while a fine manipulation task can easily have more than 1000 steps. Predicting actions in chunks slows down compounding error and can better model non-stationary human behavior.
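
To make the chunking idea concrete, here is a minimal Python sketch of a control loop that queries the policy once per chunk rather than once per step. Everything in it is an illustrative assumption, not code from the ACT release: `rollout_chunked`, `toy_policy`, `toy_step`, the scalar state, and the chunk size of 100 are hypothetical stand-ins.

```python
from typing import Callable, List, Sequence, Tuple


def rollout_chunked(
    policy: Callable[[float, int], Sequence[float]],
    step: Callable[[float, float], float],
    obs: float,
    horizon: int,
    chunk_size: int,
) -> Tuple[List[float], int]:
    """Run `horizon` control steps, querying the policy once per chunk."""
    executed: List[float] = []
    queries = 0
    while len(executed) < horizon:
        # One policy call predicts the next `chunk_size` actions.
        chunk = policy(obs, chunk_size)
        queries += 1
        # Execute the chunk open-loop, truncating at the end of the episode.
        for action in chunk[: horizon - len(executed)]:
            obs = step(obs, action)
            executed.append(action)
    return executed, queries


def toy_policy(obs: float, k: int) -> List[float]:
    """Hypothetical policy: plan k actions that each move 10% closer to 0."""
    actions, predicted = [], obs
    for _ in range(k):
        action = -0.1 * predicted
        actions.append(action)
        predicted += action
    return actions


def toy_step(obs: float, action: float) -> float:
    """Hypothetical one-dimensional environment: the action is added to the state."""
    return obs + action


# Same 1000-step episode, with chunk size 1 (standard BC) versus chunk size 100.
single_actions, single_queries = rollout_chunked(toy_policy, toy_step, 1.0, 1000, 1)
chunk_actions, chunk_queries = rollout_chunked(toy_policy, toy_step, 1.0, 1000, 100)
print(single_queries, chunk_queries)
```

With a chunk size of 100, the 1000-step episode needs 10 policy decisions instead of 1000, so there are far fewer points at which a prediction error can feed back into the next observation.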

Tony Z. Zhao • 3 years ago

(2) Generative model policy
The policy is trained as the decoder of a VAE, reconstructing action chunks from a latent z, 4 RGB images, and proprioception. Intuitively, z captures the “style” of the action chunk. This is crucial when learning from human demos.
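
As a rough illustration of what training a policy as a VAE decoder involves, the sketch below combines a reconstruction term on the action chunk with a KL term on the latent z. The thread does not spell out the loss, so the L1 reconstruction, the `beta` weight, and all function names here are assumptions for illustration only.

```python
import math
import random
from typing import List, Sequence


def sample_latent(mu: Sequence[float], log_var: Sequence[float],
                  rng: random.Random) -> List[float]:
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu: Sequence[float], log_var: Sequence[float]) -> float:
    """Closed-form KL(N(mu, sigma^2) || N(0, I)), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def l1_reconstruction(pred_chunk: Sequence[Sequence[float]],
                      demo_chunk: Sequence[Sequence[float]]) -> float:
    """Mean absolute error over every action dimension in the chunk."""
    diffs = [abs(p - d)
             for pred_action, demo_action in zip(pred_chunk, demo_chunk)
             for p, d in zip(pred_action, demo_action)]
    return sum(diffs) / len(diffs)


def vae_loss(pred_chunk, demo_chunk, mu, log_var, beta: float) -> float:
    """Reconstruction term plus a beta-weighted KL regularizer on z."""
    return (l1_reconstruction(pred_chunk, demo_chunk)
            + beta * kl_to_standard_normal(mu, log_var))


# Toy numbers: a one-step chunk with two action dimensions and a 2-D latent.
rng = random.Random(0)
z = sample_latent([0.0, 0.0], [0.0, 0.0], rng)
loss = vae_loss([[1.0, 2.0]], [[1.5, 2.0]], [0.0, 0.0], [0.0, 0.0], beta=10.0)
print(len(z), loss)
```

In a real model the encoder would produce `mu` and `log_var` from the demonstration, and the decoder would map z, the images, and proprioception to `pred_chunk`; both networks are omitted here.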

Tony Z. Zhao • 3 years ago

(3) Transformer
We modernize the VAE with a BERT-like encoder and a DETR-like decoder, trained end-to-end from scratch. This transformer architecture benefits more from chunking than ConvNets and non-parametric methods.
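
One way to picture a DETR-like decoder is as a fixed set of queries, one per position in the action chunk, that all cross-attend to the encoder output in parallel. The sketch below is a toy single-head version in plain Python with random placeholder vectors. It omits learned projections, multiple heads, and feed-forward layers, and its dimensions and names are assumptions rather than ACT's actual configuration.

```python
import math
import random
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cross_attend(
    queries: Sequence[Sequence[float]],
    memory: Sequence[Sequence[float]],
) -> Tuple[List[List[float]], List[List[float]]]:
    """Single-head scaled dot-product cross-attention, no learned projections."""
    dim = len(memory[0])
    scale = math.sqrt(dim)
    outputs, weights = [], []
    for query in queries:
        # Each query scores every memory token, then takes a weighted average.
        row = softmax([dot(query, token) / scale for token in memory])
        weights.append(row)
        outputs.append([sum(w * token[d] for w, token in zip(row, memory))
                        for d in range(dim)])
    return outputs, weights


rng = random.Random(0)
dim, chunk_size, num_memory_tokens = 8, 4, 6  # hypothetical toy sizes
# Placeholder for encoder outputs (e.g. image features, proprioception, z).
memory = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_memory_tokens)]
# One query per action in the chunk; all positions are decoded in parallel.
queries = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(chunk_size)]
features, attn = cross_attend(queries, memory)
print(len(features), len(features[0]))
```

Each of the `chunk_size` output rows would then be mapped to one action by a final linear layer, which is why the whole chunk comes out of a single forward pass.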

Tony Z. Zhao • 3 years ago

With all of the above, ACT obtains 64%, 96%, 84%, and 92% success on the 4 tasks shown, with objects randomized along a 15 cm line. It does not just memorize the training data and is able to react to external disturbances:

Tony Z. Zhao • 3 years ago

It is also robust to a certain level of distractor objects:

Tony Z. Zhao • 3 years ago

Similar to ALOHA, we open-source ACT together with 2 simulated environments for reproducibility. You can find it on the project website:

We hope ALOHA+ACT will be a helpful resource for advancing fine-grained manipulation!

Tony Z. Zhao • 3 years ago

Personally, this was a challenging project to work on, spanning hardware to ML. It would certainly not have been possible without my amazing advisor @chelseabfinn and collaboration with @svlevine and @Vikashplus!

Tony Z. Zhao • 3 years ago

Here are some really cool related works you should also know about! Chopstick-holding cherry-picking robot from @xkelym, trained with RL in the real world. The motion is very reactive and precise!

Tony Z. Zhao • 3 years ago

Diffusion Policy from @chichengcc also uses a generative model as the policy. It is great at fitting multi-modal data and made large progress on the RoboMimic benchmark, with very impressive real-world experiments as well!
