RoboPapers's banner

RoboPapers

@RoboPapers • 5,743 subscribers

@chris_j_paxton, @micoolcho & @DJiafei geeking out weekly with authors of robotics AI papers. On YouTube / X / Spotify / Substack

Videos

Anya Rossi

sweetdream.ai

SweetDream.ai•Sponsored•Livecam

Watch Anya Live

Anya is streaming live right now! Join her private show and enjoy exclusive content.

Exclusive private shows

1.2k viewers online

Private Show

Join now for exclusive access

Free preview available • Premium content

Robot policies must be both reliable and highly capable to be useful; the best way to achieve this level of performance is with reinforcement learning. However, for reinforcement learning you are usually stuck between two difficult options: reinforcement in the real world is often risky and expensive, while reinforcement learning in a traditional simulator takes a lot of engineering work and has a persistent sim-to-real gap. What if instead you could train your robot purely in a world model? RISE by Jiazhi Yang et al. uses a compositional world model to predict the future and evaluate progress. This allows for a self-improving pipeline, which learns a world model from real data and then learns how the robot should perform different tasks. This pipeline results in a data-driven way to improve policy performance from real data but without real-world reinforcement learning. Watch Episode #86 of RoboPapers, with Chris Paxton and Jiafei Duan, to learn more!

Robot policies must be both reliable and highly capable to be useful; the best way to achieve this level of performance is with reinforcement learning. However, for reinforcement learning you are usually stuck between two difficult options: reinforcement in the real world is often risky and expensive, while reinforcement learning in a traditional simulator takes a lot of engineering work and has a persistent sim-to-real gap. What if instead you could train your robot purely in a world model? RISE by Jiazhi Yang et al. uses a compositional world model to predict the future and evaluate progress. This allows for a self-improving pipeline, which learns a world model from real data and then learns how the robot should perform different tasks. This pipeline results in a data-driven way to improve policy performance from real data but without real-world reinforcement learning. Watch Episode #86 of RoboPapers, with Chris Paxton and Jiafei Duan, to learn more!

38,334 views • 1 month ago

Robotics fundamentally involves understanding the dynamics of how things change in the world in response to action and force. This is impossible to learn from static images; instead, it’s far more effective and more data-efficient to learn from video. Elvis Nava joins us to talk about Mimic Robotic. One of the key findings from mimic-video is that pretraining on webscale video allows robots to learn physics priors; as a result, policies train faster, generalize better, and are capable of more impressive dexterity, versus training on static images or image-language pairs as per a VLM. Watch Episode #81 of RoboPapers with Michael Cho - Rbt/Acc and Chris Paxton to learn more!

Robotics fundamentally involves understanding the dynamics of how things change in the world in response to action and force. This is impossible to learn from static images; instead, it’s far more effective and more data-efficient to learn from video. Elvis Nava joins us to talk about Mimic Robotic. One of the key findings from mimic-video is that pretraining on webscale video allows robots to learn physics priors; as a result, policies train faster, generalize better, and are capable of more impressive dexterity, versus training on static images or image-language pairs as per a VLM. Watch Episode #81 of RoboPapers with Michael Cho - Rbt/Acc and Chris Paxton to learn more!

46,190 views • 2 months ago

Collecting robot data at scale is key to deploying working manipulation policies, and the team from Tutor Intelligence Tutor is here to tell us about how to accomplish it. Their new announcement: a massive, 100-robot “data factory,” with a behind-the-scenes look at how to build a teleoperation platform and how to make robots and policies that are useful for their customers. Tutor Intelligence is a full-stack robotics company: they build robot arms, they sell robot arms, they write the software and they train neural networks. Josh Gruenstein, Jesse Michel, Shiraz, and Joe McCalmon, and Joe McCalmon join us to tell us more about how they scale both teleop data and human interventions from their teleoperators in order to train the policies they need. Watch Episode #85 of RoboPapers, with Chris Paxton and Jiafei Duan, to learn more!

Collecting robot data at scale is key to deploying working manipulation policies, and the team from Tutor Intelligence Tutor is here to tell us about how to accomplish it. Their new announcement: a massive, 100-robot “data factory,” with a behind-the-scenes look at how to build a teleoperation platform and how to make robots and policies that are useful for their customers. Tutor Intelligence is a full-stack robotics company: they build robot arms, they sell robot arms, they write the software and they train neural networks. Josh Gruenstein, Jesse Michel, Shiraz, and Joe McCalmon, and Joe McCalmon join us to tell us more about how they scale both teleop data and human interventions from their teleoperators in order to train the policies they need. Watch Episode #85 of RoboPapers, with Chris Paxton and Jiafei Duan, to learn more!

36,041 views • 1 month ago

3 of us Michael Cho - Rbt/Acc Chris Paxton Jiafei Duan are super excited to help organize the Robotic Origami Competition at IROS (Sept 2026), along with BitRobot 🦾 Sharpa Lightwheel Haoquan Fang Santiago Pravisani Noriaki Hirose Yang Gao Calling for teams!

3 of us Michael Cho - Rbt/Acc Chris Paxton Jiafei Duan are super excited to help organize the Robotic Origami Competition at IROS (Sept 2026), along with BitRobot 🦾 Sharpa Lightwheel Haoquan Fang Santiago Pravisani Noriaki Hirose Yang Gao Calling for teams!

37,145 views • 1 month ago

Robotics has changed dramatically over the last eight years. Ted Xiao has been involved in the cutting edge of robot learning through this period, spending those eight years at Google Brain/Google Deepmind. And he’s identified three eras of robot learning. These eras are: - The Era of Existence Proofs - trying different methods like QT-Opt, on-robot RL - The Era of Foundation Models - transitioning to data collection and clean objectives (i.e. supervised learning) - The Era of Scaling - orders of magnitude more data and larger models, enabling reasoning, long-horizon actions, and cross-embodiment transfer Watch Episode 78 of RoboPapers, with Michael Cho - Rbt/Acc and Jiafei Duan to learn more!

Robotics has changed dramatically over the last eight years. Ted Xiao has been involved in the cutting edge of robot learning through this period, spending those eight years at Google Brain/Google Deepmind. And he’s identified three eras of robot learning. These eras are: - The Era of Existence Proofs - trying different methods like QT-Opt, on-robot RL - The Era of Foundation Models - transitioning to data collection and clean objectives (i.e. supervised learning) - The Era of Scaling - orders of magnitude more data and larger models, enabling reasoning, long-horizon actions, and cross-embodiment transfer Watch Episode 78 of RoboPapers, with Michael Cho - Rbt/Acc and Jiafei Duan to learn more!

36,520 views • 2 months ago

With enough data, robots and AI can learn “world models” that let them predict the results of their actions. These models are a way to learn how embodied AI agents can perform a wide variety of useful tasks — but they require a huge amount of data. The team at General Intuition General Intuition has a solution: use data from video games! Games teach movement, problem solving, and complex spatial reasoning, and they come in a staggering diversity of forms, covering a wide variety of problems. What’s more, the captured data is high-quality, without the noise or annotation error that can come from We sat down with Pim de Witte and Adam Jelley from the General Intuition team to learn more about their history, their plans, and their philosophy.

With enough data, robots and AI can learn “world models” that let them predict the results of their actions. These models are a way to learn how embodied AI agents can perform a wide variety of useful tasks — but they require a huge amount of data. The team at General Intuition General Intuition has a solution: use data from video games! Games teach movement, problem solving, and complex spatial reasoning, and they come in a staggering diversity of forms, covering a wide variety of problems. What’s more, the captured data is high-quality, without the noise or annotation error that can come from We sat down with Pim de Witte and Adam Jelley from the General Intuition team to learn more about their history, their plans, and their philosophy.

85,927 views • 8 months ago

Sports like tennis are great examples of the sort of dynamic whole-body interaction that’s possible with humanoid robots. But capturing examples of fast, dynamic interactions from humans is really difficult. Enter LATENT, which uses lower-quality human data plus reinforcement learning to teach a robot to play tennis, able to complete back-and-forth volleys at a human level. LATENT has three steps: (1) collecting imperfect human data like a backswing, (2) using these to learn a latent action space, and (3) they train a high-level policy in simulation which can compose these actions and execute tennis skills on a robot. Haofei Lu and Yunrui Lian join us to tell us about their method. Watch Episode #80 of RoboPapers, with Chris Paxton and Jiafei Duan, now to learn more!

Sports like tennis are great examples of the sort of dynamic whole-body interaction that’s possible with humanoid robots. But capturing examples of fast, dynamic interactions from humans is really difficult. Enter LATENT, which uses lower-quality human data plus reinforcement learning to teach a robot to play tennis, able to complete back-and-forth volleys at a human level. LATENT has three steps: (1) collecting imperfect human data like a backswing, (2) using these to learn a latent action space, and (3) they train a high-level policy in simulation which can compose these actions and execute tennis skills on a robot. Haofei Lu and Yunrui Lian join us to tell us about their method. Watch Episode #80 of RoboPapers, with Chris Paxton and Jiafei Duan, now to learn more!

29,923 views • 2 months ago

Robots has a data problem, in that robotics data is rare. While human video is quite common, it’s not usually directly usable for robots for a variety of reasons, most significantly that it’s missing explicit, accurate robot actions. Instead, Jeremy Collins proposes that we predict keypoint trajectories — basically, how any given point in an object will move as a robot performs a task. This lets us use action-free human video to train robot skills. Learn more by watching Episode #37 of RoboPapers with Michael Cho - Rbt/Acc and Chris Paxton now.

Robots has a data problem, in that robotics data is rare. While human video is quite common, it’s not usually directly usable for robots for a variety of reasons, most significantly that it’s missing explicit, accurate robot actions. Instead, Jeremy Collins proposes that we predict keypoint trajectories — basically, how any given point in an object will move as a robot performs a task. This lets us use action-free human video to train robot skills. Learn more by watching Episode #37 of RoboPapers with Michael Cho - Rbt/Acc and Chris Paxton now.

89,011 views • 9 months ago

Robots, unfortunately, tend to be expensive. And finding a robot that’s both capable of performing a wide variety of mobile manipulation tasks, and is affordable and “hackable”, is extremely difficult. Many different problems need to be addressed, from arm control to navigation to integrating your data collection strategy into hardware design. This can make it difficult for all but the most well-funded teams to “scale” real-world robotics research. Fortunately, the team behind Build Your Own Robot has a solution. Manan Anjaria, Mahi Shafiullah 🏠🤖,Jeff Cui, and Enes Erciyes joined us to talk about how they build a fully open-source mobile manipulator out of off-the-shelf parts, which has humanlike range of motion, and can perform a wide variety of tasks, all while being only roughly $10,000 to build. Watch Episode 71 of RoboPapers, with Michael Cho - Rbt/Acc and Chris Paxton, today to learn more!

Robots, unfortunately, tend to be expensive. And finding a robot that’s both capable of performing a wide variety of mobile manipulation tasks, and is affordable and “hackable”, is extremely difficult. Many different problems need to be addressed, from arm control to navigation to integrating your data collection strategy into hardware design. This can make it difficult for all but the most well-funded teams to “scale” real-world robotics research. Fortunately, the team behind Build Your Own Robot has a solution. Manan Anjaria, Mahi Shafiullah 🏠🤖,Jeff Cui, and Enes Erciyes joined us to talk about how they build a fully open-source mobile manipulator out of off-the-shelf parts, which has humanlike range of motion, and can perform a wide variety of tasks, all while being only roughly $10,000 to build. Watch Episode 71 of RoboPapers, with Michael Cho - Rbt/Acc and Chris Paxton, today to learn more!

34,151 views • 3 months ago

Training robot foundation models faces two key hurdles: how to get enough data to train an effective model, and how to make sure that new skills can be acquired quickly. The team at Rhoda AI believes that the answer is training Direct Video Action models from web data. Web data is plentiful, to the point where Rhoda can train their base model on hundreds of years of video data. And then, with the addition of robot data, they can quickly adapt it to new tasks with as little as 20 hours of in-domain data, performing complex, multi-step manipulation tasks with their purpose-built video foundation model. Tongzhou Mu 🤖🦾🦿 Eric Chan and Changan Chen joined us to talk more about their approach. Watch Episode #79 of RoboPapers, with Michael Cho - Rbt/Acc, Chris Paxton, and Jiafei Duan, to learn more!

Training robot foundation models faces two key hurdles: how to get enough data to train an effective model, and how to make sure that new skills can be acquired quickly. The team at Rhoda AI believes that the answer is training Direct Video Action models from web data. Web data is plentiful, to the point where Rhoda can train their base model on hundreds of years of video data. And then, with the addition of robot data, they can quickly adapt it to new tasks with as little as 20 hours of in-domain data, performing complex, multi-step manipulation tasks with their purpose-built video foundation model. Tongzhou Mu 🤖🦾🦿 Eric Chan and Changan Chen joined us to talk more about their approach. Watch Episode #79 of RoboPapers, with Michael Cho - Rbt/Acc, Chris Paxton, and Jiafei Duan, to learn more!

24,475 views • 2 months ago

Teaching robots to perform dexterous manipulation tasks currently requires teleoperation, which limits demonstration quality, speed, and scalability. Instead, why not use human videos? The problem is that a human hand isn’t a robot hand, so data must be retargeted using simulation to resolve issues like collisions and interpenetration when controlling the hand. In VideoManip, Hongyi Chen and co-authors built a system to solve this problem, taking in RGB videos of humans performing manipulation tasks and using them to create accurate simulations with which to learn robot policies. Watch episode #73 of RoboPapers, hosted by Michael Cho - Rbt/Acc and Chris Paxton, now to learn more!

Teaching robots to perform dexterous manipulation tasks currently requires teleoperation, which limits demonstration quality, speed, and scalability. Instead, why not use human videos? The problem is that a human hand isn’t a robot hand, so data must be retargeted using simulation to resolve issues like collisions and interpenetration when controlling the hand. In VideoManip, Hongyi Chen and co-authors built a system to solve this problem, taking in RGB videos of humans performing manipulation tasks and using them to create accurate simulations with which to learn robot policies. Watch episode #73 of RoboPapers, hosted by Michael Cho - Rbt/Acc and Chris Paxton, now to learn more!

27,249 views • 3 months ago

Full episode dropping soon! Geeking out with Jiazhi Yang on RISE: Self-Improving Robot Policy with Compositional World Model Co-hosted by Chris Paxton Jiafei Duan

Full episode dropping soon! Geeking out with Jiazhi Yang on RISE: Self-Improving Robot Policy with Compositional World Model Co-hosted by Chris Paxton Jiafei Duan

14,029 views • 1 month ago

Spatial understanding is important to moving around in complex environments and is a huge part of the challenge of generalizing to new scenes. Most world models, however, largely ignore this spatial dimension, focusing on 2D images. Not PointWorld, though. PointWorld is a 3D world model trained from real and simulated data which can perform a wide variety of manipulation tasks on a real robot, including grasping or handling articulated objects, all without any additional fine tuning. Wenlong Huang joins us to tell us more about what makes this work and how it’s different from other world models. Watch Episode #83 of RoboPapers, with Chris Paxton and Jiafei Duan, to learn more!

Spatial understanding is important to moving around in complex environments and is a huge part of the challenge of generalizing to new scenes. Most world models, however, largely ignore this spatial dimension, focusing on 2D images. Not PointWorld, though. PointWorld is a 3D world model trained from real and simulated data which can perform a wide variety of manipulation tasks on a real robot, including grasping or handling articulated objects, all without any additional fine tuning. Wenlong Huang joins us to tell us more about what makes this work and how it’s different from other world models. Watch Episode #83 of RoboPapers, with Chris Paxton and Jiafei Duan, to learn more!

16,197 views • 1 month ago

How can we build a general-purpose “foundation model” for robot motion? Zhengyi “Zen” Luo joins us to talk about SONIC, which uses motion tracking as a foundational task for humanoid robot control, and scales humanoid control training to 9k GPU hours and 100 million frames worth of data. The result: a model with a generally-useful embedding space that can be controlled by a VLA, or from human video, to perform a wide variety of humanoid whole-body-control tasks, including with zero-shot transfer to previously unseen motions. Watch episode 72 of RoboPapers, with Michael Cho - Rbt/Acc and Jiafei Duan, now!

How can we build a general-purpose “foundation model” for robot motion? Zhengyi “Zen” Luo joins us to talk about SONIC, which uses motion tracking as a foundational task for humanoid robot control, and scales humanoid control training to 9k GPU hours and 100 million frames worth of data. The result: a model with a generally-useful embedding space that can be controlled by a VLA, or from human video, to perform a wide variety of humanoid whole-body-control tasks, including with zero-shot transfer to previously unseen motions. Watch episode 72 of RoboPapers, with Michael Cho - Rbt/Acc and Jiafei Duan, now!

24,802 views • 3 months ago

Learning robust, general-purpose reward functions for robotics unlocks many potential applications, like on-robot reinforcement learning or dataset validation. However, there’s a question of how to actually train such reward functions. Training success/failure prediction leads to ambiguous signals partway through a demonstration — it’s hard to measure progress — making the method unsuitable for reinforcement learning, among other things. Predicting progress, on the other hand, does not give a good way of using failure data. So why not do both? Robometer combines both progress and preference supervision, resulting in a stable, scalable, and highly general reward learning approach. Anthony Liang Yigit Korkmaz and Jesse Zhang join us to tell us more. Watch Episode #84 of RoboPapers, with Chris Paxton and Jiafei Duan today!

Learning robust, general-purpose reward functions for robotics unlocks many potential applications, like on-robot reinforcement learning or dataset validation. However, there’s a question of how to actually train such reward functions. Training success/failure prediction leads to ambiguous signals partway through a demonstration — it’s hard to measure progress — making the method unsuitable for reinforcement learning, among other things. Predicting progress, on the other hand, does not give a good way of using failure data. So why not do both? Robometer combines both progress and preference supervision, resulting in a stable, scalable, and highly general reward learning approach. Anthony Liang Yigit Korkmaz and Jesse Zhang join us to tell us more. Watch Episode #84 of RoboPapers, with Chris Paxton and Jiafei Duan today!

14,535 views • 1 month ago

Robots need to be able to apply pressure and make contact with objects as needed in order to accomplish their tasks. From compliance to working safely around humans to whole-body manipulation of heavy objects, combining force and position control can dramatically expand the capabilities of robots. This is especially true for legged robots, which have so much ability to exert forces on the world around them. But how do we train robots which can do this? Baoxiong Jia tells us more in our discussion of his team’s recent, Best Paper Award winning work on learning a unified policy for position and force control, called UniFP. To learn more, watch Episode #49 of RoboPapers, hosted by Michael Cho - Rbt/Acc and Chris Paxton.

Robots need to be able to apply pressure and make contact with objects as needed in order to accomplish their tasks. From compliance to working safely around humans to whole-body manipulation of heavy objects, combining force and position control can dramatically expand the capabilities of robots. This is especially true for legged robots, which have so much ability to exert forces on the world around them. But how do we train robots which can do this? Baoxiong Jia tells us more in our discussion of his team’s recent, Best Paper Award winning work on learning a unified policy for position and force control, called UniFP. To learn more, watch Episode #49 of RoboPapers, hosted by Michael Cho - Rbt/Acc and Chris Paxton.

44,803 views • 7 months ago

World models — action-conditioned predictive models of the environment — are an exciting are of research for robots that can be useful both for training and for test-time compute. But video-based world models waste a lot of predictive power on reconstructing pixels, which makes model and data requirements much higher and limits how far out into the future their predictions remain viable. Instead, what if we learned a purely semantic world model, one which predicts which properties will be true about the world after a sequence of actions, without reconstructing the whole images? Jacob Berg tells us more. Watch Episode #53 of RoboPapers now, with Michael Cho - Rbt/Acc and Chris Paxton!

World models — action-conditioned predictive models of the environment — are an exciting are of research for robots that can be useful both for training and for test-time compute. But video-based world models waste a lot of predictive power on reconstructing pixels, which makes model and data requirements much higher and limits how far out into the future their predictions remain viable. Instead, what if we learned a purely semantic world model, one which predicts which properties will be true about the world after a sequence of actions, without reconstructing the whole images? Jacob Berg tells us more. Watch Episode #53 of RoboPapers now, with Michael Cho - Rbt/Acc and Chris Paxton!

39,322 views • 7 months ago

For robots to be useful, they must be able to interact with a wide variety of environments; and yet, scaling interaction data is difficult, expensive, and time consuming. Instead, much research revolves around sim-to-real manipulation — but mostly this has not been mobile manipulation. Recently, though, this has begun to change. Two recent papers from Tairan He and Haoru Xue show us how to unlock the potential of this technique, building policies which, without any real data at all, can move objects around in the world and open doors in the real world with a humanoid robot. Watch Episode #60 of RoboPapers now to learn more, hosted by Chris Paxton and Jiafei Duan. In this episode, we cover two papers:. First is VIRAL: Visual Sim-to-Real at Scale for Humanoid Loco-Manipulation; and second is DoorMan: Opening the Sim-to-Real Door for Humanoid Pixel-to-Action Policy Transfer.

For robots to be useful, they must be able to interact with a wide variety of environments; and yet, scaling interaction data is difficult, expensive, and time consuming. Instead, much research revolves around sim-to-real manipulation — but mostly this has not been mobile manipulation. Recently, though, this has begun to change. Two recent papers from Tairan He and Haoru Xue show us how to unlock the potential of this technique, building policies which, without any real data at all, can move objects around in the world and open doors in the real world with a humanoid robot. Watch Episode #60 of RoboPapers now to learn more, hosted by Chris Paxton and Jiafei Duan. In this episode, we cover two papers:. First is VIRAL: Visual Sim-to-Real at Scale for Humanoid Loco-Manipulation; and second is DoorMan: Opening the Sim-to-Real Door for Humanoid Pixel-to-Action Policy Transfer.

30,767 views • 5 months ago

Pretraining is essential for good performance on a wide variety of robotics tasks, and so most vision-language-action models build off of a vision language model (VLM) trained on a wide variety of image-language data. But how does the choice of VLM translate to downstream robotics performance? Jianke Zhang and @GYanjiang join us to talk about this key part of the robot policy, looking at a wide variety of different VLMs and how they perform. Interestingly, they see that performance on auxiliary tasks like quesiton answering did not lead to downstream improvements in control. To learn more, watch episode 65 of RoboPapers now, with Chris Paxton and Jiafei Duan!

Pretraining is essential for good performance on a wide variety of robotics tasks, and so most vision-language-action models build off of a vision language model (VLM) trained on a wide variety of image-language data. But how does the choice of VLM translate to downstream robotics performance? Jianke Zhang and @GYanjiang join us to talk about this key part of the robot policy, looking at a wide variety of different VLMs and how they perform. Interestingly, they see that performance on auxiliary tasks like quesiton answering did not lead to downstream improvements in control. To learn more, watch episode 65 of RoboPapers now, with Chris Paxton and Jiafei Duan!

23,905 views • 4 months ago

Every home is different. That means that to build a useful home robot, we must be able to perform zero-shot generalization on a wide range of tasks. Humanoid company 1X has a solution: world models. 1X Director of Evaluations Daniel Ho joins us on RoboPapers to talk about: - why world models are the future for scaling robot learning - how to use world models for robot control - what world models unlock for evaluating robot model performance - how we can hill-climb from here to general purpose robots Watch Episode #61 of RoboPapers, with Michael Cho - Rbt/Acc and Chris Paxton, now!

Every home is different. That means that to build a useful home robot, we must be able to perform zero-shot generalization on a wide range of tasks. Humanoid company 1X has a solution: world models. 1X Director of Evaluations Daniel Ho joins us on RoboPapers to talk about: - why world models are the future for scaling robot learning - how to use world models for robot control - what world models unlock for evaluating robot model performance - how we can hill-climb from here to general purpose robots Watch Episode #61 of RoboPapers, with Michael Cho - Rbt/Acc and Chris Paxton, now!

27,567 views • 5 months ago