
RoboPapers
@RoboPapers • 5,515 subscribers
@chris_j_paxton, @micoolcho & @DJiafei geeking out weekly with authors of robotics AI papers. On YouTube / X / Spotify / Substack
Videos

Robotics fundamentally involves understanding the dynamics of how things change in the world in response to action and force. This is impossible to learn from static images; instead, it’s far more effective and more data-efficient to learn from video. Elvis Nava joins us to talk about Mimic Robotics. One of the key findings from mimic-video is that pretraining on webscale video allows robots to learn physics priors; as a result, policies train faster, generalize better, and are capable of more impressive dexterity, versus training on static images or image-language pairs as per a VLM. Watch Episode #81 of RoboPapers, with Michael Cho - Rbt/Acc and Chris Paxton, to learn more!
RoboPapers • 46,020 views • 18 days ago

Spatial understanding is important to moving around in complex environments and is a huge part of the challenge of generalizing to new scenes. Most world models, however, largely ignore this spatial dimension, focusing on 2D images. Not PointWorld, though. PointWorld is a 3D world model trained from real and simulated data which can perform a wide variety of manipulation tasks on a real robot, including grasping or handling articulated objects, all without any additional fine-tuning. Wenlong Huang @ CVPR joins us to tell us more about what makes this work and how it’s different from other world models. Watch Episode #83 of RoboPapers, with Chris Paxton and Jiafei Duan, to learn more!
RoboPapers • 16,152 views • 8 days ago

Robotics has changed dramatically over the last eight years. Ted Xiao has been involved in the cutting edge of robot learning through this period, spending those eight years at Google Brain/Google DeepMind. And he’s identified three eras of robot learning. These eras are:
- The Era of Existence Proofs: trying different methods like QT-Opt, on-robot RL
- The Era of Foundation Models: transitioning to data collection and clean objectives (i.e. supervised learning)
- The Era of Scaling: orders of magnitude more data and larger models, enabling reasoning, long-horizon actions, and cross-embodiment transfer
Watch Episode 78 of RoboPapers, with Michael Cho - Rbt/Acc and Jiafei Duan@CVPR2026, to learn more!
RoboPapers • 36,360 views • 1 month ago

Training robot foundation models faces two key hurdles: how to get enough data to train an effective model, and how to make sure that new skills can be acquired quickly. The team at Rhoda AI believes that the answer is training Direct Video Action models from web data. Web data is plentiful, to the point where Rhoda can train their base model on hundreds of years of video data. And then, with the addition of robot data, they can quickly adapt it to new tasks with as little as 20 hours of in-domain data, performing complex, multi-step manipulation tasks with their purpose-built video foundation model. Tongzhou Mu 🤖🦾🦿, Eric Chan, and Changan Chen joined us to talk more about their approach. Watch Episode #79 of RoboPapers, with Michael Cho - Rbt/Acc, Chris Paxton, and Jiafei Duan, to learn more!
RoboPapers • 23,976 views • 1 month ago

Robots, unfortunately, tend to be expensive. And finding a robot that’s capable of performing a wide variety of mobile manipulation tasks, while also being affordable and “hackable”, is extremely difficult. Many different problems need to be addressed, from arm control to navigation to integrating your data collection strategy into hardware design. This can make it difficult for all but the most well-funded teams to “scale” real-world robotics research. Fortunately, the team behind Build Your Own Robot has a solution. Manan Anjaria, Mahi Shafiullah 🏠🤖, Jeff Cui, and Enes Erciyes joined us to talk about how they built a fully open-source mobile manipulator out of off-the-shelf parts, which has a humanlike range of motion and can perform a wide variety of tasks, all while costing only roughly $10,000 to build. Watch Episode 71 of RoboPapers, with Michael Cho - Rbt/Acc and Chris Paxton, today to learn more!
RoboPapers • 34,151 views • 2 months ago

Teaching robots to perform dexterous manipulation tasks currently requires teleoperation, which limits demonstration quality, speed, and scalability. Instead, why not use human videos? The problem is that a human hand isn’t a robot hand, so data must be retargeted using simulation to resolve issues like collisions and interpenetration when controlling the hand. In VideoManip, Hongyi Chen and co-authors built a system to solve this problem, taking in RGB videos of humans performing manipulation tasks and using them to create accurate simulations with which to learn robot policies. Watch episode #73 of RoboPapers, hosted by Michael Cho - Rbt/Acc and Chris Paxton, now to learn more!
RoboPapers • 27,249 views • 1 month ago

With enough data, robots and AI can learn “world models” that let them predict the results of their actions. These models are a way to learn how embodied AI agents can perform a wide variety of useful tasks, but they require a huge amount of data. The team at General Intuition has a solution: use data from video games! Games teach movement, problem solving, and complex spatial reasoning, and they come in a staggering diversity of forms, covering a wide variety of problems. What’s more, the captured data is high-quality, without the noise or annotation error that can come from … We sat down with Pim de Witte and Adam Jelley from the General Intuition team to learn more about their history, their plans, and their philosophy.
RoboPapers • 85,550 views • 6 months ago

Robotics has a data problem, in that robotics data is rare. While human video is quite common, it’s not usually directly usable for robots for a variety of reasons, most significantly that it’s missing explicit, accurate robot actions. Instead, Jeremy Collins proposes that we predict keypoint trajectories: basically, how any given point in an object will move as a robot performs a task. This lets us use action-free human video to train robot skills. Learn more by watching Episode #37 of RoboPapers with Michael Cho - Rbt/Acc and Chris Paxton now.
RoboPapers • 88,875 views • 7 months ago

How can we build a general-purpose “foundation model” for robot motion? Zhengyi “Zen” Luo joins us to talk about SONIC, which uses motion tracking as a foundational task for humanoid robot control, and scales humanoid control training to 9k GPU hours and 100 million frames’ worth of data. The result: a model with a generally useful embedding space that can be controlled by a VLA, or from human video, to perform a wide variety of humanoid whole-body-control tasks, including zero-shot transfer to previously unseen motions. Watch episode 72 of RoboPapers, with Michael Cho - Rbt/Acc and Jiafei Duan, now!
RoboPapers • 24,649 views • 1 month ago

Robots need to be able to apply pressure and make contact with objects as needed in order to accomplish their tasks. From compliance to working safely around humans to whole-body manipulation of heavy objects, combining force and position control can dramatically expand the capabilities of robots. This is especially true for legged robots, which have so much ability to exert forces on the world around them. But how do we train robots that can do this? Baoxiong Jia tells us more in our discussion of his team’s recent Best Paper Award-winning work on learning a unified policy for position and force control, called UniFP. To learn more, watch Episode #49 of RoboPapers, hosted by Michael Cho - Rbt/Acc and Chris Paxton.
RoboPapers • 44,774 views • 6 months ago

Benchmarking, evaluating, and developing robotics code is difficult, and part of this is because no simulator really reflects the diversity and scale of real environments. Enter MolmoSpaces from AI2: a massive open ecosystem with 230,000 handcrafted and procedurally-generated home environments, including 48,000 manipulable objects. Crucially, MolmoSpaces provides simulation environments which work for both navigation and manipulation. We talked to the team, Yejin Kim, Omar Rayyan, and Max Argus, to learn more. Watch Episode 69 of RoboPapers, with Michael Cho - Rbt/Acc and Jiafei Duan@CVPR2026, now!
RoboPapers • 20,934 views • 2 months ago

Co-training has become a key part of the recipe for training large robotics models; it means that you mix some proportion of real robot data with other data sources, like simulation or egocentric human video data. This is especially important because robotics data tends to lack diversity, which can be somewhat compensated for by the inclusion of these other modalities. And yet there has not been a sizable study on what constitutes good practice for co-training until now! We talk to Fanqi Lin and Jose Barreiros about their new work, a massive study which evaluated 89 policies over thousands of rollouts to tell us which forms of co-training were most useful for robotics. Watch episode 70 of RoboPapers, with Michael Cho and Chris Paxton, now!
RoboPapers • 19,152 views • 2 months ago

World models (action-conditioned predictive models of the environment) are an exciting area of research for robots that can be useful both for training and for test-time compute. But video-based world models waste a lot of predictive power on reconstructing pixels, which makes model and data requirements much higher and limits how far out into the future their predictions remain viable. Instead, what if we learned a purely semantic world model, one which predicts which properties will be true about the world after a sequence of actions, without reconstructing whole images? Jacob Berg tells us more. Watch Episode #53 of RoboPapers now, with Michael Cho - Rbt/Acc and Chris Paxton!
RoboPapers • 39,254 views • 5 months ago

Pretraining is essential for good performance on a wide variety of robotics tasks, and so most vision-language-action models build off of a vision language model (VLM) trained on a wide variety of image-language data. But how does the choice of VLM translate to downstream robotics performance? Jianke Zhang and @GYanjiang join us to talk about this key part of the robot policy, looking at a wide variety of different VLMs and how they perform. Interestingly, they see that performance on auxiliary tasks like question answering did not lead to downstream improvements in control. To learn more, watch episode 65 of RoboPapers now, with Chris Paxton and Jiafei Duan!
RoboPapers • 23,883 views • 3 months ago

For robots to be useful, they must be able to interact with a wide variety of environments; and yet, scaling interaction data is difficult, expensive, and time consuming. Instead, much research revolves around sim-to-real manipulation, but mostly this has not been mobile manipulation. Recently, though, this has begun to change. Two recent papers from Tairan He and Haoru Xue show us how to unlock the potential of this technique, building policies which, without any real data at all, can move objects around and open doors in the real world with a humanoid robot. Watch Episode #60 of RoboPapers now to learn more, hosted by Chris Paxton and Jiafei Duan. In this episode, we cover two papers. First is VIRAL: Visual Sim-to-Real at Scale for Humanoid Loco-Manipulation; and second is DoorMan: Opening the Sim-to-Real Door for Humanoid Pixel-to-Action Policy Transfer.
RoboPapers • 30,767 views • 4 months ago

Every home is different. That means that to build a useful home robot, we must be able to perform zero-shot generalization on a wide range of tasks. Humanoid company 1X has a solution: world models. 1X Director of Evaluations Daniel Ho joins us on RoboPapers to talk about:
- why world models are the future for scaling robot learning
- how to use world models for robot control
- what world models unlock for evaluating robot model performance
- how we can hill-climb from here to general purpose robots
Watch Episode #61 of RoboPapers, with Michael Cho - Rbt/Acc and Chris Paxton, now!
RoboPapers • 26,334 views • 4 months ago

Teleoperating a robot is hard. This means that when performing a robot task via teleoperation — say, to collect examples for training a robot policy — it’s almost unavoidably slower than you would like, below either the capabilities of the human expert on their own or the robot performing the task. Wouldn’t it be great if there was a way to fix this? Unfortunately, it’s harder than it looks. You can’t just execute faster, as this alters the distribution of environment states the policy will encounter. Nadun Ranawaka Arachchige and Zhenyang Chen propose Speed-Adaptive Imitation Learning (SAIL), which adds error-adaptive guidance, adapts execution speed according to task structure, predicts controller-invariant action targets to ensure robustness across execution speeds, and explicitly models delays from, for example, sensor latency. Watch episode #59 of RoboPapers, with Michael Cho - Rbt/Acc and Chris Paxton to learn more!
RoboPapers • 27,970 views • 4 months ago

The holy grail of robotics is to be able to perform previously-unseen, out-of-distribution manipulation tasks “zero shot” in a new environment. NovaFlow proposes an approach which (1) generates a video, (2) computes predicted flow — how points move through the scene — and (3) uses this flow as an objective to generate a motion. Using this procedure, NovaFlow generates motions in unseen scenes, for unseen tasks, and can transfer across embodiments. To learn more, we are joined by Hongyu Li and Jiahui Fu from RAI. Watch Episode #63 of RoboPapers with Chris Paxton and Michael Cho - Rbt/Acc now to learn more!
RoboPapers • 21,161 views • 3 months ago

Teaching robots from human video is an important part of overcoming the “data gap” in robotics, but many of the details still need to be worked out. Homanga Bharadhwaj tells us about two recent research papers, Gen2Act and SPIDER, which go over different aspects of the problem. Gen2Act uses generative video models to create a reference of how a task should be performed given a language prompt; then, it uses a multi-purpose policy that can “translate” from human video to robot motion. However, Gen2Act has its limitations, in particular when it comes to dexterous, contact-rich tasks. That’s where SPIDER comes in: it uses human data together with simulation to train policies across many different humanoid hands and datasets. Also of note is that this is our first episode with our new rotating co-host, Jiafei Duan. To learn more, watch Episode #57 of RoboPapers now, with Chris Paxton and Jiafei Duan!
RoboPapers • 24,286 views • 5 months ago