Text-to-image generative models, meet robotics! We present ROSIE: Scaling RObot Learning with Semantically Imagined Experience, where we augment real robotics data with semantically imagined scenarios for downstream manipulation learning. Website: 🧵👇

Fei Xia

8,900 subscribers

196,378 views • 3 years ago • via X (Twitter)

Science & Technology Education

16 Comments

Fei Xia • 3 years ago

Collecting real-world robotics data is incredibly resource-intensive: for example, it took our fleet of 13 mobile manipulators 17 months to collect 130k manipulation episodes. Can we extend this data “for free” by augmenting it?
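As a rough back-of-the-envelope check on those figures, the sketch below works out the average collection rate. It assumes a uniform rate across robots and months, which the thread does not state; the helper name `extra_robot_months` is made up for illustration.

```python
# Figures quoted in the thread.
robots = 13
months = 17
episodes = 130_000

# Total collection effort, measured in robot-months.
robot_months = robots * months

# Average rate, ASSUMING collection was uniform (not stated in the thread).
episodes_per_robot_month = episodes / robot_months


def extra_robot_months(extra_episodes: int) -> float:
    """Robot-months needed to collect `extra_episodes` at the same average rate."""
    return extra_episodes / episodes_per_robot_month


print(robot_months)                      # 221 robot-months in total
print(round(episodes_per_robot_month))   # about 588 episodes per robot-month
```

At that rate, doubling the dataset with real collection alone would cost another 221 robot-months, which is the motivation for augmenting existing episodes instead.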

Fei Xia • 3 years ago

We present a system that automatically augments robot training data. All you need to tell it is a source task “place coke can into top drawer” and a target task “place coke can into cluttered top drawer”. The system outputs a few augmentation schemes, including masks and edits.

Fei Xia • 3 years ago

We use an open-vocabulary image segmentation model derived from OWL-ViT for mask generation, and Imagen-Editor for image inpainting. We then train an RT-1 policy on the mixed data.
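The mask-then-inpaint flow described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `Episode`, `augment_episode`, `toy_segment`, and `toy_inpaint` are hypothetical names, and the toy functions only stand in for the segmentation and inpainting models, whose real APIs are not shown in the thread. The key assumption illustrated is that only the frames are edited while the recorded actions are reused unchanged.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Image = List[List[int]]      # toy grayscale image: rows of pixel values
Mask = List[List[bool]]      # True where the image should be repainted
Action = Tuple[float, ...]   # recorded robot action, left untouched


@dataclass
class Episode:
    instruction: str
    frames: List[Image]
    actions: List[Action]


def augment_episode(
    episode: Episode,
    target_instruction: str,
    region_prompt: str,
    edit_prompt: str,
    segment: Callable[[Image, str], Mask],
    inpaint: Callable[[Image, Mask, str], Image],
) -> Episode:
    """Repaint the prompted region in every frame; keep the actions as recorded."""
    new_frames = []
    for frame in episode.frames:
        mask = segment(frame, region_prompt)                  # where to edit
        new_frames.append(inpaint(frame, mask, edit_prompt))  # what to paint there
    return Episode(target_instruction, new_frames, list(episode.actions))


# Toy stand-ins (hypothetical) so the sketch runs without any model.
def toy_segment(frame: Image, prompt: str) -> Mask:
    # Pretend the prompted region is every pixel brighter than 100.
    return [[px > 100 for px in row] for row in frame]


def toy_inpaint(frame: Image, mask: Mask, prompt: str) -> Image:
    # Pretend the edit paints every masked pixel with the value 0.
    return [[0 if m else px for px, m in zip(row, mrow)]
            for row, mrow in zip(frame, mask)]


source = Episode(
    instruction="place coke can into top drawer",
    frames=[[[10, 200], [30, 150]]],
    actions=[(0.0, 0.1, 0.0)],
)
augmented = augment_episode(
    source,
    target_instruction="place coke can into cluttered top drawer",
    region_prompt="top drawer",
    edit_prompt="a drawer cluttered with objects",
    segment=toy_segment,
    inpaint=toy_inpaint,
)
```

A training set for the policy would then combine `source`-style and `augmented`-style episodes; the mixing ratio is not given in the thread.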

Fei Xia • 3 years ago

A few interesting things we found along the way:

Fei Xia • 3 years ago

1) We can complete tasks **only seen through** diffusion models. For example, we augment “putting objects in drawer” tasks into “putting objects in sink”, by reimagining the drawer as a metal sink. The policy trained on the mixed data is able to put objects into the sink!

Fei Xia • 3 years ago

2) Generative data augmentation works for high-dimensional continuous action space and image frames. Our action space is the end-effector delta pose in 3D space. And the input is image frames. This is in contrast with other works in diffusion augmentation for perception.
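To make "delta pose" concrete, here is a small sketch of how per-step position deltas could be derived from a sequence of absolute end-effector positions. It covers only the translational part; orientation and any gripper command are omitted, and the function name `delta_positions` is hypothetical. The thread does not describe how the actions are actually computed or encoded.

```python
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def delta_positions(positions: List[Vec3]) -> List[Vec3]:
    """Per-step change in end-effector position (translation only)."""
    return [
        (q[0] - p[0], q[1] - p[1], q[2] - p[2])
        for p, q in zip(positions, positions[1:])
    ]


# Three absolute positions give two delta actions.
trajectory = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.1, 0.2, 0.0)]
deltas = delta_positions(trajectory)
```

Because these labels depend only on the robot's motion and not on pixel values, editing the image frames leaves them valid, which is what lets the augmented frames be paired with the original actions.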

Fei Xia • 3 years ago

Although our work doesn’t guarantee temporal consistency, the high-capability architecture (RT-1) is able to handle the flickering in the frames and still generalize to the real world. For example, here is the training data and real-world rollout for "picking up <color> cloth"

Fei Xia • 3 years ago

3) The augmentation is photorealistic and simulates rich visual nuances. Previously we have explored the knowledge and information encoded in vision-language foundation models in our work Socratic Models and Inner Monologue. This time we investigate the other side of the coin.

Fei Xia • 3 years ago

There is vast knowledge encoded in those diffusion models and, to our surprise, there are even signs of life that they understand some physics by modeling the image formation process: see how the generated cloth has folds within the gripper pinch. This warrants further investigation.

Fei Xia • 3 years ago

We think there are a few directions that could be further explored. It seems that diffusion models can act as a supplementary source of data to simulation. We know that we can do sim-to-real, maybe we can also do diffusion-to-real, or combine both?

Fei Xia • 3 years ago

Text guidance is very important here because we can synthesize data for specific scenarios / long tail distribution of low data regimes. Essentially we get image and text alignment for free, which is fundamentally different from previous augmentation methods.

Fei Xia • 3 years ago

Our current method is somewhat limited by temporal consistency. We aim to incorporate video diffusion models and masked transformers for better data augmentation. It's also possible to incorporate ControlNet to better harness those models.

Fei Xia • 3 years ago

Generative AI + robotics is a vibrant community. Shout out to our concurrent works GenAug and CACTI.

Fei Xia • 3 years ago

This work is led by @TianheYu, in collaboration with an amazing team: @xiao_ted, Austin Stone, @JonathanTompson, Anthony Brohan, Su Wang, @brian_ichter and @hausman_k 🙌🙌

Fei Xia • 3 years ago

Visit our website to learn more, and check out the interactive demo there. While we used Imagen-Editor for our work, the method is compatible with Stable Diffusion as well. We aim to open-source a specific version soon.

Related Videos

Robot foundation models are limited by costly real data, while simulation data is plentiful but visually mismatched to reality. We present Point Bridge, a method that enables zero-shot sim-to-real transfer for robot learning with minimal visual alignment.

Siddhant Haldar

19,818 views • 4 months ago

So I heard we need more data for robot learning :) Purely real world teleop is expensive and slow, making large scale data collection challenging. I’ve been excited about getting more data into robot learning, going beyond just real-world teleop data. To this end, we’ve been scaling up data generation with RL in realistic simulations generated on the fly from crowdsourced videos. Enables realistic data collection, much more cheaply than purely real world teleop. Importantly, data collection becomes even *cheaper* with more environments, allowing training with over 100x more data. Transfers to real robots for generalizable manipulation. A 🧵 (1/N)

Abhishek Gupta

13,345 views • 1 year ago

Everyone is scaling VLAs with more robot data. TiPToP shows another path. No robot training, no policy learning. Just RGB + language → 3D scene → GPU TAMP planner → trajectory. Foundation models + planning alone can run real manipulation tasks.

Robots Digest 🤖

10,362 views • 3 months ago

Introducing GEN-1. Our latest milestone in scaling robot learning. We believe it to be the first general-purpose AI model to master simple physical tasks. 99% success rates, 3x faster speeds, adapts in real time to unexpected scenarios, w/ only 1 hour of robot data. More🧵👇

Generalist

377,841 views • 2 months ago

Google is offering a Generative AI Learning Path with 10 courses for FREE! - Intro to Generative AI - Intro to LLMs - Intro to Image Generation - Encoder-Decoder Architecture - Transformer Models and more A Thread 🧵👇

Afiz ⚡️

249,552 views • 3 years ago

Generative models (diffusion/flow) are taking over robotics 🤖. But do we really need to model the full action distribution to control a robot? We suspected the success of Generative Control Policies (GCPs) might be "Much Ado About Noising." We rigorously tested the myths. 🧵👇

Chaoyi Pan

111,340 views • 6 months ago

Data is the core bottleneck for robotics, here is the solution: Introducing MicroAGI00: We open sourced 1M+ frames of egocentric human long horizon mobile manipulation data. Scaling to 300M by Dec '25 and 100B frames Q4 2026 dm if you want your robot to wipe your office 🧵

Bercan

39,646 views • 8 months ago

New Episode: Carolina Parada leads the robotics team at Google DeepMind. Carolina believes in a future with a broad, rich ecosystem of diverse robot types, where AI is smart enough to embody any robot. We discuss her journey, advancements in Gemini Robotics 1.5, cross-embodiment transfer, RL versus imitation, humanoids, world models, scaling laws, societal concerns, and more. 1:17 Introduction 1:38 Career journey and inspirations 3:50 Reaction to the 2022 ChatGPT launch 4:53 Google DeepMind's robotics mission 9:13 Key upgrades in Gemini Robotics 1.5 14:06 Agentic system's web usage 15:33 Robotics data-gathering methods 16:57 VLA vs. reasoning model training differences 18:22 Convergence of action and reasoning models 19:35 Cross-embodiment transfer 22:34 Learning directly from humans 24:20 Generalization challenges 27:01 Imitation versus reinforcement learning 28:51 Use of world models in robotics 30:31 Applications and testing of Gemini Robotics 1.5 33:04 Do humanoids deserve special focus? 35:06 Home humanoids timeline & limiting factors 37:19 Scaling laws in robotics versus self-driving 38:46 Value of learning classical robotics approaches 40:38 Ethical issues and societal impact

The Humanoid Hub

71,481 views • 8 months ago

Humanoid robotics is entering a new phase: Xiaomi Robotics 0 (XR-0) explores learning manipulation directly from large-scale human videos, aligning video understanding with robot embodiment to generate executable actions. Watch humans -> learn representations -> map to robot control. Robot data is scarce. Internet video is not. If video pretraining transfers reliably into embodied control, the scaling law of robotics changes. Deployment speed becomes a function of model alignment, not just hardware iteration. We’re moving from engineered skills to learned behavior. That’s a structural shift. project:

CyberRobo

18,356 views • 4 months ago

Sureform (YC X25) connects real-world workplaces with robotics labs to collect task-specific training data for robot foundation models. Congrats on the launch, Ananth Kashyap!

Y Combinator

35,876 views • 4 months ago

How to use simulation data for real-world robot manipulation? We present sim-and-real co-training, a simple recipe for manipulation. We demonstrate that sim data can significantly enhance real-world performance, even with notable differences between the sim and the real. (1/n)

Zhenyu Jiang

44,263 views • 1 year ago

Introducing 𝐃𝐫𝐞𝐚𝐦𝐆𝐞𝐧! We got humanoid robots to perform totally new 𝑣𝑒𝑟𝑏𝑠 in new environments through video world models. We believe video world models will solve the data problem in robotics. Bringing the paradigm of scaling human hours to GPU hours. Quick 🧵

Joel Jang

117,493 views • 1 year ago

We might be solving the wrong problem in robotics. That’s what this makes clear. UMI → Universal Manipulation Interface A simple $400 gripper that lets you teach robots by demonstration. You hold it like a tool. Show the task. The robot learns. No teleoperation. No expensive hardware. No robot-specific data. Stanford open-sourced everything → hardware, code, datasets. What stands out to me is the bottleneck. Not algorithms. Data. Teleoperation → ~35 demos/hour UMI → ~111 demos/hour And the data transfers across robots → UR5, Franka, others. The design is surprisingly practical: → GoPro fisheye lens (155° FOV) + mirrors for depth → SLAM + IMU for precise 6DoF tracking → latency matching for dynamic tasks → diffusion policies for multimodal actions Then it scales. Cheng Chi takes this further with Sunday Robotics (with Tony Zhao). A $200 glove → deployed in 500+ homes → ~10 million real-world interactions. Not lab data. Real human behavior. Their robot learns dishes, laundry, espresso → with zero robot-specific data. This is where the shift becomes obvious. From training robots in controlled environments → to learning directly from humans at scale So here’s the real question: Will robotics be unlocked by better models… or by unlocking data? #ArtificialIntelligence #Robotics #AI #Innovation #FutureOfWork

Pascal Bornet

185,867 views • 2 months ago

Today we are excited to open up Neuracore to the academic community! Neuracore is a new data foundation built to accelerate robot learning by removing one of the field’s biggest bottlenecks: capturing and working with high-fidelity multimodal robotics data. For the first time, researchers can store, view, and work with robotics data in a cloud-native system built specifically for large-scale learning, and we are making this core platform completely free for academia. The platform lets teams capture every sensor at its native rate, store and visualize data without loss, and then train and deploy models locally using our open-source code (Link in the comments). We are rolling out access to select academic institutions first. Anyone with an academic email can sign up, and if your institution is not part of the initial rollout, you will be able to join the waitlist directly. Beyond providing this infrastructure, we see an opportunity to build a global community where engineers and researchers can share, collaborate, and advance the frontier of robot learning together. Supported by our recent $3M pre-seed round led by Earlybird VC, we are excited to take this mission even further. Our long-term goal is for Neuracore to become the natural home for cutting-edge robot learning algorithms and real-world robotics experimentation, helping accelerate the next wave of Physical AI.

Neuracore

40,620 views • 7 months ago

Robots are the bottleneck in scaling robotics, and learning from human video promises to solve it. But how can chaotic human data ever measure up to sanitized, lab-made teleoperation data? Introducing Do as I Do: establishing a much needed correspondence between human videos and dexterous robot data. Some fun insights below: 🧵

Mahi Shafiullah 🏠🤖

89,537 views • 11 days ago

A big part of scaling robot learning to solve real-world problems is that we somehow need to get enough diverse, high-quality data to train our robots to perform useful things. GPT and its fellow large language models were bootstrapped and proved out on a massive dataset of real-world language data. Unfortunately, despite our best efforts, similarly massive datasets don’t really exist for robotics, so in our unending pursuit of high-quality, useful data, we turn to simulation. I compared a couple of recent works on sim-to-real robot manipulation, which discuss how to train perception-driven manipulation policies in simulation in such a way that they’re useful in the real world. - DextrAH-RGB, from NVIDIA - Sim-and-Real Co-Training: A Simple Recipe for Vision-Based Robotic Manipulation, also from NVIDIA (specifically the GEAR lab) - Sim-to-Real Reinforcement Learning for Vision-Based Dexterous Manipulation on Humanoids, another GEAR lab paper - Local Policies Enable Zero-shot Long-Horizon Manipulation, from CMU (video from DextrAH-RGB)

Chris Paxton

20,486 views • 1 year ago

In-hand object manipulation is a dexterity litmus test for robot hands. Our new system in Science Robotics can dynamically reorient many different objects in hand in the air. 📚Project website: 🧑‍💻Code: 🧵1/n

Tao Chen

30,666 views • 2 years ago

Robotics has changed dramatically over the last eight years. Ted Xiao has been involved in the cutting edge of robot learning through this period, spending those eight years at Google Brain/Google DeepMind. And he’s identified three eras of robot learning. These eras are: - The Era of Existence Proofs - trying different methods like QT-Opt, on-robot RL - The Era of Foundation Models - transitioning to data collection and clean objectives (i.e. supervised learning) - The Era of Scaling - orders of magnitude more data and larger models, enabling reasoning, long-horizon actions, and cross-embodiment transfer Watch Episode 78 of RoboPapers, with Michael Cho - Rbt/Acc and Jiafei Duan to learn more!

RoboPapers

36,520 views • 1 month ago

🤖What if a robot could perform a new task just from a natural language command, with zero demonstrations? Our new work, NovaFlow, makes it possible! We use pre-trained video generative model to create a video of the task, then translate it into a plan for real-world robot execution. 1/6 #Robotics #AI #ZeroShot #Manipulation

Hongyu Li

105,471 views • 8 months ago

Over the last few months, we’ve been thinking about how to learn from “off-domain” data - data from non-robot sources like video or simulation. These data sources are not quite good enough to learn policies (even monolithic VLA models) directly, but they still contain lots of information that can be useful for generalizable robot control. How can we develop robot learning models that are able to make use of this type of data for generalizable control? In new work, that we call HAMSTER, we show that VLMs can be useful for enabling robotic learning from off-domain data, but specifically when used through hierarchical VLA architectures. We show that this class of models can learn generalizable robot policies for the real world from large-scale, off-domain data. A 🧵 (1/10)

Abhishek Gupta

11,994 views • 1 year ago