How can robots learn generalizable manipulation skills for diverse objects? Going beyond pick-and-place, our recent work “HACMan” enables complex interactions for unseen objects, such as flipping, pushing, or tilting, using spatial action maps + RL with point clouds. (w/ @MetaAI)

Wenxuan Zhou

3,109 subscribers

49,857 views • 3 years ago • via X (Twitter)

Science & Technology

10 Comments

Wenxuan Zhou • 3 years ago

We find that defining the right action space is crucial for learning a manipulation task. We explore an object-centric action representation in RL that consists of selecting a contact location on the object and a set of parameters describing the robot's movement after contact.
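To make this action representation concrete, here is a minimal Python sketch. The `ContactAction` class, the `motion_waypoints` helper, and the choice of a 3D displacement vector for the post-contact motion parameters are assumptions made for illustration; the thread does not specify them.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class ContactAction:
    """Hypothetical container for one object-centric action."""
    contact_index: int  # index of the chosen point in the object point cloud
    motion: Vec3        # assumed: end-effector displacement applied after contact


def motion_waypoints(points: Sequence[Vec3], action: ContactAction) -> Tuple[Vec3, Vec3]:
    """Return (contact location, end-effector target after the motion)."""
    if not 0 <= action.contact_index < len(points):
        raise IndexError("contact_index must refer to an observed object point")
    start = points[action.contact_index]
    end = tuple(s + m for s, m in zip(start, action.motion))
    return start, end


# Usage: contact the second observed point, then push 2 cm along +y.
cloud = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.05)]
start, end = motion_waypoints(cloud, ContactAction(contact_index=1, motion=(0.0, 0.02, 0.0)))
```

Because the contact location is an index into the observed points, the action can only ever touch the object where it was actually seen.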

Wenxuan Zhou • 3 years ago

Our object-centric action representation has two benefits. It is:
1. Spatially grounded, because the learned contact location is selected from the observed object points.
2. Temporally abstracted, because we focus only on learning the contact-rich portions of the action.

Wenxuan Zhou • 3 years ago

With off-policy RL, given a point cloud, the actor outputs per-point motion parameters (Actor Map) while the critic outputs per-point Q-values (Critic Map). The Critic Map is not only used to update the actor but also serves as the scores for selecting the contact location.
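A rough sketch of how per-point scores could drive contact selection is shown below. It is not the authors' implementation: `select_contact` is a hypothetical helper, the maps are plain Python lists rather than network outputs, and the softmax sampling used when `temperature > 0` is an assumed exploration scheme that the thread does not describe.

```python
import math
import random
from typing import Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def select_contact(
    points: Sequence[Vec3],
    actor_map: Sequence[Vec3],     # per-point motion parameters
    critic_map: Sequence[float],   # per-point Q-values
    temperature: float = 0.0,      # assumption: 0 means greedy, >0 means softmax sampling
    rng: Optional[random.Random] = None,
) -> Tuple[int, Vec3, Vec3]:
    """Pick a contact point from the Critic Map and return its motion parameters."""
    n = len(points)
    if n == 0 or not (n == len(actor_map) == len(critic_map)):
        raise ValueError("points, actor_map and critic_map must be non-empty and equal length")
    if temperature <= 0.0:
        # Greedy: take the point with the highest Q-value.
        idx = max(range(n), key=lambda i: critic_map[i])
    else:
        # Softmax over Q-values, shifted by the max for numerical stability.
        q_max = max(critic_map)
        weights = [math.exp((q - q_max) / temperature) for q in critic_map]
        idx = (rng or random).choices(range(n), weights=weights)[0]
    return idx, points[idx], actor_map[idx]


# Usage with a toy three-point cloud.
pts = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0)]
motions = [(0.01, 0.0, 0.0), (0.0, 0.02, 0.0), (0.0, 0.0, 0.03)]
q_values = [0.1, 0.9, 0.3]
idx, contact, motion = select_contact(pts, motions, q_values)
```

The same Q-values that train the actor are reused as the selection scores, so no separate contact-location head is needed in this sketch.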

Wenxuan Zhou • 3 years ago

We evaluate our method on a 6D object pose alignment task with randomized initial poses, randomized 6D goals, and diverse unseen objects, both in simulation and in the real world.
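The thread does not say how alignment with a 6D goal is scored. One common way to measure it, sketched here purely as an assumption, is to combine a translation distance with the rotation angle between unit quaternions. The function names and both thresholds are hypothetical.

```python
import math
from typing import Sequence


def translation_error(pos: Sequence[float], goal_pos: Sequence[float]) -> float:
    """Euclidean distance between the object position and the goal position."""
    return math.dist(pos, goal_pos)


def rotation_error(quat: Sequence[float], goal_quat: Sequence[float]) -> float:
    """Angle in radians between two unit quaternions given as (w, x, y, z)."""
    # abs() treats q and -q as the same rotation; min() guards against rounding above 1.
    dot = abs(sum(a * b for a, b in zip(quat, goal_quat)))
    return 2.0 * math.acos(min(1.0, dot))


def is_aligned(
    pos: Sequence[float],
    quat: Sequence[float],
    goal_pos: Sequence[float],
    goal_quat: Sequence[float],
    max_translation: float = 0.03,            # hypothetical threshold, in metres
    max_rotation: float = math.radians(10.0),  # hypothetical threshold, in radians
) -> bool:
    """True when both the translation and rotation errors are within the thresholds."""
    return (
        translation_error(pos, goal_pos) <= max_translation
        and rotation_error(quat, goal_quat) <= max_rotation
    )


# Usage: same orientation, 1 cm offset, then a 90-degree yaw error.
identity = (1.0, 0.0, 0.0, 0.0)
yaw_90 = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
close_enough = is_aligned((0.0, 0.0, 0.0), identity, (0.01, 0.0, 0.0), identity)
rotated = is_aligned((0.0, 0.0, 0.0), yaw_90, (0.0, 0.0, 0.0), identity)
```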

Wenxuan Zhou • 3 years ago

HACMan outperforms the baselines, with a larger margin on more challenging tasks. Success rates for simple tasks (pushing a single object to an in-plane goal) are high for all methods, but only HACMan achieves high success rates for 6D alignment of diverse objects.

Wenxuan Zhou • 3 years ago

Check out the paper and the website for more information and video results showing HACMan generalizing to different objects and goals! w/ @bwww08, Fan Yang, @chris_j_paxton, @davheld

Brett Adcock • 3 years ago

@MetaAI Congrats, thanks for sharing.

Arnav Wadhwa • 3 years ago

@MetaAI Amazing work! I’m wondering about the tradeoff between challenges and improvements when using a human-hand-like end effector with 5 fingers. Curious to know what you think.

Wenxuan Zhou • 3 years ago

@MetaAI Multi-fingered hands may allow a wider variety of motions and have more tolerance (picking an object with a multi-fingered hand can be less sensitive to object shapes than a simple gripper). However, they are more expensive, easier to break, and have a bigger sim2real gap.

Sasha Salter • 2 years ago

@MetaAI Great use of temporal abstraction to simplify learning!

Related Videos

🚀🚀🚀 Ever wondered what it takes for robots to handle real-world household tasks? Long-horizon execution, deformable object dexterity, and unseen object generalization: meet GR-3, ByteDance Seed’s new Vision-Language-Action (VLA) model! GR-3 is a generalizable VLA model with strong capabilities in complex long-horizon tasks. It understands unseen abstract concepts, manipulates deformable objects robustly, and adapts to novel settings with minimal human data. ✨ Generalization: Generalizes well to unseen objects, environments, and even instructions with abstract concepts. ✨ Long-Horizon Manipulation: Completes long-horizon tasks with strong instruction-following capabilities. ✨ Deformable Object Manipulation: Manipulates deformable objects robustly. Project Page: Arxiv: #ByteDance #ByteDanceSeed #GR3 #VLA #Robotics #FoundationModels

Xiao Ma

46,323 views • 1 year ago

In just ~3 months, as a solo founder with no prior robotics experience, General Trajectory trained a foundation model for dexterous manipulation that lets humanoid robots pick up unseen objects and perform real-world work. It generalizes to novel objects and scenes, including cases where prior SoTA models achieve 0% success. Congrats on the launch Joshua!

Y Combinator

68,605 views • 6 months ago

Drop the camera, and bang -- the robots do the job without camera calibration or additional data collection. We want robots to manipulate anywhere! Our new work, Maniwhere, enables robots to manipulate objects from any camera view with any background. Isn't this the generalization every roboticist is looking for? How do we achieve this with sim2real? You can check this thread from Zhecheng Yuan! And the website is here:

Huazhe Harry Xu

27,397 views • 2 years ago

Robots need to be able to apply pressure and make contact with objects as needed in order to accomplish their tasks. From compliance to working safely around humans to whole-body manipulation of heavy objects, combining force and position control can dramatically expand the capabilities of robots. This is especially true for legged robots, which have so much ability to exert forces on the world around them. But how do we train robots that can do this? Baoxiong Jia tells us more in our discussion of his team’s recent Best Paper Award-winning work on learning a unified policy for position and force control, called UniFP. To learn more, watch Episode #49 of RoboPapers, hosted by Michael Cho - Rbt/Acc and Chris Paxton.

RoboPapers

44,803 views • 7 months ago

We have seen a lot of legged robots doing navigation in the wild. But how about mobile manipulation in the wild? I have been pushing the direction of learning a unified, efficient, and dynamic 3D representation of scenes (for navigation) and objects (for manipulation) for the past two years. And now we have GeFF --- our large-scale, generalizable feature field, that combines the speed of a feed-forward neural network with the rich semantics from Foundation Models, to handle dynamically changing scenes, and enable open-ended, language-grounded scene and object understanding.

Xiaolong Wang

42,767 views • 2 years ago

Work smarter with Clip Studio Paint's 3D models 💪 You can also use them as references for light sources, or to see how different objects bend and move. Have you tried using a 3D model in your artwork? Thanks for letting us share this, @Shei_babu!

CLIP STUDIO PAINT

15,656 views • 1 year ago

A few weeks ago, we shared our progress on articulated objects and long-horizon tasks. Here are two representative examples: - We've been steadily expanding our asset library to cover more articulated objects. Articulated objects have always been a challenging asset class to handle in simulation. Interacting with them requires robots to master atomic skills such as pushing, pulling, opening, and closing, and to understand part structure, interaction constraints, and how the object moves. - Long-horizon tasks can now be generated at scale. Long-horizon tasks are the other hard category: they require chaining multiple sub-goals in sequence. A failure early in the task can cascade and make the rest unrecoverable. Axis is scaling along three dimensions at once: data volume, data quality, and task difficulty.

Axis Robotics

13,478 views • 29 days ago

In their latest video, Boston Dynamics’s AI team explains how they make the Atlas humanoid perceive and interact with the world. Atlas uses an agile perception system to understand both the shape and context of objects in complex environments. Atlas combines 2D and 3D awareness, keypoint-based localization, and an object pose tracking system that fuses vision, kinematics, and object knowledge to handle occlusion and uncertainty. Accurate calibration ensures precise hand–eye coordination for reliable manipulation. The team is now working toward a unified model that merges perception and control – pushing beyond spatial AI toward physical intelligence.

The Humanoid Hub

50,051 views • 1 year ago

Spatial understanding is important to moving around in complex environments and is a huge part of the challenge of generalizing to new scenes. Most world models, however, largely ignore this spatial dimension, focusing on 2D images. Not PointWorld, though. PointWorld is a 3D world model trained from real and simulated data which can perform a wide variety of manipulation tasks on a real robot, including grasping or handling articulated objects, all without any additional fine-tuning. Wenlong Huang joins us to tell us more about what makes this work and how it’s different from other world models. Watch Episode #83 of RoboPapers, with Chris Paxton and Jiafei Duan, to learn more!

RoboPapers

16,197 views • 1 month ago

Start designing for #VisionPro. Break the boundaries of 2D screens. I have done it... You can as well. Let me break down what happens in this video and explain how a tool like ShapesXR can help you #design for #spatialcomputing right now. 1️⃣ Assets and components you created in #Figma can be imported and synced in ShapesXR so you can start designing in #mixedreality without any #3d skills. 2️⃣ You can prototype Gaze & Pinch interactions with head pose (on Quest 2 or 3) or actual eye tracking on Quest Pro. 3️⃣ You can switch seamlessly between VR and MR. Get tiny when you want to manipulate objects precisely or HUGE when you want to have a dollhouse view of your creation. Be Brave, Be Bold, Think Spatial.

Gabriele Romagnoli

67,836 views • 2 years ago

Exploring 3D Generation Capabilities in Alchemist AI🔮 Alchemist AI now enables users to create basic 3D assets, including models, objects, environments, and simulations, expanding its capabilities beyond 2D asset generation. What Does This Mean for You? • 3D Environment Creation: Generate virtual worlds, such as game environments or fantasy landscapes. Define parameters like terrain or structures to create dynamic and adaptable spaces tailored to your needs. • Customizable 3D Models and Objects: Design and modify 3D assets by adjusting dimensions, materials, and textures. Whether crafting prototypes or characters, users maintain full creative control. • Interactive Simulations: Build physics-based simulations or animated scenes. With upcoming support for sprite libraries and animation rigs, fine-tune object behaviors and interactions to suit your projects. With future API updates, Alchemist AI’s 3D generation capabilities will further expand, enhancing tools for creating models, environments, and simulations. AI-assisted text-to-3D will also be introduced—just describe your vision, such as 'space station' or 'orange sports car' and the system will generate customizable base assets.

ALCHEMIST AI 🔮

35,641 views • 1 year ago

The power of generative models, now embodied in humanoids. Announcing DreamControl: after a year-long research effort at General Robotics, we present a scalable framework for whole-body humanoid control that fuses diffusion priors with reinforcement learning to unlock real-world scene interaction. Diffusion + RL → natural whole-body skills on real robots. DreamControl enables humanoids to move beyond locomotion demos → performing natural, human-like skills such as picking and lifting objects; opening drawers and doors; precise punching, kicking, and jumping; and bimanual manipulation tasks. Our key innovation: a diffusion prior over human motion that guides RL, eliminating the need for massive teleoperation datasets, and producing motions that look human while transferring to real hardware. Trained purely in simulation, deployed on the Unitree G1 humanoid, DreamControl policies run in real time, bridging sim-to-real with unprecedented naturalness. We leverage a novel hybrid edge + cloud infrastructure that runs RL-trained policies on the edge, backed by powerful AI models running in the cloud. This is the next step in General Robotics’ journey toward general-purpose humanoid assistants that interact, adapt, and assist autonomously. Paper: Blog: 1/n

Ashish Kapoor

118,133 views • 10 months ago

Tencent announces AppAgent: Multimodal Agents as Smartphone Users. Paper page: Recent advancements in large language models (LLMs) have led to the creation of intelligent agents capable of performing complex tasks. This paper introduces a novel LLM-based multimodal agent framework designed to operate smartphone applications. Our framework enables the agent to operate smartphone applications through a simplified action space, mimicking human-like interactions such as tapping and swiping. This novel approach bypasses the need for system back-end access, thereby broadening its applicability across diverse apps. Central to our agent's functionality is its innovative learning method. The agent learns to navigate and use new apps either through autonomous exploration or by observing human demonstrations. This process generates a knowledge base that the agent refers to for executing complex tasks across different applications. To demonstrate the practicality of our agent, we conducted extensive testing over 50 tasks in 10 different applications, including social media, email, maps, shopping, and sophisticated image editing tools. The results affirm our agent's proficiency in handling a diverse array of high-level tasks.

AK

343,834 views • 2 years ago

Elon on Optimus in today’s CNBC interview ⦿ Currently, Optimus is being trained from demonstrations collected by humans wearing mocap suits with cameras on their heads – performing primitive tasks such as opening doors, picking up objects, and dancing. This is needed to bootstrap the intelligence to have basic functions before moving to more complex learning. ⦿ A major breakthrough would be the ability for Optimus to learn by watching videos – YouTube or how-to videos – similar to how humans can learn. This would unlock dramatic task extensibility. ⦿ Future progress will also involve self-play – where Optimus can interact with the environment and toys using the right reward function – similar to how children learn. ⦿ While some AI and compute advancements are still needed, Elon doesn’t see the threshold of intelligence as an insurmountable barrier. ⦿ Elon believes humanoid robots will become the biggest product ever – and demand will be insatiable. He sees tens of billions of robots as possible in the long run, though that’s at least a decade away. ⦿ Tesla's target is to produce 1 million Optimus robots by 2030 – which Elon considers a reasonable goal.

The Humanoid Hub

29,921 views • 1 year ago

Robot AI brains, aka Vision-Language-Action models, cannot adapt to new tasks as easily as LLMs like Gemini, ChatGPT, or Grok. LLMs can adapt quickly with their in-context learning (ICL) capabilities. But can we inject ICL abilities into a pre-trained VLA like pi0? Yes! Introducing RICL (Retraining for In-Context Learning), our Conference on Robot Learning (CoRL) 2025 paper. Our RICL-pi0 model can adapt to unseen objects, novel motions, and new scenes with just ICL and RAG (retrieval-augmented generation). RICL-pi0 also boosts performance on the long tail of tasks. A quick 1-minute video summary:

Kaustubh Sridhar

52,158 views • 11 months ago

Is VideoGen starting to become good enough for robotic manipulation? 🤖 Check out our recent work, RIGVid — Robots Imitating Generated Videos — where we use AI-generated videos as intermediate representations and 6-DoF motion retargeting to guide robots in diverse manipulation tasks: pouring, wiping, mixing, and more. 🔗 Key takeaways: - VideoGen starts to become good enough for robotics - As the field progresses, we are expecting much better results in the coming years - Depending on whether video prediction models take actions or not (VideoGen vs Action-Conditioned Video Prediction), there are different ways to use them. - Controllability & steerability are still issues In the paper, we explore: – How do different video generation models compare for robotic imitation? – Can generated videos replace real videos for imitation? – What causes failure of imitation given high-quality videos? – How does imitating from video compare with other representations (e.g., keypoint constraints like ReKep)? 🎥 Watch the video for (1) AI-generated inputs, (2) robot executions, and (3) the 3D intermediate representation bridging the embodiment gap.

Yunzhu Li

16,540 views • 1 year ago

For robots to be useful, they must be able to interact with a wide variety of environments; and yet, scaling interaction data is difficult, expensive, and time-consuming. Instead, much research revolves around sim-to-real manipulation, but mostly this has not been mobile manipulation. Recently, though, this has begun to change. Two recent papers from Tairan He and Haoru Xue show us how to unlock the potential of this technique, building policies which, without any real data at all, can move objects around and open doors in the real world with a humanoid robot. Watch Episode #60 of RoboPapers now to learn more, hosted by Chris Paxton and Jiafei Duan. In this episode, we cover two papers. First is VIRAL: Visual Sim-to-Real at Scale for Humanoid Loco-Manipulation; and second is DoorMan: Opening the Sim-to-Real Door for Humanoid Pixel-to-Action Policy Transfer.

RoboPapers

30,767 views • 5 months ago

Human skin plays an important role in how we interact with the world and robustly manipulate objects. It’s important not just when we can’t see things with our eyes, but also when we want to pick up something heavy, or apply a very specific amount of force. So, it makes sense to want to give robots skin. Enter DexSkin: a soft, deformable electronic skin which can be applied across different surfaces and used to cover robot hands or fingers. Suzannah Wistreich and Baiyu Shi talk to us about their work building DexSkin, showing how it’s useful for policy learning, including online reinforcement learning, and how it can be calibrated and policies transferred across sensors. They also open-sourced their code and methods for building the sensors. To learn more, watch Episode #88 of RoboPapers now, hosted by Chris Paxton and Jiafei Duan!

RoboPapers

20,644 views • 25 days ago