Imitation learning has a data scarcity problem. Introducing EgoDex from Apple, the largest and most diverse dataset of dexterous human manipulation to date — 829 hours of egocentric video + paired 3D hand poses across 194 tasks. Now on arxiv: (1/4)

Ryan Hoque

1,971 subscribers

114,164 views • 1 year ago • via X (Twitter)

Science & Technology • News & Politics • Education

11 comments

Ryan Hoque • 1 year ago

Unlike teleoperation, egocentric video is passively scalable - like text and images on the Internet. We use Apple Vision Pro to collect video + precise pose annotations (unlike Ego4D, which lacks native pose data). This unlocks 5x the scale of existing large datasets like DROID.

Ryan Hoque • 1 year ago

We also propose new benchmarks and train imitation learning policies for dexterous trajectory prediction. Below are 30 Hz wrist and fingertip trajectories on the test set, where blue = ground truth, red = model predictions, and points get lighter up to 2 seconds in the future.
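As a rough illustration of the numbers in the post above: at 30 Hz, a 2-second horizon corresponds to 60 predicted steps per trajectory. The sketch below is not the authors' code. The function names, the opacity fade used to mimic "points get lighter", and the mean point-distance error are all assumptions made for illustration.

```python
import math

SAMPLE_RATE_HZ = 30     # from the post: trajectories are sampled at 30 Hz
HORIZON_SECONDS = 2.0   # from the post: predictions extend up to 2 seconds ahead


def horizon_steps(rate_hz, seconds):
    """Number of future samples covered by the prediction horizon."""
    return round(rate_hz * seconds)


def fade_alphas(num_steps, min_alpha=0.2):
    """Hypothetical plotting helper: opacity falls linearly from 1.0 to min_alpha."""
    if num_steps == 1:
        return [1.0]
    step = (1.0 - min_alpha) / (num_steps - 1)
    return [1.0 - i * step for i in range(num_steps)]


def mean_point_error(pred, truth):
    """Assumed metric: mean Euclidean distance between matching 3D points."""
    dists = [math.dist(p, t) for p, t in zip(pred, truth)]
    return sum(dists) / len(dists)


steps = horizon_steps(SAMPLE_RATE_HZ, HORIZON_SECONDS)
alphas = fade_alphas(steps)
```

Here `steps` is the per-trajectory prediction length, and `alphas[i]` could be passed as the opacity of the i-th future point when drawing the ground-truth (blue) and predicted (red) tracks.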

Ryan Hoque • 1 year ago

The full dataset is now publicly available to the community; access details are in the paper. Sample code for data loading is coming soon. Enjoy!

SecBriefs | Making Cybersecurity Simple • 1 year ago

⚠️The average person generates 2.5 quintillion bytes of data annually. That's enough to fill 575,000 libraries!📚 This data is used to track, target, and manipulate you. #Cybersecurity matters.💡 Cybersecurity Dictionary for Everyone is on Apple Books:

Raul • 1 year ago

Perfect for Optimus to learn new skills

RL • 1 year ago

Fyi the dataset links don't work: “NoSuchKey: The specified key does not exist. datasets/egodex/[filename].zip 9C2FBJJ7FJHKHDT3Uk3l1oHoR9NeaNJdC7gInDjt5u8slFtW5lRt9wFR0MQIWNXIk4sTWiLGEYF22KUPQQ9X6CVC+UU=”

Michael Black • 1 year ago

Looks great. You mention that it’s now public but I don’t find the link anywhere.

Hussein Lezzaik • 1 year ago

excellent work, congrats!

Soroush Nasiriany • 1 year ago

Congrats Ryan! Awesome work as always!!

Humanoids daily • 1 year ago

Impressive! Interesting use of the Apple Vision Pro.

Idriel Vermillion • 1 year ago

So hype

Related videos

Impressive on every dimension! Genesis AI has announced GENE-26.5, its first robotic foundation model, aimed at achieving human-level dexterous manipulation: - Single model, shared weights, handles egg cracking, lab pipetting, bimanual Rubik's Cube, smoothie making, and wire harnessing. Most tasks: under 1 hour of task-specific data, 1x real-world speed. - Genesis Hand 1.0: human-size, 20 active DoF, soft-contact skin. Paired glove gives a 1:1:1 mapping (glove, human hand, robot hand). 100x cheaper than teleop, 5x more data-efficient internally. - Data engine pulls from three sources: glove data, egocentric (head-cam) video, and large-scale internet video. The 1:1 hand-to-human match closes the embodiment gap, letting Genesis use video data more effectively than rivals. - Flow matching across vision, tactile, proprioception, and language inputs. Per Genesis, scaling data and compute directly improves zero-shot

The Humanoid Hub

95,119 views • 2 months ago

Happy to share what I’ve been working on since joining Genesis! GENE-26.5 is a one-of-a-kind, robotics-native multimodal foundation model that learns from diverse, in-the-wild data across modalities and outputs actions enabling a 54-DoF robot system to perform the most dexterous, long-horizon manipulation tasks to date—approaching human-level capability. This is the result of innovations across the full stack—data collection and processing, robot systems, model architecture, training strategies, and scalable evaluation infrastructure.

Zu Wang

19,459 views • 2 months ago

⚡️EgoVerse is a first-of-its-kind, collaborative ecosystem for human-to-robot learning. The consortium leverages Project Aria to capture high-fidelity, egocentric human data — including 3D hand and head poses — to train next-gen robot manipulation policies. With over 1,300 hours of data across 2,000+ tasks, EgoVerse is a prime example of how the Aria Research Kit is being used by our partners to accelerate the future of embodied AI. Learn more: 🔗 📰 Apply for the Aria Research Kit: #MachineLearning #Robotics #ProjectAria #EgoVerse #ComputerVision Simar Kareer , Ryan Punamiya , Roger Qiu, Xiongyi Cai , Alexey Gavryushin

Project Aria @Meta

17,982 views • 3 months ago

Egocentric Videos for Robot Training 🧵 Researchers at NYU and UC Berkeley published research where they developed a system called EgoZero that trains robots using human demonstration videos recorded with smart glasses. 👓 The system converts first-person human actions into 3D point-based state-action representations. The policies, executed on a gripper-equipped robot, achieved a 70% zero-shot success rate across seven manipulation tasks, with only 20 minutes of human data per task. EgoZero stands as one of the solid empirical proofs that egocentric smart-glass video collected from everyday human behavior can serve as powerful, scalable training data for real robot learning. Vincent Liu Ademi Adeniji

VaderResearch

23,055 views • 10 months ago

Teaching robots to perform dexterous manipulation tasks currently requires teleoperation, which limits demonstration quality, speed, and scalability. Instead, why not use human videos? The problem is that a human hand isn’t a robot hand, so data must be retargeted using simulation to resolve issues like collisions and interpenetration when controlling the hand. In VideoManip, Hongyi Chen and co-authors built a system to solve this problem, taking in RGB videos of humans performing manipulation tasks and using them to create accurate simulations with which to learn robot policies. Watch episode #73 of RoboPapers, hosted by Michael Cho - Rbt/Acc and Chris Paxton, now to learn more!

RoboPapers

27,249 views • 3 months ago

Tactile feedback is one of the most important modalities in manipulation, but has been underutilized in dexterous hands. T-Dex is a framework for learning dexterous policies from tactile play data, beating vision and torque-based methods by 1.7x. 🧵👇

Lerrel Pinto

84,608 views • 3 years ago

This giant free dataset could make helper robots way smarter, way faster: An open-source robotics stack from Berkeley AI researchers featuring the largest teleoperation dataset released to date with over 3,500 hours of bimanual manipulation data across 200 tasks. The video showcases autonomous bimanual robot performance on dexterous tasks including box folding, Lego sorting, AirPod insertion, t-shirt folding, backpack packing, and box unlocking using learned policies. Also covered: sim-to-real correlations, training insights like flow loss predicting real-world success, and lightweight infrastructure for DAgger interventions. Thank you for sharing, Ritvik Singh, and everyone else who contributed to this! Links to the paper plus dataset at under permissive licensing. ——- Weekly robotics and AI insights. Subscribe free:

Ilir Aliu

13,264 views • 1 month ago

Scaling vision-language-action (VLA) models to high-DoF dexterous hands has long been a "holy grail" challenge due to the high-dimensional action space and data scarcity. As a wrap up of the year 2025, we are releasing GR-Dexter, a holistic hardware-model-data framework for generalist manipulation on a bimanual dexterous-hand robot. This is the first VLA system to achieve: ✅ High-DoF Control: Managing a 56-DoF bimanual system (21-DoF per hand). ✅ Long-Horizon Tasks with tool use: Vacuuming, bread serving with tongs, and table decluttering. ✅ Open-World Generalization: Robust performance with unseen objects and abstract instructions. Project page: ArXiv:

Xiao Ma

93,811 views • 6 months ago

From day one, mimic has been focused on a single goal: general-purpose dexterous manipulation. Today we're proud to announce the mimic hand M1 and the mimic wearable U1. We believe the only way to solve dexterous manipulation at scale is by going full-stack at the frontier of physical AI, building every layer ourselves around one fixed point, the human hand. The M1 is a highly backdrivable, tendon-driven hand that covers the full range of human capability, from heavy payloads to fine manipulation.

mimic

86,789 views • 6 days ago

Most of today's AI can see the world, but it doesn’t **feel** it. Capturing the sense of touch is crucial for dexterous robotic manipulation, user modeling, and understanding physical interactions. Introducing OpenTouch: bringing full-hand tactile sensing into real-world AI🖐️ OpenTouch is collected in-the-wild using tactile sensing gloves, hand pose tracking gloves, and egocentric glasses. It includes: • 5 hours of real-world data, • 3 hours densely annotated contact-rich interactions, • 2,900 curated interaction clips, • across 800 objects, 14 environments, and 29 grasp types. all open at:

Paul Liang

47,158 views • 3 months ago

Real-world robot data is expensive and slow to collect, creating a major challenge for humanoid development. 🤖 The NVIDIA GR00T N1.6 open vision language action model is pre-trained on a diverse mix of data, including thousands of hours of Stanford Vision and Learning Lab’s BEHAVIOR simulation data, which covers long-horizon everyday manipulation tasks. This diverse training is the key to robust cross-embodiment performance and real-world adaptability. 🌍 Read the blog 🔗

NVIDIA Robotics

13,456 views • 5 months ago

The problem with humanoid teleoperation is that it is expensive and difficult to scale. Enter NVIDIA's EgoScale: - A VLA model pretrained on thousands of hours of egocentric human videos. - Mid-trained via 50 hours of human + 4 hours of robot "play" data for human-robot alignment. - Fine-tuned with very few examples of task-specific robot teleoperation (100 or fewer per task). - Successfully transfers across 5-finger (Sharpa) and 3-finger (Unitree G1) robot hands. - Performance scales predictably as data increases.

The Humanoid Hub

44,441 views • 4 months ago

We are excited to share EMDB, a novel dataset of 3D human poses for in-the-wild monocular videos, including global trajectories. Data and toolkit code is now available. More details in the thread below. Project Page:

AIT Lab

14,601 views • 2 years ago

We present EgoReAct: Real-time 3D human reaction generation from streaming egocentric video. 🌟Reacting to streaming egocentric video is something humans do every day. We hope EgoReAct makes human motion more human-like. 🔎 What we found: existing ego-reaction data can be spatially inconsistent (e.g., moving reactions paired with fixed-camera videos), which breaks 3D grounding. 📷 What we built: HRD, a spatially aligned egocentric video–reaction dataset (3,500 pairs, 32 categories), plus a spatially aligned ViMo fix for fair evaluation. (Instead of collecting expensive ground-truth motion, we employ VDM to generate the egocentric videos.) 👁️⚡🏃 Our simple yet effective pipeline: motion tokenization for compact discrete codes + an autoregressive Transformer for online, strictly-causal generation. Metric depth and head dynamics further improve 3D spatial consistency. Project Page: ArXiv: #HumanMotion #EgocentricVision #3D #ARVR #Animation #AIGC #DeepLearning #GenerativeAI #Graphics #ComputerVision #Motion

Zhiyang (Frank) Dou

11,336 views • 6 months ago

Learning dexterous policies from human videos is challenging due to differences between human and robot hands. We present HuDOR, a method that learns dexterous policies within the robot's physical constraints using just one human video and an hour of online interactions! [1/n]

Irmak Guzey

65,121 views • 1 year ago

1/ 🧠Humans are the best robot data source! 2/ 👓Human egocentric video is rich in quantity, but poor in quality. 3/ Beyond scaling data, smarter representation and architecture matter just as much. 4/ Want an open-source framework to train your own learn-from-human-data robot policy? 🚀We introduce HumanEgo: Zero-Shot Robot Learning from Minutes of Human Egocentric Videos⬇️ ✦ Zero-Shot Human-to-Robot Transfer ✦ Robot-Data-Free ✦ Just 30 min of data per task ✦ Collect by Anyone, Anytime, Anywhere ✦ Deploy on Any Robot, Any Camera, Any Environment ✦ Open-Source & Easy-to-Implement Let's squeeze every bit of signal out of human data! 🌐 Website: 📄 Paper: 💻 Code: 📹 Video: 🧵 1/n

Zhi (Leo) Wang

109,595 views • 1 month ago

In my experience, robot 'generalists' are often jacks of all trades but masters of none. In training across multiple tasks and environments, robot policies fail to generalize robustly and effectively to each particular test setting. What if at test time, we non-parametrically *retrieved* “relevant” data from the training set and used it to significantly improve the performance of few-shot imitation learning to be robust to various test time scenes. Notably, we are *not* collecting lots of new data, just training more on sub-components of the same training data! Now, we’re certainly not the first to suggest retrieval, but in our new work - STRAP, we show how retrieving relevant *sub-trajectories* from offline datasets can significantly increase data reuse across tasks, when paired with an appropriate metric space. A 🧵 (1/7)

Abhishek Gupta

12,045 views • 1 year ago

The most frustrating part of imitation learning is collecting huge amounts of teleop data. But why teleop robots when robots can learn by watching us? Introducing Point Policy, a novel framework that enables robots to learn from human videos without any teleop, sim2real, or RL.

Siddhant Haldar

69,056 views • 1 year ago

Meta FAIR recently released the Seamless Interaction Dataset, the largest known high-quality video dataset of its kind, with: 4,000+ diverse participants 4,000+ hours of footage 65k+ interactions 5,000+ annotated samples This dataset of full-body, in-person, face-to-face interaction videos represents a crucial stepping stone to understanding and modeling how people communicate and behave when they’re together—advancing AI's ability to generate more natural conversations and human-like gestures. Download the dataset on Hugging Face: Learn more about the dataset:

AI at Meta

23,836 views • 1 year ago

We trained a humanoid with 22-DoF dexterous hands to assemble model cars, operate syringes, sort poker cards, fold/roll shirts, all learned primarily from 20,000+ hours of egocentric human video with no robot in the loop. Humans are the most scalable embodiment on the planet. We discovered a near-perfect log-linear scaling law (R² = 0.998) between human video volume and action prediction loss, and this loss directly predicts real-robot success rate. Humanoid robots will be the end game, because they are the practical form factor with minimal embodiment gap from humans. Call it the Bitter Lesson of robot hardware: the kinematic similarity lets us simply retarget human finger motion onto dexterous robot hand joints. No learned embeddings, no fancy transfer algorithms needed. Relative wrist motion + retargeted 22-DoF finger actions serve as a unified action space that carries through from pre-training to robot execution. Our recipe is called "EgoScale": - Pre-train GR00T N1.5 on 20K hours of human video, mid-train with only 4 hours (!) of robot play data with Sharpa hands. 54% gains over training from scratch across 5 highly dexterous tasks. - Most surprising result: a *single* teleop demo is sufficient to learn a never-before-seen task. Our recipe enables extreme data efficiency. - Although we pre-train in 22-DoF hand joint space, the policy transfers to a Unitree G1 with 7-DoF tri-finger hands. 30%+ gains over training on G1 data alone. The scalable path to robot dexterity was never more robots. It was always us. Deep dives in thread:

Jim Fan

293,383 views • 4 months ago