Presenting DemoDiffusion: an extremely simple approach that enables a pre-trained 'generalist' diffusion policy to follow a human demonstration of a novel task at inference time. One-shot human imitation without requiring any paired human-robot data or online RL 🙂 1/n

Homanga Bharadhwaj

3,070 subscribers

32,919 views • 1 year ago • via X (Twitter)

8 comments

Homanga Bharadhwaj • 1 year ago

The key insight of DemoDiffusion is to start the diffusion policy's denoising process from the re-targeted human hand trajectory instead of from pure noise. This simple approach doesn't require fine-tuning or updating the diffusion policy in any way! 2/n
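The warm start described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the linear noise schedule, the `start_frac` knob, the Euler update and the two-mode toy policy are all assumptions made for the example, and the real method uses a frozen pre-trained policy (pi-0) in place of `make_toy_policy`.

```python
import numpy as np


def warm_start_denoise(retargeted, denoise_step, num_steps=50, start_frac=0.4, rng=None):
    """Denoise starting from a partially noised demo trajectory instead of pure noise.

    retargeted   : (horizon, action_dim) robot actions retargeted from the human hand.
    denoise_step : callable (x, sigma, sigma_next) -> x at the lower noise level.
    start_frac   : fraction of the schedule that is actually run (hypothetical knob);
                   1.0 recovers the usual start from pure noise.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Assumed linear noise schedule from 1 (pure noise) down to 0 (clean actions).
    sigmas = np.linspace(1.0, 0.0, num_steps + 1)
    start = int(round((1.0 - start_frac) * num_steps))
    sigma0 = sigmas[start]
    # Noise the demo trajectory to the intermediate level sigma0.
    x = (1.0 - sigma0) * retargeted + sigma0 * rng.standard_normal(retargeted.shape)
    # Run only the remaining denoising steps; the policy itself is never updated.
    for i in range(start, num_steps):
        x = denoise_step(x, sigmas[i], sigmas[i + 1])
    return x


def make_toy_policy(modes):
    """Toy stand-in for a frozen policy: it predicts the closest known trajectory."""
    modes = np.asarray(modes)

    def denoise_step(x, sigma, sigma_next):
        # Predict the clean trajectory as the nearest mode (toy behaviour only).
        dists = ((modes - x) ** 2).sum(axis=(1, 2))
        x0_pred = modes[np.argmin(dists)]
        # Euler step along (x - x0_pred) / sigma toward the lower noise level.
        return x + (sigma_next - sigma) * (x - x0_pred) / sigma

    return denoise_step


horizon, action_dim = 16, 2
reach_left = -np.ones((horizon, action_dim))
reach_right = np.ones((horizon, action_dim))
policy_step = make_toy_policy([reach_left, reach_right])

# Hypothetical retargeted demo that lies close to the "reach_right" behaviour.
demo = 0.8 * np.ones((horizon, action_dim))
actions = warm_start_denoise(demo, policy_step, rng=np.random.default_rng(0))
print(np.allclose(actions, reach_right))
```

In this toy setup the demo only decides which of the policy's own behaviours the sampler settles into, while the final actions come from the policy. A smaller `start_frac` keeps the result closer to the raw retargeted trajectory, and a larger one gives the policy more freedom to correct it.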

Homanga Bharadhwaj • 1 year ago

Results show that DemoDiffusion can perform tasks that the pre-trained diffusion policy (pi-0) fails at zero-shot, just from one human demonstration of the task! 3/n

Homanga Bharadhwaj • 1 year ago

We even see zero-shot generalization to objects different from the ones used in the human demonstration! This suggests DemoDiffusion is able to exploit the semantic/spatial generalization of the pre-trained diffusion policy while guiding it based on the human demo. 4/n

Homanga Bharadhwaj • 1 year ago

DemoDiffusion is made possible by @sungj1026's amazing lead and @shubhtuls's precise insights on diffusion models @CMU_Robotics. Code, Videos, Paper: (finally, thanks to @physical_int for pi0 and @geopavlakos @JitendraMalikCV et al. for HaMeR) n/n

Homanga Bharadhwaj • 1 year ago

@shubhtuls @CMU_Robotics @physical_int @geopavlakos @JitendraMalikCV Also check out this alternate thread from @sungj1026 on DemoDiffusion (n+1)/n

Ted Xiao • 1 year ago

Nice work! Warm-starting the denoising process with a human prior is very smart.

Himanshu Kumar • 1 year ago

Perhaps true mastery lies in effortless adaptation, not rigid programming.

Arsen Ibragimov • 1 year ago

That's clever, skipping the fine-tuning part is a flex

Related videos

🤷What if we want to learn from human data... without human data? In our work NIL (No-data Imitation Learning) #CVPR2026, we explore a simple but ambitious question: Can robots learn directly from AI-generated videos without any curated demonstration data? 🔗

Chenhao Li

22,062 views • 1 month ago

The most frustrating part of imitation learning is collecting huge amounts of teleop data. But why teleop robots when robots can learn by watching us? Introducing Point Policy, a novel framework that enables robots to learn from human videos without any teleop, sim2real, or RL.

Siddhant Haldar

69,056 views • 1 year ago

1/ 🧠Humans are the best robot data source! 2/ 👓Human egocentric video is rich in quantity, but poor in quality. 3/ Beyond scaling data, smarter representation and architecture matter just as much. 4/ Want an open-source framework to train your own learn-from-human-data robot policy? 🚀We introduce HumanEgo: Zero-Shot Robot Learning from Minutes of Human Egocentric Videos⬇️ ✦ Zero-Shot Human-to-Robot Transfer ✦ Robot-Data-Free ✦ Just 30 min of data per task ✦ Collect by Anyone, Anytime, Anywhere ✦ Deploy on Any Robot, Any Camera, Any Environment ✦ Open-Source & Easy-to-Implement Let's squeeze every bit of signal out of human data! 🌐 Website: 📄 Paper: 💻 Code: 📹 Video: 🧵 1/n

Zhi (Leo) Wang

109,595 views • 2 months ago

HDMI (HumanoiD iMitation for Interaction) is a framework enabling humanoid robots to learn whole-body object interaction skills from monocular RGB human videos. It extracts and retargets human poses and object trajectories using GVHMR and LocoMujoco, building reference datasets with contact annotations. The data is used to train an RL policy via robot-object co-tracking. HDMI achieved 67 consecutive door traversals.

The Humanoid Hub

17,395 views • 10 months ago

Can we synthesize 3D human-scene interactions without learning from any 3D data? Yes! Check out Lei Li's GenZI, a novel zero-shot approach to generating 3D interactions by distilling priors from large vision-language models.

Angela Dai

106,862 views • 2 years ago

Learning dexterous policies from human videos is challenging due to differences between human and robot hands. We present HuDOR, a method that learns dexterous policies within the robot's physical constraints using just one human video and an hour of online interactions! [1/n]

Irmak Guzey

65,121 views • 1 year ago

The problem with humanoid teleoperation is that it is expensive and difficult to scale. Enter NVIDIA's EgoScale:
- A VLA model pretrained on thousands of hours of egocentric human videos.
- Mid-trained via 50 hours of human + 4 hours of robot "play" data for human-robot alignment.
- Fine-tuned with very few examples of task-specific robot teleoperation (100 or fewer per task).
- Successfully transfers across 5-finger (Sharpa) and 3-finger (Unitree G1) robot hands.
- Performance scales predictably as data increases.

The Humanoid Hub

44,441 views • 5 months ago

Tired of teleoperating your robots? We built a way to scale robot datasets without teleop, dynamic simulation, or even robot hardware. Just one smartphone scan + one human hand demo video → thousands of diverse robot trajectories. Trainable by diffusion policy and VLA models as-is. Introducing: Real2Render2Real 👉

Max Fu

69,381 views • 1 year ago

What happens when robot world models learn from human experience at scale? 🤔 DreamDojo from NVIDIA Research is a generalist robot world model pretrained on 44K hours of egocentric human videos and then post-trained on robot data to generalize across new objects and environments. After distillation, it runs at 10 FPS for live teleoperation, policy evaluation, and model-based planning. Read the ICML paper to learn more 📄

NVIDIA Robotics

22,413 views • 27 days ago

Sim2Real RL for Vision-Based Dexterous Manipulation on Humanoids. TL;DR: we train a humanoid robot with two multifingered hands to perform a range of dexterous manipulation tasks, with robust generalization and high performance, without human demonstrations :D

Toru

49,561 views • 1 year ago

RobotMDM, by Disney Research, combines diffusion-based motion generation with RL to produce physics-aware humanoid motions from text prompts. Trained on human motion data with a reward surrogate for physical feasibility, it ensures realistic motions.

The Humanoid Hub

22,943 views • 1 year ago

We discovered an emergent property of VLAs like π0/π0.5/π0.6: as we scale up pre-training, the model learns to align human videos and robot data! This gives us a simple way to leverage human videos. Once π0.5 knows how to control robots, it can naturally learn from human video.

Physical Intelligence

1,184,212 views • 7 months ago

Egocentric Videos for Robot Training 🧵 Researchers at NYU and UC Berkeley published research in which they developed a system called EgoZero that trains robots using human demonstration videos recorded with smart glasses. 👓 The system converts first-person human actions into 3D point-based state-action representations. The policies, executed on a gripper-equipped robot, achieved a 70% zero-shot success rate across seven manipulation tasks with only 20 minutes of human data per task. EgoZero stands as solid empirical evidence that egocentric smart-glass video collected from everyday human behavior can serve as powerful, scalable training data for real robot learning. Vincent Liu Ademi Adeniji

VaderResearch

23,055 views • 10 months ago

Wouldn't it be great if we could train robots without any teleoperation! In our latest paper, we train robots to mimic a human video of the task by simply matching the object features using RL. We only need one video and under an hour of robot training.

Lerrel Pinto

46,221 views • 1 year ago

How to learn dexterous manipulation for any robot hand from a single human demonstration? Check out DexMachina, our new RL algorithm that learns long-horizon, bimanual dexterous policies for a variety of dexterous hands, articulated objects, and complex motions.

Mandi Zhao

120,954 views • 1 year ago

We developed a simple, sample-efficient online RL technique for post-training image generation models. We see it as a possible steerable alternative to CFG, driven by any scalar reward, including human preference.

David McAllister

66,266 views • 3 months ago

NVIDIA just announced EgoScale 🤖🧠 NVIDIA Research has uncovered a log-linear scaling law for robot dexterity by pretraining VLA models on over 20,000 hours of egocentric human video. This massive dataset is 20 times larger than previous efforts and proves that robot intelligence follows a predictable path: the more human data, the lower the loss. The secret is a simple recipe combining large-scale human pretraining with a small amount of aligned human-robot mid-training to bridge the gap. In testing, this method boosted the average success rate by 54% on a 22-DoF robotic hand compared to policies built without pretraining. EgoScale also enables one-shot task adaptation and works across different hardware, suggesting that human motion is a universal motor prior for robots. Website: Paper: Source: NVIDIA Research #Robot #Humanoid #Robotics #AI #EmbodiedAI #PhysicalAI #NVIDIA #EgoScale #GR00T

RoboHub🤖

43,752 views • 5 months ago

1/🧠Humans are the best robot data source — but video alone misses one thing: force. 2/🙁Tactile gloves capture force — but they're costly and block the real touch manipulation depends on. 3/💪Maybe the future of touch lives on your wrist: surface EMG reads the muscles that cause force — tactile sensing without ever touching a tactile sensor. 4/🔥Want a fully open-source framework — hardware + software — to train your own force-aware learn-from-human-data robot policy? 🚀We introduce ForceBand: Learning Forceful Manipulation with sEMG -- bring force into human videos with sEMG, for force-aware manipulation ⬇️ ✦ Zero-Shot Human-to-Robot Transfer ✦ Force Beyond Vision ✦ Free-Hand Force Sensing ✦ Collect by Anyone, Anytime, Anywhere ✦ Deploy on Any Robot, Any Camera, Any Environment ✦ Open-Source & Low-Cost & Easy-to-Implement Let's squeeze every bit of signal out of human data, and let robots feel the force! 🌐 Website: 📄 Paper: 💻 Code: 🎥 Video: 🧵 1/n

Zhi (Leo) Wang

50,541 views • 1 month ago

An interactive world model developed by NVIDIA in collaboration with academic partners.
- DreamDojo turns egocentric human video data into physical intelligence.
- Human data is more scalable than robotics data but lacks action labels.
- To solve this, a dedicated action model extracts latent actions by identifying physics and motion deltas between frames.

Training
- A massive 44k hours of video data are used for pre-training.
- Post-training on small-scale robot datasets maps human physics to specific robot embodiments.
- An additional distillation stage converts the model into an autoregressive, few-step diffusion model, enabling real-time, action-controllable simulation.

Primary Use Cases
- Live Teleoperation: Controlling a robot inside a world simulation in real-time.
- Model-based Planning: Previewing and curating the best actions for improved success.
- Policy Evaluation: Testing robot policies in realistic, out-of-distribution scenarios.

Everything that's open-sourced: weights, code, post-training dataset, eval set, and details to reproduce.

The Humanoid Hub

11,575 views • 5 months ago

Tired of teleoperation? One human video → 1,000s of robot demos. (📍GitHub ) Scaling Robot Data Without Dynamics Simulation or Robot Hardware. Real2Render2Real (R2R2R) is a new way to scale robot data without physics simulation or hardware. You take a phone scan + a single monocular human demo. It tracks the motion, renders photorealistic scenes, and generates diverse, robot-agnostic trajectories ready for training.
> No teleop, no sim, no robot, just a phone and a video
> Train VLA models and diffusion policies directly on the output
> Supports multiple robot embodiments with kinematic consistency
> 1000s of demos in 1/27 the time of real-world collection
Thank you, Max Fu, for sharing!! Project: Paper: Code coming soon: It shows that with the right pipeline, you can scale robot learning data without touching a robot. One of the most interesting directions in scalable robotics today.

Ilir Aliu

42,864 views • 6 months ago