Your bimanual manipulators might need a Robot Neck 🤖🦒 Introducing Vision in Action: Learning Active Perception from Human Demonstrations ViA learns task-specific, active perceptual strategies—such as searching, tracking, and focusing—directly from human demos, enabling robust visuomotor policies under visual occlusions. 🧵👇

Haoyu Xiong

3,930 subscribers

122,084 views • 1 year ago • via X (Twitter)

Science & Technology, Education, Art

10 Comments

Haoyu Xiong • 1 year ago

Why do we need Active Perception (a.k.a. a Robot Neck)? Visual occlusion presents a significant challenge in everyday manipulation tasks. Robot wrist cameras can move with the arms, but their motion is primarily dictated by manipulation needs rather than by perceptual objectives. Here's a common failure case of a bimanual setup (without a "robot neck") in cluttered environments. 👇 2/7

Haoyu Xiong • 1 year ago

Many of today's data collection systems do not capture human perceptual behaviors. The observation mismatch—between what the human sees and what the robot learns from—hinders the learning of effective manipulation policies. To see what the robot sees, we developed a VR interface—Async Teleop, which introduces decoupled view rendering to reduce motion sickness 😵‍💫🥴🤢 — an issue that is insufficiently addressed in prior VR teleop systems. 3/7

Haoyu Xiong • 1 year ago

We train a Diffusion Policy that predicts bimanual arm actions for manipulation and neck actions that mimic human active perceptual strategies. Evaluation results show that ViA enables the learning of robust visuomotor policies for three complex, multi-stage bimanual manipulation tasks with a single active head camera. Check out the uncut policy rollouts👇 4/7

Haoyu Xiong • 1 year ago

ViA shows robust visual understanding. In the Lime & Pot task, the lime is randomly placed and often not visible at first. The robot learns to look around and search for the object before initiating arm actions. Check the rollouts👇 5/7

Haoyu Xiong • 1 year ago

Wrist camera is not all you need in cluttered environments. In our experiments, we validated that the [Chest & Wrist Cameras] fail to provide sufficient task-relevant information under visual occlusions. For example, the right wrist camera is completely occluded by the upper shelf tier during cup-grasping shown in the second row of the figure. 6/7

Haoyu Xiong • 1 year ago

We’ve open-sourced everything: Arxiv: Github: Hardware: Thanks to my incredible collaborators @XiaomengXu11 @jimmyyhwu @YifanHou2. Thanks to Jeannette @leto__jean for her exceptional guidance throughout this project. Shoutout to Shuran @SongShuran for her invaluable research mentorship and unwavering support during my visit to REAL @Stanford !

Digital Intelligent Self-Aware Entities = People • 1 year ago

Can we clarify why cameras don't just go on the "hands", or the "wrists" with a central constant camera on the spinal structure for stable overview? all of those neck parts wearing down so it can get a good look at the work, when the arms are already moving? Nah.

Carlos DP • 1 year ago

Suuuper cool work

Samarth Sinha • 1 year ago

Congrats Haoyu!!!

Haoyu Xiong • 1 year ago

Thanks @_sam_sinha_

Related Videos

Can we learn whole-body mobile manipulation directly from human demonstrations? Introducing Whole-Body Mobile Manipulation Interface (HoMMI) Egocentric + UMI, 0 teleop -> bimanual & whole-body manipulation, long-horizon navigation, active perception

Xiaomeng Xu

75,955 views • 3 months ago

Want a robot that learns household tasks by watching you? EquiBot is a ✨ generalizable and 🚰 data-efficient method for visuomotor policy learning, robust to changes in object shapes, lighting, and scene makeup, even from just 5 mins of human videos. 🧵↓

Jingyun Yang

88,352 views • 2 years ago

Tired of collecting demonstrations all day to train your robot? Introducing MimicGen, an autonomous data generation system for robotics. Using just 200 human demos we generated a large multi-task dataset of 50K demos! #CoRL2023 #NVIDIAResearch 👇 🧵 1/

Ajay Mandlekar

93,632 views • 2 years ago

Introducing Open-TeleVision: with Fully Autonomous policy video👇. We can conduct a long-horizon task, inserting 12 cans nonstop without any interruptions. We offer: 🤖 Highly precise and smooth bimanual manipulation. 📺 Active egocentric vision (with a moving neck) feedback. It is achieved by imitation learning from teleoperation: We propose a VR-based REAL-TIME teleoperation system that streams the stereo video observation from the robot camera to the VR device. The robot neck moves as the human head moves, the robot hands move as the human hands move, offering the operator an intuitive experience as the human herself becomes the robot. The devils are all in the details, and how to implement things right: ✅How to perform IK/retargeting for smooth and precise control. ✅How to do all these and also stream stereo video with no latency, all in real time. We released our code here: Active head hardware design: 1/n

Xiaolong Wang

25,572 views • 2 years ago

Learning dexterous policies from human videos is challenging due to differences between human and robot hands. We present HuDOR, a method that learns dexterous policies within the robot's physical constraints using just one human video and an hour of online interactions! [1/n]

Irmak Guzey

65,120 views • 1 year ago

Imitation learning works™ – but you need good data 🥹 How to get high-quality visuotactile demos from a bimanual robot with multifingered hands, and learn smooth policies? Check our new work “Learning Visuotactile Skills with Two Multifingered Hands”! 🙌

Toru

61,504 views • 2 years ago

How to learn dexterous manipulation for any robot hand from a single human demonstration? Check out DexMachina, our new RL algorithm that learns long-horizon, bimanual dexterous policies for a variety of dexterous hands, articulated objects, and complex motions.

Mandi Zhao

120,458 views • 1 year ago

✨ Introducing Keypoint Action Tokens. 🤖 We translate visual observations and robot actions into a "language" that off-the-shelf LLMs can ingest and output. This transforms LLMs into *in-context, low-level imitation learning machines*. 🚀 Let me explain. 👇🧵

Norman Di Palo

23,095 views • 2 years ago

What if robots could learn real-world tasks from your perspective… without ever touching a robot? This is a system that trains robot policies using nothing but human-first, egocentric video data from smart glasses. No robots, no teleop, no sensors, just humans doing real tasks in the real world. Why it matters ✅ Learns robot policies from 20 minutes of human video; zero robot demos ✅ Generalizes to new objects, views, and even robot morphologies ✅ Uses 3D points for interpretable, spatially grounded learning ✅ Deploys directly to real-world robots with strong zero-shot success Thank you, Vincent Liu, for sharing!!! Learn more here: 🔗 Paper: 🌐 Website: 📍 BOOKMARK FOR LATER

Ilir Aliu - eu/acc

10,509 views • 1 year ago

Robots are the bottleneck in scaling robotics, and learning from human video promises to solve it. But how can chaotic human data ever measure up to sanitized, lab-made teleoperation data? Introducing Do as I Do: establishing a much needed correspondence between human videos and dexterous robot data. Some fun insights below: 🧵

Mahi Shafiullah 🏠🤖

90,130 views • 15 days ago

Learned visuomotor policies are notoriously fragile, they break with changes in conditions like lighting, clutter, or object variations amongst other things. In Yunchu @ CoRL2025's latest work, we asked whether we could get these policies to be robust and generalizable with a clever choice of visual representation! The argument we made was - we want a choice of visual representation that specifically adapts to be sufficient, yet minimal for the task at hand. We thought about it from the perspective of flexible, key-point based representations. The key question becomes - how do we choose a sufficient, task-specific, yet minimal set of keypoints as a representation for policy learning. Yunchu proposes a neat way of automatically selecting task-relevant keypoints using a standard supervised learning objective, and using this for robust policy learning. This is largely under the same assumptions as behavior cloning, but with huge gains on robustness. Let’s understand how, 🧵 (1/8)

Abhishek Gupta

11,355 views • 1 year ago

Learning about active perception with Haoyu Xiong -- your robot needs a head, and to be able to control where it's looking, in order to perform complex tasks!

Chris Paxton

22,784 views • 9 months ago

Egocentric Videos for Robot Training 🧵 Researchers at NYU and UC Berkeley published research where they developed a system called EgoZero that trains robots using human demonstration videos recorded with smart glasses. 👓 The system converts first-person human actions into 3D point-based state-action representations. The policies executed on a gripper-equipped robot, achieved a 70% zero-shot success rate across seven manipulation tasks, with only 20 minutes of human data per task. EgoZero stands as one of the solid empirical proofs that egocentric smart-glass video collected from everyday human behavior can serve as powerful, scalable training data for real robot learning. Vincent Liu Ademi Adeniji

VaderResearch

23,055 views • 9 months ago

We just released TAVI -- a robotics framework that combines touch and vision to solve challenging dexterous tasks in under 1 hour. The key? Use human demonstrations to initialize a policy, followed by tactile-based online learning with vision-based rewards. Details in🧵(1/7)

Lerrel Pinto

138,536 views • 2 years ago

How to harness foundation models for *generalization in the wild* in robot manipulation? Introducing VoxPoser: use LLM+VLM to label affordances and constraints directly in 3D perceptual space for zero-shot robot manipulation in the real world! 🌐 🧵👇

Wenlong Huang

293,876 views • 3 years ago

Apart from solving new tasks, memory also allows our policies to be more robust: we show early signs of in-context adaptation, where the robot learns to adapt its behavior on-the-fly by learning from its past mistakes.

Physical Intelligence

12,921 views • 4 months ago

Vision-language models perform diverse tasks via in-context learning. Time for robots to do the same! Introducing In-Context Robot Transformer (ICRT): a robot policy that learns new tasks by prompting with robot trajectories, without any fine-tuning. [1/N]

Max Fu

40,435 views • 1 year ago

Impressive on every dimension! Genesis AI has announced GENE-26.5, its first robotic foundation model, aimed at achieving human-level dexterous manipulation: - Single model, shared weights, handles egg cracking, lab pipetting, bimanual Rubik's Cube, smoothie making, and wire harnessing. Most tasks: under 1 hour of task-specific data, 1x real-world speed. - Genesis Hand 1.0: human-size, 20 active DoF, soft-contact skin. Paired glove gives a 1:1:1 mapping (glove, human hand, robot hand). 100x cheaper than teleop, 5x more data-efficient internally. - Data engine pulls from three sources: glove data, egocentric (head-cam) video, and large-scale internet video. The 1:1 hand-to-human match closes the embodiment gap, letting Genesis use video data more effectively than rivals. - Flow matching across vision, tactile, proprioception, and language inputs. Per Genesis, scaling data and compute directly improves zero-shot

The Humanoid Hub

95,119 views • 1 month ago

We might be solving the wrong problem in robotics. That’s what this makes clear. UMI → Universal Manipulation Interface A simple $400 gripper that lets you teach robots by demonstration. You hold it like a tool. Show the task. The robot learns. No teleoperation. No expensive hardware. No robot-specific data. Stanford open-sourced everything → hardware, code, datasets. What stands out to me is the bottleneck. Not algorithms. Data. Teleoperation → ~35 demos/hour UMI → ~111 demos/hour And the data transfers across robots → UR5, Franka, others. The design is surprisingly practical: → GoPro fisheye lens (155° FOV) + mirrors for depth → SLAM + IMU for precise 6DoF tracking → latency matching for dynamic tasks → diffusion policies for multimodal actions Then it scales. Cheng Chi takes this further with Sunday Robotics (with Tony Zhao). A $200 glove → deployed in 500+ homes → ~10 million real-world interactions. Not lab data. Real human behavior. Their robot learns dishes, laundry, espresso → with zero robot-specific data. This is where the shift becomes obvious. From training robots in controlled environments → to learning directly from humans at scale So here’s the real question: Will robotics be unlocked by better models… or by unlocking data? #ArtificialIntelligence #Robotics #AI #Innovation #FutureOfWork

Pascal Bornet

185,867 views • 2 months ago