Your bimanual manipulators might need a Robot Neck 🤖🦒 Introducing Vision in Action: Learning Active Perception from Human Demonstrations ViA learns task-specific, active perceptual strategies—such as searching, tracking, and focusing—directly from human demos, enabling robust visuomotor policies under visual occlusions. 🧵👇
122,084 views • 1 year ago • via X (Twitter)
10 Comments

Why do we need Active Perception (a.k.a. a Robot Neck)? – Visual occlusion presents a significant challenge in everyday manipulation tasks. Robot wrist cameras can move with the arms, but their motion is primarily dictated by manipulation needs rather than by perceptual objectives. Here's a common failure case of a bimanual setup (without a "robot neck") in cluttered environments. 👇 2/7

Many of today's data collection systems do not capture human perceptual behaviors. The observation mismatch between what the human sees and what the robot learns from hinders the learning of effective manipulation policies. To see what the robot sees, we developed a VR interface, Async Teleop, which introduces decoupled view rendering to reduce motion sickness 😵💫🥴🤢, an issue that is insufficiently addressed in prior VR teleop systems. 3/7
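The thread does not spell out how decoupled view rendering works, so the following is only a toy sketch of one possible reading: the headset view always follows the operator's current head yaw, while the robot neck catches up at a limited rate and supplies scene snapshots. Every name here (`SceneSnapshot`, `DecoupledViewer`, `max_neck_step_deg`) and the single-axis yaw model are assumptions made for illustration, not the Async Teleop implementation.

```python
from dataclasses import dataclass


@dataclass
class SceneSnapshot:
    # Yaw of the robot camera (degrees) when this snapshot was captured.
    camera_yaw_deg: float


class DecoupledViewer:
    """Toy 1-D model: the headset view and the robot neck update independently."""

    def __init__(self, max_neck_step_deg: float = 5.0) -> None:
        # Hypothetical rate limit on how far the neck may turn per robot step.
        self.max_neck_step_deg = max_neck_step_deg
        self.neck_yaw_deg = 0.0
        self.snapshot = SceneSnapshot(camera_yaw_deg=0.0)

    def step_robot(self, operator_yaw_deg: float) -> None:
        """Turn the neck toward the operator's yaw, then capture a new snapshot."""
        error = operator_yaw_deg - self.neck_yaw_deg
        step = max(-self.max_neck_step_deg, min(self.max_neck_step_deg, error))
        self.neck_yaw_deg += step
        self.snapshot = SceneSnapshot(camera_yaw_deg=self.neck_yaw_deg)

    def render_offset(self, operator_yaw_deg: float) -> float:
        """Angle by which the latest snapshot is re-projected for the headset.

        The view is computed from the operator's current head yaw, so it does
        not wait for the robot neck to finish moving.
        """
        return operator_yaw_deg - self.snapshot.camera_yaw_deg


viewer = DecoupledViewer(max_neck_step_deg=5.0)
print(viewer.render_offset(30.0))  # view responds before the neck has moved
viewer.step_robot(30.0)
print(viewer.render_offset(30.0))  # remaining offset shrinks as the neck catches up
```

In this toy model the displayed view never lags the operator's head, which is the property one would expect to matter for motion sickness; the real system would re-project full 3D scene data rather than a single angle.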

We train a Diffusion Policy that predicts bimanual arm actions for manipulation and neck actions that mimic human active perceptual strategies. Evaluation results show that ViA enables the learning of robust visuomotor policies for three complex, multi-stage bimanual manipulation tasks with a single active head camera. Check out the uncut policy rollouts👇 4/7
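To make the idea of a single policy output covering both arms and the neck concrete, here is a minimal sketch of how a flat action vector could be split into its three parts. The dimensions (`ARM_DIM`, `NECK_DIM`), the ordering, and the names `BimanualNeckAction` and `split_action` are hypothetical; the thread does not state the actual action layout.

```python
from dataclasses import dataclass
from typing import List, Sequence

# Hypothetical dimensions, chosen only for illustration.
ARM_DIM = 7   # assumed: 6-DoF end-effector pose + 1 gripper value per arm
NECK_DIM = 6  # assumed: 6-DoF head-camera pose target


@dataclass
class BimanualNeckAction:
    left_arm: List[float]
    right_arm: List[float]
    neck: List[float]


def split_action(flat: Sequence[float]) -> BimanualNeckAction:
    """Split one flat policy output into left-arm, right-arm, and neck commands."""
    expected = 2 * ARM_DIM + NECK_DIM
    if len(flat) != expected:
        raise ValueError(f"expected {expected} values, got {len(flat)}")
    return BimanualNeckAction(
        left_arm=list(flat[:ARM_DIM]),
        right_arm=list(flat[ARM_DIM:2 * ARM_DIM]),
        neck=list(flat[2 * ARM_DIM:]),
    )


# Usage: a dummy 20-dimensional output standing in for one predicted action.
action = split_action([float(i) for i in range(2 * ARM_DIM + NECK_DIM)])
print(action.neck)
```

Predicting the neck command in the same vector as the arm commands means the camera motion is learned jointly with manipulation instead of being scripted separately.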

ViA shows robust visual understanding. In the Lime & Pot task, the lime is randomly placed and often not visible at first. The robot learns to look around and search for the object first before initiating arm actions. Check the rollouts👇 5/7

Wrist cameras are not all you need in cluttered environments. In our experiments, we found that the [Chest & Wrist Cameras] setup fails to provide sufficient task-relevant information under visual occlusions. For example, the right wrist camera is completely occluded by the upper shelf tier during cup grasping, as shown in the second row of the figure. 6/7

We’ve open-sourced everything: Arxiv: Github: Hardware: Thanks to my incredible collaborators @XiaomengXu11 @jimmyyhwu @YifanHou2. Thanks to Jeannette @leto__jean for her exceptional guidance throughout this project. Shoutout to Shuran @SongShuran for her invaluable research mentorship and unwavering support during my visit to REAL @Stanford !

Can we clarify why cameras don't just go on the "hands" or the "wrists", with a constant central camera on the spinal structure for a stable overview? All of those neck parts wearing down so it can get a good look at the work, when the arms are already moving? Nah.

Suuuper cool work

Congrats Haoyu!!!

Thanks @_sam_sinha_
