How to scale visual affordance learning that is fine-grained, task-conditioned, and works in the wild and in dynamic envs? Introducing Unsupervised Affordance Distillation (UAD), which distills affordances from off-the-shelf foundation models, *all without manual labels*. Very excited that this is nominated as a Best Paper Finalist at #ICRA2025! 🧵👇

93,552 views • 1 year ago • via X (Twitter)

11 Comments

Wenlong Huang · 1 year ago

Visual affordance allows robots to perceive actionable opportunities in an env, which is crucial for manipulation. We formulate affordance as language-conditioned, pixel-level continuous probabilities, from identifying the exact grasp point on handles to where to press pumps & hold scissors.
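To make that formulation concrete, here is a minimal sketch of the data shape it implies: one heatmap of values in [0, 1] per text query over the same image, with the highest-scoring pixel read off as a candidate interaction point. The queries, the 3x4 maps, and the `best_pixel` helper are hypothetical illustrations, not outputs or APIs from the released model.

```python
from typing import Dict, List, Tuple

# One affordance map: H x W values in [0, 1], one value per pixel.
AffordanceMap = List[List[float]]


def best_pixel(heatmap: AffordanceMap) -> Tuple[int, int]:
    """Return (row, col) of the pixel with the highest affordance value."""
    best = (0, 0)
    for r, row in enumerate(heatmap):
        for c, value in enumerate(row):
            if value > heatmap[best[0]][best[1]]:
                best = (r, c)
    return best


# Hypothetical toy predictions for a single 3x4 image under two queries.
predictions: Dict[str, AffordanceMap] = {
    "grasp the handle": [
        [0.0, 0.1, 0.0, 0.0],
        [0.1, 0.9, 0.3, 0.0],
        [0.0, 0.2, 0.1, 0.0],
    ],
    "press the pump": [
        [0.0, 0.0, 0.2, 0.8],
        [0.0, 0.0, 0.1, 0.3],
        [0.0, 0.0, 0.0, 0.0],
    ],
}

for query, heatmap in predictions.items():
    print(query, "->", best_pixel(heatmap))
```

The same image yields a different map for each instruction, which is what "language-conditioned" means here; the values are continuous rather than a binary mask.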

Wenlong Huang · 1 year ago

Yet scaling affordance learning is tough because it needs fine-grained labels. Our solution: automate labeling w/ vision and language foundation models (DINOv2 & GPT-4o) on sim-rendered 3D assets, enabling easy scaling to 10K+ object-query pairs (BEHAVIOR & Objaverse), all without human effort.

Wenlong Huang · 1 year ago

We first perform multi-view DINOv2 feature fusion for rendered 3D assets and cluster the fused features. We then visually prompt VLMs to “brainstorm” associated tasks and identify the relevant regions. The features of those regions are convolved over the fused 3D features to obtain continuous annotations.
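The last step, turning a selected region into a continuous annotation, can be sketched roughly as follows. This is one reading of "convolved over the fused features", assumed here to be cosine similarity between each point's fused feature and the mean feature of the selected region, then rescaled to [0, 1]. The function names and the 2-D toy features are hypothetical; the actual pipeline may differ.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, defined as 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def region_affordance(point_feats: List[List[float]],
                      region_ids: List[int]) -> List[float]:
    """Score every point by similarity to the mean feature of a region.

    Assumption: negative similarities are clipped to 0 and scores are
    divided by the maximum, so the most similar point scores 1.0.
    """
    dim = len(point_feats[0])
    ref = [sum(point_feats[i][d] for i in region_ids) / len(region_ids)
           for d in range(dim)]
    sims = [max(0.0, cosine(f, ref)) for f in point_feats]
    top = max(sims)
    return [s / top if top > 0 else 0.0 for s in sims]


# Toy fused features for four 3D points; points 0 and 1 form the
# region a VLM is imagined to have picked (e.g. "the handle").
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
scores = region_affordance(feats, region_ids=[0, 1])
print([round(s, 3) for s in scores])
```

Points whose features resemble the selected region receive high values even if they lie outside it, which is how a discrete region choice becomes a smooth label.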

Wenlong Huang · 1 year ago

We then train text-conditioned layers on top of DINOv2, a key design enabling *zero-shot generalization* to complex real-world scenes despite being trained only in sim. Intuitively, this connects self-supervised features that capture rich geometric structures to diverse task semantics.
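As a rough illustration of what a text-conditioned layer over frozen visual features can look like, the sketch below applies a per-channel scale and shift chosen by the query (a FiLM-style modulation), followed by a linear head and a sigmoid. The thread does not say which conditioning mechanism is used, so the FiLM choice, the hand-picked weights, and the two-channel toy features are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical modulation parameters (scale, shift) per query. In a
# trained model these would be predicted from a text embedding.
QUERY_TO_FILM: Dict[str, Tuple[List[float], List[float]]] = {
    "grasp the handle": ([4.0, 0.0], [-2.0, 0.0]),
    "press the button": ([0.0, 4.0], [0.0, -2.0]),
}

# Hypothetical weights of the linear head applied after modulation.
HEAD_WEIGHTS = [1.0, 1.0]


def affordance_score(pixel_feat: List[float], query: str) -> float:
    """Modulate a frozen pixel feature by the query, then map to (0, 1)."""
    gamma, beta = QUERY_TO_FILM[query]
    modulated = [g * f + b for f, g, b in zip(pixel_feat, gamma, beta)]
    logit = sum(w * m for w, m in zip(HEAD_WEIGHTS, modulated))
    return 1.0 / (1.0 + math.exp(-logit))


# Toy frozen features: channel 0 ~ "handle-like", channel 1 ~ "button-like".
handle_pixel = [1.0, 0.0]
button_pixel = [0.0, 1.0]

for query in QUERY_TO_FILM:
    print(query,
          round(affordance_score(handle_pixel, query), 3),
          round(affordance_score(button_pixel, query), 3))
```

The visual features stay fixed and only the small conditioned layers change with the instruction, so the same pixel can score high for one task and low for another.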

Wenlong Huang · 1 year ago

Compared to CLIP & open-vocab detectors, affordance stands out as a continuous, fine-grained, manipulation-centric alternative. Surprisingly, it works on some unseen human activities too! With >200 Hz inference, it also runs on videos taken in the lab & an Airbnb w/ a hand-held camera.

Wenlong Huang · 1 year ago

As a task-conditioned visual representation, it notably improves generalization in manipulation, especially text-following behaviors. Policies learned w/ 10 demos generalize not only to novel poses, instances, and categories, but also to unseen instructions, all evaluated zero-shot.

Wenlong Huang · 1 year ago

Check out our interactive demos and try your own images and prompts! The work would not be possible without the great effort led by @Yihe_yihe and the rest of the team: Yingke Wang @ChengshuEricLi Roy Yuan @RuohanZhang76 @jiajunwu_cs @drfeifei.

Wenlong Huang · 1 year ago

For more, check out: Website: Paper: Demo: Code: Full code and dataset will be released in the coming weeks.

Yixuan Wang · 1 year ago

Congrats! Very awesome work!!

Wenlong Huang · 1 year ago

Thank you Yixuan!
