How to scale visual affordance learning that is fine-grained, task-conditioned, works in-the-wild, in dynamic envs? Introducing Unsupervised Affordance Distillation (UAD): distills affordances from off-the-shelf foundation models, all without manual labels. Very excited this is nominated as Best Paper Finalist at #ICRA2025! 🧵👇

Wenlong Huang

5,661 subscribers

93,552 views • 1 year ago • via X (Twitter)

Science, Technology & Education

11 comments

Wenlong Huang • 1 year ago

Visual affordance allows robots to perceive actionable opportunities in an env, crucial for manipulation. We formulate affordance as language-conditioned pixel-level continuous probabilities, from identifying exact grasp point on handles, to where to press pumps & hold scissors.
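To make the "pixel-level continuous probabilities" formulation concrete, here is a minimal sketch of how such a map might be consumed downstream. The map values, the query it stands for, and the helper names `best_pixel` and `pixels_above` are hypothetical and only illustrate the data shape; they are not taken from the UAD code.

```python
from typing import List, Tuple

# An affordance map: H x W values in [0, 1], one per pixel, for one text query.
AffordanceMap = List[List[float]]


def best_pixel(aff: AffordanceMap) -> Tuple[int, int]:
    """Return (row, col) of the pixel with the highest affordance value."""
    best_r, best_c = 0, 0
    for r, row in enumerate(aff):
        for c, value in enumerate(row):
            if value > aff[best_r][best_c]:
                best_r, best_c = r, c
    return best_r, best_c


def pixels_above(aff: AffordanceMap, threshold: float) -> List[Tuple[int, int]]:
    """Return every pixel whose value is at least `threshold`, in row-major order."""
    return [
        (r, c)
        for r, row in enumerate(aff)
        for c, value in enumerate(row)
        if value >= threshold
    ]


# Hypothetical 3x3 map for a query such as "grasp the handle" (made-up numbers).
toy_map = [
    [0.02, 0.05, 0.10],
    [0.04, 0.30, 0.85],
    [0.01, 0.20, 0.60],
]

grasp_pixel = best_pixel(toy_map)            # single most likely contact point
candidate_region = pixels_above(toy_map, 0.5)  # softer region around it
```

Because the values are continuous rather than a binary mask, the same map can yield either one precise point or a graded region, depending on what the downstream controller needs.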

Wenlong Huang • 1 year ago

Yet scaling affordance is tough due to fine-grained labels. Our solution: automate labeling w/ vision and language foundation models (DINOv2 & GPT-4o) on sim-rendered 3D assets, enabling easy scaling to 10K+ object-query pairs (BEHAVIOR & Objaverse), all without human effort.

Wenlong Huang • 1 year ago

We first perform multi-view DINOv2 feature fusion for rendered 3D assets, cluster them, and then visually prompt VLMs to “brainstorm” associated tasks and identify relevant regions, where associated features are convolved over fused 3D features to obtain continuous annotations.
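The fusion and annotation steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: per-view features are fused by simple averaging, the clustering and VLM prompting are replaced by a hand-picked list of region point ids, and the step described as "convolved" is read here as cosine similarity against the region's mean feature. All function names and toy numbers are hypothetical.

```python
import math
from typing import Dict, List

Vector = List[float]


def fuse_views(observations: Dict[int, List[Vector]]) -> Dict[int, Vector]:
    """Average the features each 3D point received from the views that saw it."""
    fused = {}
    for point_id, feats in observations.items():
        dim = len(feats[0])
        fused[point_id] = [sum(f[i] for f in feats) / len(feats) for i in range(dim)]
    return fused


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def continuous_labels(fused: Dict[int, Vector], region_ids: List[int]) -> Dict[int, float]:
    """Score every point against the mean feature of the selected region.

    Negative similarities are clipped to 0 and scores are rescaled so the
    best-matching point gets 1.0 (an assumed normalization).
    """
    dim = len(next(iter(fused.values())))
    ref = [sum(fused[i][d] for i in region_ids) / len(region_ids) for d in range(dim)]
    raw = {pid: max(0.0, cosine(feat, ref)) for pid, feat in fused.items()}
    top = max(raw.values())
    if top == 0.0:
        return raw
    return {pid: score / top for pid, score in raw.items()}


# Toy data: point 0 is seen from two views, points 1 and 2 from one view each.
observations = {
    0: [[0.9, 0.1], [1.1, -0.1]],
    1: [[0.8, 0.6]],
    2: [[0.0, 1.0]],
}
fused = fuse_views(observations)
# Pretend the VLM picked the cluster containing point 0 for some task.
labels = continuous_labels(fused, region_ids=[0])
```

In this toy case point 0 scores highest, point 1 gets a partial score because its feature is only partly aligned with the reference, and point 2 scores zero, which is the kind of graded annotation the thread describes.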

Wenlong Huang • 1 year ago

We then train text-conditioned layers on top of DINOv2, a key design enabling *zero-shot generalization* to complex real-world scenes despite being trained only in sim. Intuitively, this connects self-supervised features that capture rich geometric structures to diverse task semantics.
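One way to picture "text-conditioned layers on top of DINOv2" is a feature-wise scale and shift driven by the text embedding, followed by a per-pixel sigmoid head. The thread does not say which conditioning mechanism is used, so the FiLM-style modulation below is an assumption, and every parameter name, shape, and number is hypothetical. The pixel features stand in for frozen backbone outputs.

```python
import math
from typing import Dict, List

Vector = List[float]


def linear(weights: List[Vector], bias: Vector, x: Vector) -> Vector:
    """Plain affine map: weights @ x + bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def film_affordance(pixel_feats: List[List[Vector]], text_emb: Vector, params: Dict) -> List[List[float]]:
    """Modulate frozen per-pixel features with text, then map to [0, 1].

    gamma and beta are predicted from the text embedding and applied
    channel-wise to every pixel feature (FiLM-style, assumed here).
    """
    gamma = linear(params["w_gamma"], params["b_gamma"], text_emb)
    beta = linear(params["w_beta"], params["b_beta"], text_emb)
    out = []
    for row in pixel_feats:
        out_row = []
        for feat in row:
            modulated = [g * x + b for g, x, b in zip(gamma, feat, beta)]
            logit = sum(w * m for w, m in zip(params["w_out"], modulated)) + params["b_out"]
            out_row.append(1.0 / (1.0 + math.exp(-logit)))
        out.append(out_row)
    return out


# Hand-set toy parameters: the text embedding selects which feature channel matters.
params = {
    "w_gamma": [[1.0, 0.0], [0.0, 1.0]],
    "b_gamma": [0.0, 0.0],
    "w_beta": [[0.0, 0.0], [0.0, 0.0]],
    "b_beta": [0.0, 0.0],
    "w_out": [1.0, 1.0],
    "b_out": -2.0,
}

# A 1x2 "image": pixel 0 is strong in channel 0, pixel 1 in channel 1.
pixel_feats = [[[4.0, 0.0], [0.0, 4.0]]]

map_a = film_affordance(pixel_feats, [1.0, 0.0], params)  # hypothetical query A
map_b = film_affordance(pixel_feats, [0.0, 1.0], params)  # hypothetical query B
```

The same frozen features produce different heatmaps for different queries, which is the property the thread credits for connecting geometric features to task semantics. In a real model the parameters would be learned rather than hand-set.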

Wenlong Huang • 1 year ago

Compared to CLIP & open-vocab detectors, affordance stands out as a continuous, fine-grained, manipulation-centric alternative. Surprisingly, it works on some unseen human activities too! With >200 Hz inference, it also runs on videos taken in the lab & Airbnb w/ hand-held camera.

Wenlong Huang • 1 year ago

As a task-conditioned visual representation, it notably improves generalization in manipulation, especially text-following behaviors. Policies learned w/ 10 demos not only generalize to novel poses, instances, categories, but also to unseen instructions, all evaluated zero-shot.

Wenlong Huang • 1 year ago

Check out our interactive demos and try your own images and prompts! The work is not possible without the great effort led by @Yihe_yihe and by the rest of the team: Yingke Wang @ChengshuEricLi Roy Yuan @RuohanZhang76 @jiajunwu_cs @drfeifei.

Wenlong Huang • 1 year ago

For more, check out: Website: Paper: Demo: Code: Full code and dataset will be released in the coming weeks.

Yixuan Wang • 1 year ago

Congrats! Very awesome work!!

Wenlong Huang • 1 year ago

Thank you Yixuan!
