How do we scale visual affordance learning that is fine-grained, task-conditioned, and works in the wild and in dynamic envs? Introducing Unsupervised Affordance Distillation (UAD): it distills affordances from off-the-shelf foundation models, all without manual labels. Very excited this is nominated as a Best Paper Finalist at #ICRA2025! 🧵👇

Wenlong Huang

5,661 subscribers

93,552 views • 1 year ago • via X (Twitter)

Science & Technology, Education

11 comments

Wenlong Huang • 1 year ago

Visual affordance allows robots to perceive actionable opportunities in an env, which is crucial for manipulation. We formulate affordance as language-conditioned, pixel-level continuous probabilities, from identifying the exact grasp point on handles to where to press pumps & hold scissors.
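To make the formulation concrete, here is a minimal sketch of how such an output could be consumed, assuming the model returns one value in [0, 1] per pixel for a given text query. The names `Heatmap`, `peak_pixel`, `threshold_mask`, and the toy values are hypothetical and not taken from the UAD code.

```python
from typing import List, Tuple

# Assumed representation: heatmap[row][col] is the affordance score in [0, 1]
# for one text query (for example "where to grasp the mug").
Heatmap = List[List[float]]


def peak_pixel(heatmap: Heatmap) -> Tuple[int, int, float]:
    """Return (row, col, score) of the highest-scoring pixel."""
    best = (0, 0, heatmap[0][0])
    for r, row in enumerate(heatmap):
        for c, score in enumerate(row):
            if score > best[2]:
                best = (r, c, score)
    return best


def threshold_mask(heatmap: Heatmap, threshold: float) -> List[List[bool]]:
    """Binarize the continuous map; True marks pixels at or above the threshold."""
    return [[score >= threshold for score in row] for row in heatmap]


# Toy 3x3 map with made-up values, standing in for a real prediction.
example = [
    [0.02, 0.05, 0.10],
    [0.04, 0.30, 0.85],
    [0.01, 0.20, 0.40],
]

row, col, score = peak_pixel(example)
mask = threshold_mask(example, 0.3)
```

Keeping the map continuous, rather than committing to a box or a binary mask up front, lets a downstream policy choose between a single peak and a thresholded region.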

Wenlong Huang • 1 year ago

Yet scaling affordance is tough because it needs fine-grained labels. Our solution: automate labeling w/ vision and language foundation models (DINOv2 & GPT-4o) on sim-rendered 3D assets, enabling easy scaling to 10K+ object-query pairs (BEHAVIOR & Objaverse), all without human effort.

Wenlong Huang • 1 year ago

We first perform multi-view DINOv2 feature fusion for rendered 3D assets, cluster them, and then visually prompt VLMs to “brainstorm” associated tasks and identify relevant regions, where associated features are convolved over fused 3D features to obtain continuous annotations.
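The last step of this pipeline can be sketched as follows. This is one reading of it, not the authors' implementation: it assumes "convolved" means scoring every fused 3D point by the cosine similarity between its feature and the mean feature of the region the VLM selected, then rescaling to [0, 1]. The function names and the 2-D toy features are hypothetical; real DINOv2 features are much higher-dimensional.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_feature(features: List[Sequence[float]]) -> List[float]:
    """Element-wise mean of a list of equally sized feature vectors."""
    count = len(features)
    return [sum(column) / count for column in zip(*features)]


def continuous_annotation(
    point_features: List[Sequence[float]], region_indices: List[int]
) -> List[float]:
    """Score every point by similarity to the selected region, rescaled to [0, 1]."""
    reference = mean_feature([point_features[i] for i in region_indices])
    sims = [cosine(feature, reference) for feature in point_features]
    lo, hi = min(sims), max(sims)
    if hi == lo:
        return [1.0 for _ in sims]
    return [(s - lo) / (hi - lo) for s in sims]


# Toy fused features for five 3D points; points 0 and 1 play the role of the
# region picked out by the VLM (for example, a handle cluster).
points = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.5, 0.5],
    [0.1, 0.9],
    [0.0, 1.0],
]
scores = continuous_annotation(points, region_indices=[0, 1])
```

Points whose features resemble the selected region score close to 1 and dissimilar points fall toward 0, which gives a soft label instead of a hard cluster assignment.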

Wenlong Huang • 1 year ago

We then train text-conditioned layers on top of DINOv2 – a key design enabling *zero-shot generalization* to complex real-world scenes despite being trained only in sim. Intuitively, this connects self-supervised features that capture rich geometric structures to diverse task semantics.
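One common way to condition frozen visual features on text is feature-wise modulation (FiLM-style scaling and shifting). The sketch below assumes that mechanism purely for illustration; the thread does not state the exact layer type. All names, shapes, and weights are hypothetical, and the weights are hand-set rather than trained.

```python
import math
from typing import Dict, List, Sequence


def linear(x: Sequence[float], weight: List[List[float]], bias: List[float]) -> List[float]:
    """Compute y = W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def film_affordance(
    pixel_features: List[Sequence[float]],
    text_embedding: Sequence[float],
    params: Dict[str, list],
) -> List[float]:
    """Modulate frozen per-pixel features with text-derived scale/shift, then score."""
    gamma = linear(text_embedding, params["w_gamma"], params["b_gamma"])  # scale
    beta = linear(text_embedding, params["w_beta"], params["b_beta"])     # shift
    scores = []
    for feature in pixel_features:
        modulated = [g * f + b for g, f, b in zip(gamma, feature, beta)]
        logit = sum(w * m for w, m in zip(params["w_out"], modulated)) + params["b_out"][0]
        scores.append(sigmoid(logit))
    return scores


# Hand-set toy parameters (assumption): feature dim 2, text-embedding dim 2.
params = {
    "w_gamma": [[1.0, 0.0], [0.0, 1.0]],
    "b_gamma": [0.0, 0.0],
    "w_beta": [[0.0, 0.0], [0.0, 0.0]],
    "b_beta": [0.0, 0.0],
    "w_out": [4.0, 4.0],
    "b_out": [-2.0],
}

# Two "pixels": one handle-like feature, one button-like feature.
pixels = [[1.0, 0.0], [0.0, 1.0]]

grasp_scores = film_affordance(pixels, [1.0, 0.0], params)  # stand-in "grasp" query
press_scores = film_affordance(pixels, [0.0, 1.0], params)  # stand-in "press" query
```

The same frozen features produce different maps for different queries, and only the small conditioning layers would need training in this setup.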

Wenlong Huang • 1 year ago

Compared to CLIP & open-vocab detectors, affordance stands out as a continuous, fine-grained, manipulation-centric alternative. Surprisingly, it works on some unseen human activities too! With >200 Hz inference, it also runs on videos taken in the lab & an Airbnb w/ a hand-held camera.

Wenlong Huang • 1 year ago

As a task-conditioned visual representation, it notably improves generalization in manipulation, especially text-following behaviors. Policies learned w/ 10 demos generalize not only to novel poses, instances, and categories, but also to unseen instructions, all evaluated zero-shot.

Wenlong Huang • 1 year ago

Check out our interactive demos and try your own images and prompts! This work would not have been possible without the great effort led by @Yihe_yihe and by the rest of the team: Yingke Wang @ChengshuEricLi Roy Yuan @RuohanZhang76 @jiajunwu_cs @drfeifei.

Wenlong Huang • 1 year ago

For more, check out: Website: Paper: Demo: Code: Full code and dataset will be released in the coming weeks.

Yixuan Wang • 1 year ago

Congrats! Very awesome work!!

Wenlong Huang • 1 year ago

Thank you Yixuan!
