How to scale visual affordance learning that is fine-grained, task-conditioned, and works in the wild and in dynamic envs? Introducing Unsupervised Affordance Distillation (UAD): distills affordances from off-the-shelf foundation models, all without manual labels. Very excited this is nominated as a Best Paper Finalist at #ICRA2025! 🧵👇

Wenlong Huang

5,661 subscribers

93,552 views • 1 year ago • via X (Twitter)

Science & Technology, Education

Comments: 11

Wenlong Huang, 1 year ago

Visual affordance allows robots to perceive actionable opportunities in an env, crucial for manipulation. We formulate affordance as language-conditioned pixel-level continuous probabilities, from identifying the exact grasp point on handles, to where to press pumps & hold scissors.
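To make the formulation concrete, here is a minimal sketch of how a pixel-level affordance map might be consumed downstream. It assumes the map arrives as a 2D array of values in [0, 1] for one language query; the function name `top_affordance_pixel` and the toy map are hypothetical and not taken from the UAD code.

```python
import numpy as np


def top_affordance_pixel(affordance_map):
    """Return (row, col) of the highest-scoring pixel in a 2D affordance map.

    Assumption: the map holds one continuous score in [0, 1] per pixel
    for a single language query (e.g. "where to grasp the mug").
    """
    a = np.asarray(affordance_map, dtype=float)
    if a.ndim != 2:
        raise ValueError("expected a 2D map of per-pixel scores")
    if a.min() < 0.0 or a.max() > 1.0:
        raise ValueError("scores are expected to lie in [0, 1]")
    row, col = np.unravel_index(np.argmax(a), a.shape)
    return int(row), int(col)


# Toy 4x6 map: a soft blob peaking on a hypothetical "handle" pixel.
toy_map = np.zeros((4, 6))
toy_map[2, 3] = 0.6
toy_map[2, 4] = 0.9
peak = top_affordance_pixel(toy_map)
```

Because the scores are continuous rather than a binary mask, a consumer can also threshold or sample from the map instead of taking only the peak.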

Wenlong Huang, 1 year ago

Yet scaling affordance is tough due to fine-grained labels. Our solution: automate labeling w/ vision and language foundation models (DINOv2 & GPT-4o) on sim-rendered 3D assets, enabling easy scaling to 10K+ object-query pairs (BEHAVIOR & Objaverse), all without human effort.

Wenlong Huang, 1 year ago

We first perform multi-view DINOv2 feature fusion for rendered 3D assets, cluster them, and then visually prompt VLMs to “brainstorm” associated tasks and identify relevant regions, where associated features are convolved over fused 3D features to obtain continuous annotations.
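A rough sketch of the fusion and annotation steps described above, under loud assumptions: per-view features are fused by averaging over the views in which a 3D point is visible, and "convolving" a region's features over the fused features is approximated as cosine similarity against the region's mean feature, rescaled to [0, 1]. The clustering and VLM prompting steps are omitted, and the names `fuse_views` and `continuous_labels` are hypothetical, not the authors' API.

```python
import numpy as np


def fuse_views(view_feats, visible):
    """Average per-view features for each 3D point over the views that see it.

    view_feats: (V, N, D) features for N points from V rendered views.
    visible:    (V, N) boolean mask, True where a point is seen in a view.
    """
    weights = visible[..., None].astype(float)        # (V, N, 1)
    counts = np.maximum(weights.sum(axis=0), 1.0)     # avoid divide-by-zero
    return (view_feats * weights).sum(axis=0) / counts


def continuous_labels(fused, region_mask):
    """Score every point by similarity to a selected region, scaled to [0, 1].

    Assumption: similarity is cosine similarity to the region's mean feature.
    """
    unit = fused / (np.linalg.norm(fused, axis=1, keepdims=True) + 1e-8)
    ref = unit[region_mask].mean(axis=0)
    ref = ref / (np.linalg.norm(ref) + 1e-8)
    sim = unit @ ref                                  # values in [-1, 1]
    lo, hi = sim.min(), sim.max()
    return (sim - lo) / (hi - lo + 1e-8)


# Toy asset: 5 points, 3-dim features, 2 views (point 2 hidden in view 1).
base = np.array([[1.0, 0.0, 0.0],
                 [0.9, 0.1, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.0, 0.9, 0.1],
                 [0.0, 0.0, 1.0]])
view_feats = np.stack([base, base])
visible = np.array([[True, True, True, True, True],
                    [True, True, False, True, True]])
fused = fuse_views(view_feats, visible)

# Pretend a VLM picked point 0 as the task-relevant region.
region_mask = np.array([True, False, False, False, False])
labels = continuous_labels(fused, region_mask)
```

In this toy case the selected point scores highest, its near-duplicate neighbor scores almost as high, and dissimilar points fall to the bottom of the range, which is the kind of soft annotation a binary mask would not give.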

Wenlong Huang, 1 year ago

We then train text-conditioned layers on top of DINOv2 – a key design enabling *zero-shot generalization* to complex real-world scenes despite being trained only in sim. Intuitively, this connects self-supervised features that capture rich geometric structures to diverse task semantics.
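One way to picture a text-conditioned layer on top of frozen visual features is a feature-wise scale-and-shift driven by the text embedding, followed by a per-patch linear readout. The sketch below is an assumption made for illustration, not a description of the released model: the weights are random and untrained, the dimensions are toy-sized, and `text_conditioned_head` and the parameter names are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def text_conditioned_head(patch_feats, text_emb, params):
    """Map frozen patch features plus a text embedding to a per-patch score map.

    patch_feats: (H, W, D) features from a frozen visual backbone.
    text_emb:    (T,) embedding of the language query.
    params:      dict of weights for the (hypothetical) conditioning layer.
    """
    gamma = text_emb @ params["W_gamma"]              # (D,) per-channel scale
    beta = text_emb @ params["W_beta"]                # (D,) per-channel shift
    modulated = patch_feats * (1.0 + gamma) + beta    # broadcast over H, W
    logits = modulated @ params["w_out"] + params["b_out"]  # (H, W)
    return sigmoid(logits)


rng = np.random.default_rng(0)
D, T = 8, 4
params = {
    "W_gamma": rng.normal(size=(T, D)),
    "W_beta": rng.normal(size=(T, D)),
    "w_out": rng.normal(size=D),
    "b_out": 0.0,
}
patch_feats = rng.normal(size=(3, 5, D))   # stand-in for backbone features
query_a = rng.normal(size=T)               # stand-in for one text query
query_b = rng.normal(size=T)               # stand-in for a different query

map_a = text_conditioned_head(patch_feats, query_a, params)
map_b = text_conditioned_head(patch_feats, query_b, params)
```

The point of the design is visible even with random weights: the visual features stay fixed, and only the text embedding changes which regions of the same image light up.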

Wenlong Huang, 1 year ago

Compared to CLIP & open-vocab detectors, affordance stands out as a continuous, fine-grained, manipulation-centric alternative. Surprisingly, it works on some unseen human activities too! With >200 Hz inference, it also runs on videos taken in the lab & Airbnb w/ hand-held camera.

Wenlong Huang, 1 year ago

As a task-conditioned visual representation, it notably improves generalization in manipulation, especially text-following behaviors. Policies learned w/ 10 demos not only generalize to novel poses, instances, categories, but also to unseen instructions, all evaluated zero-shot.

Wenlong Huang, 1 year ago

Check out our interactive demos and try your own images and prompts! The work would not be possible without the great effort led by @Yihe_yihe and by the rest of the team: Yingke Wang @ChengshuEricLi Roy Yuan @RuohanZhang76 @jiajunwu_cs @drfeifei.

Wenlong Huang, 1 year ago

For more, check out: Website: Paper: Demo: Code: Full code and dataset will be released in the coming weeks.

Yixuan Wang, 1 year ago

Congrats! Very awesome work!!

Wenlong Huang, 1 year ago

Thank you Yixuan!
