How to scale visual affordance learning that is fine-grained, task-conditioned, works in-the-wild, in dynamic envs? Introducing Unsupervised Affordance Distillation (UAD): distills affordances from off-the-shelf foundation models, all without manual labels. Very excited this is nominated as Best Paper Finalist at #ICRA2025! 🧵👇

Wenlong Huang

5,735 subscribers

93,662 views • 1 year ago • via X (Twitter)

Science & Technology, Education

Comments: 11

Wenlong Huang • 1 year ago

Visual affordance allows robots to perceive actionable opportunities in an env, which is crucial for manipulation. We formulate affordance as language-conditioned, pixel-level continuous probabilities, from identifying the exact grasp point on handles to where to press pumps & hold scissors.
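To make the "pixel-level continuous probabilities" idea concrete, here is a minimal sketch of how a downstream consumer might read such a map. The map values, the 0.5 threshold, and the helper names `peak_pixel` and `actionable_region` are illustrative assumptions, not part of UAD.

```python
from typing import List, Tuple

# Hypothetical representation: rows of per-pixel affordance values in [0, 1].
AffordanceMap = List[List[float]]


def peak_pixel(aff: AffordanceMap) -> Tuple[int, int]:
    """Return (row, col) of the pixel with the highest affordance value."""
    best = (0, 0)
    for r, row in enumerate(aff):
        for c, value in enumerate(row):
            if value > aff[best[0]][best[1]]:
                best = (r, c)
    return best


def actionable_region(aff: AffordanceMap, threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Return all pixels whose affordance value is at or above the threshold."""
    return [
        (r, c)
        for r, row in enumerate(aff)
        for c, value in enumerate(row)
        if value >= threshold
    ]


# Toy map for a query such as "grasp the handle" (made-up numbers).
handle_map: AffordanceMap = [
    [0.02, 0.05, 0.10, 0.04],
    [0.03, 0.40, 0.92, 0.35],
    [0.01, 0.20, 0.55, 0.15],
]

grasp = peak_pixel(handle_map)                 # single best grasp pixel
region = actionable_region(handle_map, 0.5)    # soft region around it
```

Because the values are continuous rather than a binary mask, the same map yields both a single best point and a graded region, depending on what the policy needs.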

Wenlong Huang • 1 year ago

Yet scaling affordance is tough because it needs fine-grained labels. Our solution: automate labeling w/ vision and language foundation models (DINOv2 & GPT-4o) on sim-rendered 3D assets, enabling easy scaling to 10K+ object-query pairs (BEHAVIOR & Objaverse), all without human effort.

Wenlong Huang • 1 year ago

We first perform multi-view DINOv2 feature fusion for rendered 3D assets, cluster them, and then visually prompt VLMs to “brainstorm” associated tasks and identify relevant regions, where associated features are convolved over fused 3D features to obtain continuous annotations.
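The last step above, turning a VLM-selected region into continuous annotations, can be sketched roughly as follows. This is one reading of the post, not the authors' code: it assumes the "convolution" amounts to comparing a region's mean feature against every fused 3D point feature by cosine similarity, followed by min-max normalization. The function names and the toy 2-D features are hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def mean_feature(feats: Sequence[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of feature vectors."""
    return [sum(column) / len(feats) for column in zip(*feats)]


def continuous_labels(point_feats: Sequence[Vector], region_idx: Sequence[int]) -> List[float]:
    """Score every point by similarity to the selected region, scaled to [0, 1]."""
    reference = mean_feature([point_feats[i] for i in region_idx])
    sims = [cosine(f, reference) for f in point_feats]
    lo, hi = min(sims), max(sims)
    if hi == lo:
        return [0.0] * len(sims)
    return [(s - lo) / (hi - lo) for s in sims]


# Toy fused features for four 3D points; points 0 and 1 form the region
# that a VLM picked for some task (all values are made up).
point_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = continuous_labels(point_feats, region_idx=[0, 1])
```

Points inside or near the selected region receive labels close to 1, and dissimilar points fall toward 0, which gives a smooth target rather than a hard cluster mask.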

Wenlong Huang • 1 year ago

We then train text-conditioned layers on top of DINOv2 – a key design enabling *zero-shot generalization* to complex real-world scenes despite being trained only in sim. Intuitively, this connects self-supervised features that capture rich geometric structures to diverse task semantics.
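The post does not say how the text conditioning is implemented. As an illustration only, the sketch below assumes FiLM-style modulation, a common choice in which the text embedding produces a per-channel scale and shift for the frozen image features. All weights, embeddings, and names here are toy values and hypothetical helpers; no real DINOv2 or language model is involved.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def matvec(weights: Matrix, x: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def film_affordance(
    pixel_feats: List[List[Vector]],  # H x W grid of frozen image features
    text_emb: Vector,                 # embedding of the task instruction
    w_gamma: Matrix,                  # text -> per-channel scale
    w_beta: Matrix,                   # text -> per-channel shift
    w_out: Vector,                    # linear head over modulated channels
) -> List[List[float]]:
    """Return an H x W map of affordance probabilities for the given text."""
    gamma = matvec(w_gamma, text_emb)
    beta = matvec(w_beta, text_emb)
    return [
        [
            sigmoid(sum(w * (g * f + b) for w, g, f, b in zip(w_out, gamma, feat, beta)))
            for feat in row
        ]
        for row in pixel_feats
    ]


# Toy setup: 1 x 2 image, 2 feature channels, 2-D text embeddings.
pixel_feats = [[[2.0, -2.0], [-2.0, 2.0]]]
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
head = [1.0, 1.0]

grasp_map = film_affordance(pixel_feats, [1.0, 0.0], identity, zeros, head)
press_map = film_affordance(pixel_feats, [0.0, 1.0], identity, zeros, head)
```

The same image features yield different maps for the two instructions, because only the text-derived scale and shift change. In this kind of design the image backbone stays frozen and only the small conditioning layers are trained.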

Wenlong Huang • 1 year ago

Compared to CLIP & open-vocab detectors, affordance stands out as a continuous, fine-grained, manipulation-centric alternative. Surprisingly, it works on some unseen human activities too! With >200 Hz inference, it also runs on videos taken in the lab & an Airbnb w/ a hand-held camera.

Wenlong Huang • 1 year ago

As a task-conditioned visual representation, it notably improves generalization in manipulation, especially text-following behaviors. Policies learned w/ 10 demos not only generalize to novel poses, instances, categories, but also to unseen instructions, all evaluated zero-shot.

Wenlong Huang • 1 year ago

Check out our interactive demos and try your own images and prompts! The work would not be possible without the great effort led by @Yihe_yihe and by the rest of the team: Yingke Wang, @ChengshuEricLi, Roy Yuan, @RuohanZhang76, @jiajunwu_cs, @drfeifei.

Wenlong Huang • 1 year ago

For more, check out: Website: Paper: Demo: Code: Full code and dataset will be released in the coming weeks.

Yixuan Wang • 1 year ago

Congrats! Very awesome work!!

Wenlong Huang • 1 year ago

Thank you Yixuan!
