Chain-of-thought reasoning is a powerful tool to enable language models to work through complex problems. Can we use this with robots? With embodied chain-of-thought, vision-language-action (VLA) models can think through perception and planning! A 🧵👇

Sergey Levine

119,919 subscribers

30,388 views • 2 years ago • via X (Twitter)

Education, Science & Technology, News & Politics

9 Comments

Sergey Levine • 2 years ago

Embodied chain of thought allows the VLA (in this case, a finetuned OpenVLA model) to work through a complex task by reasoning over subtasks, detecting objects, and making step-by-step plans. When generating an action, the VLA works through these steps automatically.
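As a rough illustration of the idea above, the sketch below serializes a plan, the current subtask, and detected objects into text that precedes the action. Everything here is an assumption made for illustration: the `ReasoningTrace` class, the `format_trace` function, the field labels, and the bounding-box format are hypothetical and are not taken from the OpenVLA code or the paper.

```python
from dataclasses import dataclass, field


@dataclass
class ReasoningTrace:
    """Hypothetical container for the reasoning produced at one control step."""
    task: str
    plan: list[str]
    subtask: str
    # Object name -> (x_min, y_min, x_max, y_max) in pixels (assumed format).
    objects: dict[str, tuple[int, int, int, int]] = field(default_factory=dict)


def format_trace(trace: ReasoningTrace) -> str:
    """Serialize the reasoning as text that would come before the action tokens."""
    plan = " ".join(f"{i + 1}. {step}" for i, step in enumerate(trace.plan))
    objects = ", ".join(f"{name} {box}" for name, box in sorted(trace.objects.items()))
    return (
        f"TASK: {trace.task}\n"
        f"PLAN: {plan}\n"
        f"SUBTASK: {trace.subtask}\n"
        f"OBJECTS: {objects}\n"
        "ACTION:"
    )


trace = ReasoningTrace(
    task="put the carrot in the pot",
    plan=["reach the carrot", "grasp it", "move to the pot"],
    subtask="reach the carrot",
    objects={"carrot": (10, 20, 30, 40)},
)
print(format_trace(trace))
```

The point of the layout is only that the action is generated after the intermediate steps, so it can condition on them.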

Sergey Levine • 2 years ago

How do we train OpenVLA for embodied chain of thought? We distill a variety of other foundation models, such as Gemini and Grounding DINO, into synthetic examples that can teach the VLA to perform embodied chain of thought.
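A minimal sketch of how such synthetic examples might be assembled, assuming one model supplies plan text and an object detector supplies bounding boxes. The function `build_example`, the stand-in callables `plan_fn` and `detect_fn`, and the record layout are hypothetical; no real Gemini or Grounding DINO API is called here, and the actual pipeline may differ.

```python
from typing import Callable, Sequence

Box = tuple[int, int, int, int]


def build_example(
    observation: object,
    instruction: str,
    action: Sequence[float],
    plan_fn: Callable[[str], list[str]],
    detect_fn: Callable[[object], dict[str, Box]],
) -> dict[str, str]:
    """Combine annotator outputs and a recorded action into one training record."""
    plan_text = "; ".join(plan_fn(instruction))  # stand-in for a language model
    boxes = detect_fn(observation)               # stand-in for an object detector
    object_text = ", ".join(f"{name} {box}" for name, box in sorted(boxes.items()))
    action_text = " ".join(f"{a:.2f}" for a in action)
    target = f"PLAN: {plan_text}\nOBJECTS: {object_text}\nACTION: {action_text}"
    return {"prompt": instruction, "target": target}


# Stub annotators so the sketch runs without any external model.
def fake_planner(instruction: str) -> list[str]:
    return ["locate the carrot", "pick up the carrot"]


def fake_detector(observation: object) -> dict[str, Box]:
    return {"carrot": (10, 20, 30, 40)}


example = build_example(
    "frame_000", "pick up the carrot", [0.1, -0.2, 0.0], fake_planner, fake_detector
)
print(example["target"])
```

Passing the annotators in as callables keeps the record format independent of whichever foundation models produce the labels.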

Sergey Levine • 2 years ago

The resulting model can solve complex tasks that require multi-stage inference. It can generalize more effectively to novel objects, perform longer tasks, and understand sophisticated instructions.

Sergey Levine • 2 years ago

The resulting VLA can even interpret human corrections and interventions, incorporating them into the embodied chain-of-thought process!

Sergey Levine • 2 years ago

While our main experiments use the Bridge v2 setup, we also tested on a variety of other embodiments from OXE.

Sergey Levine • 2 years ago

This was a really fun collaboration with @MiZawalski, @verityw_, @KarlPertsch, @oier_mees, @chelseabfinn Website: Paper:

Sergey Levine • 2 years ago

For more, check out these posts by Michal and Will:

Alex 📚 PromptLeo • 2 years ago

Thanks for sharing 👍 How do you create videos like these?

Joanne Mercado • 2 years ago

You’re making a lot of progress in robotics

Related Videos

Embodied chain of thought (ECoT) is a powerful tool to get VLAs to think through problems, but why does it work? In our new work, we analyze various lightweight ECoT-like strategies, including co-training, to see what is the "minimal" amount of reasoning that can boost VLAs 🧵👇

Sergey Levine

22,415 views • 1 year ago

Vision-language models (VLMs) can see well, but they struggle to reason. In this episode, Antonia Wüst (PhD researcher, TU Darmstadt) explains how combining VLMs with program synthesis yields more reliable visual reasoning, with fewer tokens than chain-of-thought.

Ndea

22,130 views • 6 months ago

Vision-language models can control robots, but what if the prompt is too complex for the robot to follow directly? We developed a way to get robots to “think through” complex instructions, feedback, and interjections. We call it the Hierarchical Interactive Robot (Hi Robot).

Physical Intelligence

116,845 views • 1 year ago

Language following is a tough problem for VLAs: while these models can follow complex language, in practice getting datasets that enable language following is hard. We developed a method to counterfactually and automatically label data to improve language following! 🧵👇

Sergey Levine

44,229 views • 11 months ago

VLAExplain: Interpreting Vision-Language-Action (VLA) Models. VLAExplain is an interpretability toolkit designed to help users visually understand the inner workings of Vision-Language-Action (VLA) models. Currently, attention analysis is supported for both the pi05 and unifolm-vla models. For details, please check the pi05 and UnifoLM-VLA readme files respectively. Demo of pi05 in action:

Ryohei Sasaki@engineer

12,774 views • 2 months ago

What if robots could think longer on harder problems without saying a single word? 🤔
We introduce RD-VLA (Recurrent-Depth VLA): a latent, iterative reasoning architecture for robot control.
❌ No Chain-of-Thought tokens.
❌ No extra memory overhead.
✅ Just reasoning, directly in latent space. 🧠🤖
Project page: 👇🧵

Jiafei Duan

13,833 views • 5 months ago

3D-VLA: A 3D Vision-Language-Action Generative World Model. Recent vision-language-action (VLA) models rely on 2D inputs, lacking integration with the broader realm of the 3D physical world. Furthermore, they perform action prediction by learning a direct mapping from

AK

85,570 views • 2 years ago

Day 2 of 3 MLX Releases: Introducing Local Computer-Use 🚀🔥
A powerful tool built with MLX that uses vision-language models and voice models to control your Mac through visual understanding, planning and reasoning.
Features:
⚡️ Automate your workflow with natural language
😎 Control your computer “hands-free”
This project now supports both:
🤖 Level 1 (GUI Agent)
🧠 Level 2 (Autonomous GUI Agent)
Get started:
> pip install -U mlx-vlm mlx-audio mlx-whisper
Please leave us a star and send a PR :)

Prince Canuma

45,867 views • 1 year ago

Pi0 from Physical Intelligence is one of the best generalist Vision-Language-Action (VLA) models for robotics (and predecessor of Pi0.5). 🧵 Here's how it works + how you can fine-tune it on your own robot for a simple task:

Ilia

32,956 views • 10 months ago

Humans learn and improve from failures. Similarly, foundation models adapt based on human feedback. Can we leverage this failure understanding to enhance robotics systems that use foundation models? Introducing AHA—a vision-language model for detecting and reasoning over failures in robotic manipulation. Project page: 🧵Thread👇 Aha!

Jiafei Duan

48,773 views • 1 year ago

Google DeepMind introduced two foundational models for embodied reasoning, enabling robots to comprehend, react, and take action in the physical world:
⦿ Gemini Robotics – built on Gemini 2.0. Integrates vision, language, and action for real-world dexterity.
⦿ Gemini Robotics-ER – enhances spatial reasoning for advanced robotic control.
They are working with Apptronik to develop the next generation of humanoid robots.

The Humanoid Hub

73,097 views • 1 year ago

Excited to introduce 𝐋𝐀𝐏𝐀: the first unsupervised pretraining method for Vision-Language-Action models.
Outperforms SOTA models trained with ground-truth actions.
30x more efficient than conventional VLA pretraining.
📝: 🧵 1/9

Joel Jang

46,018 views • 1 year ago

The next frontier of autonomous driving is unlocked by reasoning models. NVIDIA Alpamayo brings together open AI models with reasoning capabilities, closed-loop simulation tools, and massive real-world driving datasets. Alpamayo 1 is a vision–language–action model that explains its own decisions through explicit reasoning traces, enabling trustworthy, humanlike decision-making. Together with NVIDIA’s Physical AI dataset and AlpaSim simulation, Alpamayo provides the tools and scale required to enable level 4 autonomous vehicles. ▶️ Watch now:

NVIDIA DRIVE

35,324 views • 6 months ago

Want to see our open models in action? Watch how gpt-oss builds a video game—using tools step-by-step within chain-of-thought reasoning 👾🍓

OpenAI

488,957 views • 11 months ago

💡 Can robots autonomously design their own tools and figure out how to use them? We present VLMgineer 🛠️, a framework that leverages Vision Language Models with Evolutionary Search to automatically generate and refine physical tool designs alongside corresponding robot action plans.
✨ VLMgineer can fully automate tool and action design with AI-driven physical creativity. No human intervention. No pre-defined templates or few-shot examples.
✨ VLMgineer outperforms human-specified designs and existing everyday tools.
✨ We let the VLM fully decide how to evolve designs.
Deep dive with me: 🧵

Junyao Shi

29,764 views • 1 year ago

Geoffrey Hinton says AI reasoning can be real thought because language itself is a form of thinking. Words let humans and AI model almost anything, but human thought goes beyond words, through images, space, and physical movement: "the smarter system is the one that can use all of them"

Haider.

24,880 views • 2 months ago

Autonomous driving with Chain of Thought - autopilot thinking out loud in text! LINGO-1 is the most interesting work I've read in autodriving for a while.
Before: perception -> driving action
After: perception -> textual reasoning -> action
LINGO-1 trains a video-language model that comments on the ongoing scene. You can ask it to explain its decisions ("why are you stopped?") and planning ("what are you gonna do next?").
The explicit reasoning step comes with key benefits:
- Explainability: driving models are no longer a mysterious blackbox that you pray for safety.
- Counterfactuals: it's able to imagine scenarios that are not in the training data, and reason through how to handle them correctly.
- Long-tail programming: there are soooo many edge cases in driving. It's impossible to have good data coverage on everything. Instead of collecting 1000s of examples to "neural program" a case, you can now have a human teacher write prompts to explain a handful of examples.
LINGO-1 is closely related to a few works in game AI:
- MineDojo (my team's work at NVIDIA) learns a reward model that aligns Minecraft gameplay videos with their transcripts. The model, called "MineCLIP", is able to ground commentary text in the video pixels.
- Thought Cloning (Jeff Clune): pixel -> language -> action loop in gridworlds.

Jim Fan

552,760 views • 2 years ago

Excited to share a glimpse of what’s possible with specialized models on OpenLedger🐙 In this video, you’ll see a chat agent in action: think of it as the Perplexity of crypto. Powered by tool calling, this specialized model showcases how OpenLedger enables domain-specific intelligence for AI-driven apps. The vision for these specialized models is to use English to interact, query, and transact on-chain.🐙

OpenLedger

87,878 views • 1 year ago

New paper introduces NaVILA, a vision-language-action (VLA) model that integrates high-level visual-language understanding and low-level locomotion control. It enables humanoid or quadruped robots to navigate unseen environments with natural language instructions.

The Humanoid Hub

20,458 views • 1 year ago