✨ Introducing Keypoint Action Tokens. 🤖 We translate visual observations and robot actions into a "language" that off-the-shelf LLMs can ingest and output. This transforms LLMs into in-context, low-level imitation learning machines. 🚀 Let me explain. 👇🧵

Norman Di Palo

1,517 subscribers

23,094 views • 2 years ago • via X (Twitter)

Science & Technology, Education

11 comments

Norman Di Palo, 2 years ago

We represent visual inputs as keypoint tokens, and robot poses as action tokens. Keypoints are extracted via DINO, projected in 3D, and represented as text. The gripper pose is represented as a triplet of 3D points in the same euclidean space as the keypoints. There's more👇🧵
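The text encoding described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the DINO keypoint extraction and 3D projection steps are omitted, the function names `pose_to_triplet` and `points_to_text` are hypothetical, and the choice of triplet (the gripper origin plus one point along each of two gripper axes) and the rounding to integer centimetres are assumptions made for the example.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def pose_to_triplet(position: Sequence[float],
                    rotation: Sequence[Sequence[float]],
                    offset: float = 0.1) -> List[Point]:
    """Turn a gripper pose into three 3D points (assumed convention).

    `rotation` is a 3x3 matrix whose columns are the gripper's x, y, z axes
    expressed in the same frame as the keypoints.
    """
    x_axis = [rotation[r][0] for r in range(3)]
    y_axis = [rotation[r][1] for r in range(3)]
    p0 = tuple(position)
    p1 = tuple(position[i] + offset * x_axis[i] for i in range(3))
    p2 = tuple(position[i] + offset * y_axis[i] for i in range(3))
    return [p0, p1, p2]


def points_to_text(points: Sequence[Sequence[float]], scale: int = 100) -> str:
    """Serialise 3D points as space-separated integers.

    With scale=100 and inputs in metres, the numbers are centimetres
    (an assumed discretisation, chosen to keep the text short).
    """
    return " ".join(str(int(round(c * scale))) for p in points for c in p)


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
triplet = pose_to_triplet((0.5, 0.0, 0.2), identity)
action_token = points_to_text(triplet)
```

Three non-collinear points fix both position and orientation, which is why a triplet is enough to describe a full gripper pose in the same space as the scene keypoints.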

Norman Di Palo, 2 years ago

Recording a demonstration means storing keypoint tokens from the scene and the action tokens of the trajectory. The numbers are represented as text (tokens), and can therefore go into an LLM prompt. This is where things get interesting... 👇🧵
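As a rough sketch of how stored demonstrations could be laid out in a prompt: each demonstration is a pair of strings (scene keypoint tokens, trajectory action tokens), and the new scene is appended with its output left open. The `Input:`/`Output:` labels and the `build_prompt` helper are illustrative assumptions, not the prompt format used in the paper.

```python
from typing import List, Tuple

# One demonstration: (keypoint tokens of the scene, action tokens of the trajectory).
Demo = Tuple[str, str]


def build_prompt(demos: List[Demo], new_scene_tokens: str) -> str:
    """Concatenate demonstrations as input/output pairs, then the new scene."""
    lines = []
    for scene_tokens, action_tokens in demos:
        lines.append(f"Input: {scene_tokens}")
        lines.append(f"Output: {action_tokens}")
    # The final output is left empty for the LLM to complete.
    lines.append(f"Input: {new_scene_tokens}")
    lines.append("Output:")
    return "\n".join(lines)


demos = [
    ("12 40 3 15 42 3", "50 0 20 60 0 20 50 10 20"),
    ("14 38 3 17 40 3", "52 -2 20 62 -2 20 52 8 20"),
]
prompt = build_prompt(demos, "13 39 3 16 41 3")
```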

Norman Di Palo, 2 years ago

With the Keypoint Action Tokens "language", we unlock the pattern extraction and completion abilities of LLMs, from language to robotics. Given keypoint tokens from a new scene, we demonstrate that LLMs can output an effective trajectory of action tokens! 👇🧵
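On the output side, the text completed by the LLM has to be turned back into poses. A minimal sketch, assuming each waypoint is nine space-separated integers (three 3D points) in centimetres; `parse_action_tokens` is a hypothetical helper, and the call to the LLM itself is deliberately left out.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def parse_action_tokens(text: str, scale: int = 100) -> List[List[Point]]:
    """Parse LLM output into a trajectory of gripper triplets (in metres).

    Raises ValueError if the output is not a whole number of waypoints,
    so malformed completions are rejected instead of sent to the robot.
    """
    values = [int(tok) for tok in text.split()]
    if not values or len(values) % 9 != 0:
        raise ValueError("expected a multiple of 9 integers per trajectory")
    trajectory = []
    for start in range(0, len(values), 9):
        chunk = values[start:start + 9]
        triplet = [tuple(v / scale for v in chunk[j:j + 3])
                   for j in range(0, 9, 3)]
        trajectory.append(triplet)
    return trajectory


trajectory = parse_action_tokens("50 0 20 60 0 20 50 10 20")
```

Validating the completion before execution matters here because nothing constrains the LLM to emit a well-formed trajectory.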

Norman Di Palo, 2 years ago

Our experiments demonstrate that: 1) KAT surpasses SOTA methods like Diffusion Policies (DP); 2) the KAT representation can boost DP performance as well, improving on the end-to-end version. KAT is also robust to distractors, as we show in the third execution of the video. 👇🧵

Norman Di Palo, 2 years ago

A fascinating result is that KAT's performance improves as the underlying LLMs improve. Using GPT-4 Turbo outperforms GPT-3.5 Turbo. This suggests that the improvement and scaling of LLMs can lead to free improvement to robotics, using LLMs as imitation learning machines. 👇🧵

Norman Di Palo, 2 years ago

In-context learning also means *zero training time*, and the ability to deploy and test skills immediately, or store them as text and retrieve them later. Our entire pipeline relies on pre-trained Foundation Models and can be deployed immediately. 👇🧵
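Because a skill is just text, storing and retrieving it needs nothing beyond ordinary file handling. The sketch below assumes one JSON file per skill holding its demonstrations as (keypoint tokens, action tokens) pairs; the file layout and the `save_skill`/`load_skill` names are hypothetical.

```python
import json
from pathlib import Path
from typing import List, Tuple

Demo = Tuple[str, str]  # (keypoint tokens, action tokens)


def save_skill(library_dir: str, name: str, demos: List[Demo]) -> Path:
    """Write a skill's demonstrations to <library_dir>/<name>.json."""
    path = Path(library_dir) / f"{name}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"name": name, "demos": demos}))
    return path


def load_skill(library_dir: str, name: str) -> List[Demo]:
    """Read a skill back; the demos can go straight into a prompt."""
    data = json.loads((Path(library_dir) / f"{name}.json").read_text())
    # JSON has no tuple type, so pairs come back as lists.
    return [tuple(pair) for pair in data["demos"]]
```

Retrieval is then a file lookup followed by prompt construction, with no training step in between.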

Norman Di Palo, 2 years ago

Therefore, we show that one of the most interesting uses of LLMs in robotics is one *without any natural language*. Surprising! 💡 Supervised by @Ed__Johns. Paper: Website with videos:

XXXin, 1 year ago

Does this mean the user gives no prompt, and the robot determines which task to perform based on the initial scene configuration? What comes next?

Norman Di Palo, 1 year ago

This method was trained on single tasks in isolation. Extending it to receive language commands has been done, e.g., here

ऋषिक तिवारी (Rishik Tiwari), 2 years ago

Really interesting. I have lately been trying to do something similar, but with an egocentric camera and affordance maps (inspired by VoxPoser). While KAT may work for top-down RGBD, ego views will be difficult, as the view gets obstructed after grasping the object.

Norman Di Palo, 2 years ago

Interesting! I worked quite a bit with wrist-cams and they offer advantages and disadvantages. I am currently working on some wrist-cam based extensions.

Related videos

Very excited to announce: Keypoint Action Tokens! We found that LLMs can be repurposed as "imitation learning engines" for robots, by representing both observations & actions as 3D keypoints, and feeding into an LLM for in-context learning. See: More 👇

Edward Johns

32,569 views • 2 years ago

Robot AI brains, aka Vision-Language-Action models, cannot adapt to new tasks as easily as LLMs like Gemini, ChatGPT, or Grok. LLMs can adapt quickly with their in-context learning (ICL) capabilities. But can we inject ICL abilities into a pre-trained VLA like pi0? Yes! Introducing RICL (Retraining for In-Context Learning), our Conference on Robot Learning (CoRL) 2025 paper. Our RICL-pi0 model can adapt to unseen objects, novel motions, and new scenes with just ICL and RAG (retrieval-augmented generation). RICL-pi0 also boosts performance on the long-tail of tasks. A quick 1 minute video summary:

Kaustubh Sridhar

52,158 views • 10 months ago

We can teach LLMs to write better robot code through natural language feedback. But can LLMs remember what they were taught and improve their teachability over time? Introducing our latest work, Learning to Learn Faster from Human Feedback with Language Model Predictive Control

Jacky Liang

86,652 views • 2 years ago

NeurIPS 2025 Paper: LLMs are Reinforcement Learners 🤯! Surprisingly, we show that LLMs can solve RL tasks without any external component! We introduce Prompted Policy Search (ProPS), an RL method based only on LLMs and in-context learning. [Paper]

Heni Ben Amor

51,248 views • 7 months ago

Excited to share RoCo: Dialectic Multi-Robot Collaboration with Large Language Models. We propose a novel approach to multi-robot collaboration that leverages LLMs for both high-level communication and low-level path planning. w/ Shreeya Jain, Shuran Song

Mandi Zhao

88,704 views • 2 years ago

With OpenAI, Figure 01 can now have full conversations with people. OpenAI models provide high-level visual and language intelligence; Figure neural networks deliver fast, low-level, dexterous robot actions. Everything in this video is a neural network:

Figure

5,100,815 views • 2 years ago

Should we design a new programming language for LLMs, that LLMs can use more efficiently? Chris Lattner, the creator of Swift and Mojo doesn't think so, and here's why: (cont'd)

Gergely Orosz

15,416 views • 7 months ago

Introducing GRID: the General Robot Intelligence Development platform, designed for prototyping smart and safe robots rapidly using foundation models, LLMs, and simulation. Paper: Try now: GitHub: 🧵👇(1/N)

Sai Vemprala

277,281 views • 2 years ago

Yann LeCun says language isn’t intelligence. Predicting text doesn’t mean understanding reality. The real world is messy, physical, and causal and today’s LLMs barely touch that. The next leap is Physical AI: world models, cause and effect, real planning. Do you think LLMs can evolve into this, or do we need a completely new architecture?

VraserX e/acc

76,154 views • 4 months ago

Your bimanual manipulators might need a Robot Neck 🤖🦒 Introducing Vision in Action: Learning Active Perception from Human Demonstrations. ViA learns task-specific, active perceptual strategies (such as searching, tracking, and focusing) directly from human demos, enabling robust visuomotor policies under visual occlusions. 🧵👇

Haoyu Xiong

122,084 views • 1 year ago

🤖 How can robot policies zero-shot generalize to any new environment and any new object? Introducing our new project: 🚀Data Scaling Laws in Imitation Learning for Robotic Manipulation🚀—bringing us closer to the dream of having robots work as waiters in hot pot restaurants! 🍲

Yang Gao

124,995 views • 1 year ago

This is a single uncut video, showing a robot learning several tasks instantly, after just one demonstration each ... This is possible because we've now been able to achieve in-context learning for everyday robotics tasks, and I'm very excited to announce our latest paper: 🎆 Instant Policy: In-Context Imitation Learning via Graph Diffusion 🎆 (1/6) 🧵👇

Edward Johns

74,680 views • 1 year ago

How can we move beyond static-arm lab setups and learn robot policies in our messy homes? We introduce HoMeR, an imitation learning agent for in-the-wild mobile manipulation. 🧵1/8

Priya Sundaresan

41,835 views • 1 year ago

Large Language Models (LLM) Explained Briefly: the best visual explanation of LLMs that I have seen. Source in the comment below.

Mohit Mishra

15,698 views • 1 year ago

3D-LLM: Injecting the 3D World into Large Language Models paper page: Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning. Powerful as these models can be, they are not grounded in the 3D physical world, which involves richer concepts such as spatial relationships, affordances, physics, layout, and so on. In this work, we propose to inject the 3D world into large language models and introduce a whole new family of 3D-LLMs. Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks, including captioning, dense captioning, 3D question answering, task decomposition, 3D grounding, 3D-assisted dialog, navigation, and so on. Using three types of prompting mechanisms that we design, we are able to collect over 300k 3D-language data covering these tasks. To efficiently train 3D-LLMs, we first utilize a 3D feature extractor that obtains 3D features from rendered multi-view images. Then, we use 2D VLMs as our backbones to train our 3D-LLMs. By introducing a 3D localization mechanism, 3D-LLMs can better capture 3D spatial information. Experiments on ScanQA show that our model outperforms state-of-the-art baselines by a large margin (e.g., the BLEU-1 score surpasses state-of-the-art score by 9%). Furthermore, experiments on our held-in datasets for 3D captioning, task composition, and 3D-assisted dialogue show that our model outperforms 2D VLMs. Qualitative examples also show that our model could perform more tasks beyond the scope of existing LLMs and VLMs.

AK

249,572 views • 2 years ago

Dario Amodei says pre-training sits somewhere between learning and evolution. Humans inherit priors shaped over millions of years. LLMs start as random weights and distill trillions of tokens into those priors. We describe them using human learning metaphors. But the analogy only goes so far.

vitrupo

45,540 views • 4 months ago

Yann LeCun says we're fooled by LLMs because they manipulate language well, and we associate that with intelligence. But language fluency doesn't mean underlying intelligence. Every generation since the 1950s claimed its technique was the ticket to human-level AI. All were wrong. "this generation with LLMs is also wrong"

Haider.

625,457 views • 6 months ago

Quickly turn a GitHub repository into text for LLMs with Gitingest ⚡️ Replace "hub" with "ingest" in any GitHub URL for a text version of the codebase.

Addy Osmani

285,849 views • 1 year ago

Vision-language models perform diverse tasks via in-context learning. Time for robots to do the same! Introducing In-Context Robot Transformer (ICRT): a robot policy that learns new tasks by prompting with robot trajectories, without any fine-tuning. [1/N]

Max Fu

40,435 views • 1 year ago