✨ Introducing Keypoint Action Tokens. 🤖 We translate visual observations and robot actions into a "language" that off-the-shelf LLMs can ingest and output. This transforms LLMs into *in-context, low-level imitation learning machines*. 🚀 Let me explain. 👇🧵

23,094 views • 2 years ago • via X (Twitter)

11 comments

Norman Di Palo • 2 years ago

We represent visual inputs as keypoint tokens, and robot poses as action tokens. Keypoints are extracted via DINO, projected into 3D, and represented as text. The gripper pose is represented as a triplet of 3D points in the same Euclidean space as the keypoints. There's more👇🧵
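A minimal sketch of how a gripper pose could be written as a triplet of 3D points and rendered as text alongside scene keypoints. The offset, the centimetre rounding, the comma/space layout, and the function names are all assumptions made for illustration; the thread does not specify the exact format.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def pose_to_triplet(position: Sequence[float],
                    rotation: Sequence[Sequence[float]],
                    offset: float = 0.1) -> List[Point]:
    """Describe a pose with three points: the origin plus one point along each of two axes.

    `rotation` is a 3x3 matrix whose columns are the gripper axes (assumed convention).
    """
    x_axis = [rotation[r][0] for r in range(3)]
    y_axis = [rotation[r][1] for r in range(3)]
    origin = (position[0], position[1], position[2])
    along_x = tuple(position[i] + offset * x_axis[i] for i in range(3))
    along_y = tuple(position[i] + offset * y_axis[i] for i in range(3))
    return [origin, along_x, along_y]


def points_to_text(points: Sequence[Sequence[float]], scale: int = 100) -> str:
    """Render points given in metres as integer text (centimetres when scale=100)."""
    return " ".join(",".join(str(round(c * scale)) for c in p) for p in points)


# Hypothetical values: an identity orientation and two scene keypoints.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
action_token = points_to_text(pose_to_triplet([0.5, 0.0, 0.2], identity))
keypoint_tokens = points_to_text([(0.3, -0.1, 0.0), (0.4, 0.2, 0.05)])
```

Three non-collinear points fix both position and orientation, which is why a triplet is enough to describe the full gripper pose in the same space as the keypoints.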

Norman Di Palo • 2 years ago

Recording a demonstration means storing keypoint tokens from the scene and the action tokens of the trajectory. The numbers are represented as text (tokens), and can therefore go into an LLM prompt. This is where things get interesting... 👇🧵
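As a rough illustration, recorded demonstrations could be laid out as input/output text pairs, with the new scene appended as an open slot for the LLM to complete. The "Input:"/"Output:" labels, the `build_prompt` name, and the example numbers are hypothetical and not taken from the paper.

```python
from typing import List, Tuple


def build_prompt(demos: List[Tuple[str, str]], new_keypoints: str) -> str:
    """Lay out demonstrations as input/output pairs, ending with an open output slot."""
    parts = []
    for keypoints, actions in demos:
        parts.append(f"Input: {keypoints}\nOutput: {actions}")
    # The final block has no output: the model is expected to continue from here.
    parts.append(f"Input: {new_keypoints}\nOutput:")
    return "\n\n".join(parts)


# Each demo is (keypoint tokens, action tokens), both already rendered as text.
demos = [
    ("30,-10,0 40,20,5", "50,0,20 60,0,20 50,10,20"),
    ("32,-8,0 41,22,5", "52,2,20 62,2,20 52,12,20"),
]
prompt = build_prompt(demos, "28,-12,0 39,18,5")
```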

Norman Di Palo • 2 years ago

With the Keypoint Action Tokens "language", we unlock the pattern extraction and completion abilities of LLMs, from language to robotics. Given keypoint tokens from a new scene, we demonstrate that LLMs can output an effective trajectory of action tokens! 👇🧵
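On the output side, the text returned by the LLM has to be turned back into numbers before a robot can execute it. The sketch below assumes a made-up format (one waypoint per line, each written as three `x,y,z` integer triples in centimetres) and skips any line that does not fit; the actual output format is not given in the thread, and no LLM is called here.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def parse_action_tokens(completion: str, scale: int = 100) -> List[List[Point]]:
    """Parse one waypoint per line, each written as three 'x,y,z' integer triples."""
    trajectory = []
    for line in completion.strip().splitlines():
        try:
            points = [tuple(int(v) / scale for v in chunk.split(","))
                      for chunk in line.split()]
        except ValueError:
            continue  # skip lines that are not purely numeric
        # Keep only well-formed waypoints: exactly three points with three coordinates each.
        if len(points) == 3 and all(len(p) == 3 for p in points):
            trajectory.append(points)
    return trajectory


# Hypothetical completion text containing one malformed line.
completion = "50,0,20 60,0,20 50,10,20\nsome stray text\n50,0,10 60,0,10 50,10,10"
trajectory = parse_action_tokens(completion)
```

Validating the shape of every waypoint matters because a text model can emit extra words or truncated lines, and those should never reach the controller.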

Norman Di Palo • 2 years ago

Our experiments demonstrate that: 1) KAT surpasses SOTA methods like Diffusion Policies (DP); 2) the KAT representation can boost DP performance as well, improving on the end-to-end version. KAT is also robust to distractors, as we show in the third execution of the video. 👇🧵

Norman Di Palo • 2 years ago

A fascinating result is that KAT's performance improves as the underlying LLMs improve: GPT-4 Turbo outperforms GPT-3.5 Turbo. This suggests that the improvement and scaling of LLMs can lead to free improvements in robotics, using LLMs as imitation learning machines. 👇🧵

Norman Di Palo • 2 years ago

In-context learning also means *zero training time*, and the ability to deploy and test skills immediately, or store them as text and retrieve them later. Our entire pipeline relies on pre-trained Foundation Models and can be deployed immediately. 👇🧵
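Because a skill is just text, storing and retrieving it can be as simple as writing the demonstrations to a file. A minimal sketch, assuming a JSON file per skill; the `save_skill`/`load_skill` helpers, the `pick_cup` name, and the record layout are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def save_skill(directory: Path, name: str, demos: list) -> Path:
    """Write a skill's demonstrations (already rendered as text) to a JSON file."""
    path = directory / f"{name}.json"
    path.write_text(json.dumps(demos))
    return path


def load_skill(directory: Path, name: str) -> list:
    """Read a skill's demonstrations back, ready to be placed in a prompt."""
    return json.loads((directory / f"{name}.json").read_text())


# A temporary directory stands in for a persistent skill library.
with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp)
    demos = [{"keypoints": "30,-10,0 40,20,5", "actions": "50,0,20 60,0,20 50,10,20"}]
    save_skill(store, "pick_cup", demos)
    restored = load_skill(store, "pick_cup")
```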

Norman Di Palo • 2 years ago

Therefore, we show that one of the most interesting uses of LLMs in robotics is *without any natural language*. Surprising! 💡 Supervised by @Ed__Johns. Paper: Website with videos:

XXXin • 1 year ago

Does it mean the user gives no prompt, but the robot will determine which task to perform based on the initial scene configuration? What comes next?

Norman Di Palo • 1 year ago

This method was trained on single tasks in isolation. Extending it to receive language commands has been done, e.g. here

ऋषिक तिवारी (Rishik Tiwari) • 2 years ago

Really interesting. I have lately been trying to do something similar, but with an egocentric cam and affordance maps (inspired by VoxPoser). While KAT may work for top-down RGB-D, ego views will be difficult, as the view gets obstructed after grasping the object.

Norman Di Palo • 2 years ago

Interesting! I worked quite a bit with wrist-cams and they offer advantages and disadvantages. I am currently working on some wrist-cam based extensions.

Related videos

3D-LLM: Injecting the 3D World into Large Language Models paper page: Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning. Powerful as these models can be, they are not grounded in the 3D physical world, which involves richer concepts such as spatial relationships, affordances, physics, layout, and so on. In this work, we propose to inject the 3D world into large language models and introduce a whole new family of 3D-LLMs. Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks, including captioning, dense captioning, 3D question answering, task decomposition, 3D grounding, 3D-assisted dialog, navigation, and so on. Using three types of prompting mechanisms that we design, we are able to collect over 300k 3D-language data covering these tasks. To efficiently train 3D-LLMs, we first utilize a 3D feature extractor that obtains 3D features from rendered multi-view images. Then, we use 2D VLMs as our backbones to train our 3D-LLMs. By introducing a 3D localization mechanism, 3D-LLMs can better capture 3D spatial information. Experiments on ScanQA show that our model outperforms state-of-the-art baselines by a large margin (e.g., the BLEU-1 score surpasses state-of-the-art score by 9%). Furthermore, experiments on our held-in datasets for 3D captioning, task composition, and 3D-assisted dialogue show that our model outperforms 2D VLMs. Qualitative examples also show that our model could perform more tasks beyond the scope of existing LLMs and VLMs.

AK

249,572 views • 2 years ago