✨ Introducing Keypoint Action Tokens. 🤖 We translate visual observations and robot actions into a "language" that off-the-shelf LLMs can ingest and output. This transforms LLMs into *in-context, low-level imitation learning machines*. 🚀 Let me explain. 👇🧵

23,095 views • 2 years ago • via X (Twitter)

11 Comments

Norman Di Palo • 2 years ago

We represent visual inputs as keypoint tokens, and robot poses as action tokens. Keypoints are extracted via DINO, projected into 3D, and represented as text. The gripper pose is represented as a triplet of 3D points in the same Euclidean space as the keypoints. There's more 👇🧵
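To make the pose-as-triplet idea concrete, here is a minimal Python sketch. It is an illustration under assumptions, not the authors' code: the function names `pose_to_triplet` and `points_to_text`, the choice of the two extra points (offsets along the gripper's local x and y axes), the 0.1 m offset, and the rounding to integer centimetres are all hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def pose_to_triplet(
    position: Sequence[float],
    rotation: Sequence[Sequence[float]],
    offset: float = 0.1,
) -> List[Point]:
    """Encode a 6-DoF pose as three 3D points (assumed layout).

    The points are the gripper origin plus one point along the local
    x axis and one along the local y axis. Three non-collinear points
    fix both position and orientation.
    """
    origin = (position[0], position[1], position[2])
    # Columns of the rotation matrix are the local axes in world coordinates.
    x_axis = (rotation[0][0], rotation[1][0], rotation[2][0])
    y_axis = (rotation[0][1], rotation[1][1], rotation[2][1])
    along_x = tuple(p + offset * a for p, a in zip(origin, x_axis))
    along_y = tuple(p + offset * a for p, a in zip(origin, y_axis))
    return [origin, along_x, along_y]


def points_to_text(points: Sequence[Sequence[float]], scale: int = 100) -> str:
    """Serialize 3D points as space-separated integers (assumed format)."""
    return " ".join(str(round(c * scale)) for p in points for c in p)


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
triplet = pose_to_triplet([0.5, 0.0, 0.2], identity)
print(points_to_text(triplet))  # prints "50 0 20 60 0 20 50 10 20"
```

The same `points_to_text` helper could serialize scene keypoints, so that observations and actions share one coordinate frame and one textual format.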

Norman Di Palo • 2 years ago

Recording a demonstration means storing keypoint tokens from the scene and the action tokens of the trajectory. The numbers are represented as text (tokens), and can therefore go into an LLM prompt. This is where things get interesting... 👇🧵
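As a rough sketch of how recorded demonstrations could be laid out in a prompt, consider the Python below. The `Inputs:` / `Outputs:` labels, the `build_prompt` name, and the one-line-per-field layout are assumptions made for illustration; the thread does not specify the actual prompt format.

```python
from typing import List, Tuple

# A demonstration is a pair of strings that are already serialized as text:
# (keypoint tokens of the scene, action tokens of the trajectory).
Demo = Tuple[str, str]


def build_prompt(demos: List[Demo], new_scene_keypoints: str) -> str:
    """Lay out demonstrations as input/output pairs, then the new scene.

    The prompt ends with an open "Outputs:" line so that the LLM's
    completion is the action-token trajectory for the new scene.
    """
    lines = []
    for keypoints, actions in demos:
        lines.append(f"Inputs: {keypoints}")
        lines.append(f"Outputs: {actions}")
    lines.append(f"Inputs: {new_scene_keypoints}")
    lines.append("Outputs:")
    return "\n".join(lines)


demos = [
    ("10 20 5 30 22 5", "50 0 20 60 0 20 50 10 20"),
    ("12 18 5 32 20 5", "52 -2 20 62 -2 20 52 8 20"),
]
print(build_prompt(demos, "11 19 5 31 21 5"))
```

Because the prompt is plain text, no natural-language task description is required; the pattern in the number sequences carries the task.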

Norman Di Palo • 2 years ago

With the Keypoint Action Tokens "language", we unlock the pattern extraction and completion abilities of LLMs, from language to robotics. Given keypoint tokens from a new scene, we demonstrate that LLMs can output an effective trajectory of action tokens! 👇🧵
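Going the other way, the LLM's text completion has to be turned back into a trajectory. The sketch below assumes that each waypoint is written as nine integers (three 3D points, matching the triplet representation described earlier in the thread); the `parse_action_tokens` name and the integer-only format are hypothetical.

```python
import re
from typing import List, Tuple

Point = Tuple[int, ...]
Waypoint = Tuple[Point, Point, Point]

VALUES_PER_WAYPOINT = 9  # three 3D points per gripper pose (assumed layout)


def parse_action_tokens(completion: str) -> List[Waypoint]:
    """Parse an LLM completion into a list of point-triplet waypoints.

    Raises ValueError if the number of integers is not a multiple of 9,
    which is a cheap check against truncated or malformed completions.
    """
    numbers = [int(tok) for tok in re.findall(r"-?\d+", completion)]
    if len(numbers) % VALUES_PER_WAYPOINT != 0:
        raise ValueError(
            f"expected a multiple of {VALUES_PER_WAYPOINT} integers, "
            f"got {len(numbers)}"
        )
    waypoints: List[Waypoint] = []
    for i in range(0, len(numbers), VALUES_PER_WAYPOINT):
        chunk = numbers[i : i + VALUES_PER_WAYPOINT]
        waypoints.append(
            (tuple(chunk[0:3]), tuple(chunk[3:6]), tuple(chunk[6:9]))
        )
    return waypoints


trajectory = parse_action_tokens("50 0 20 60 0 20 50 10 20\n50 0 10 60 0 10 50 10 10")
print(len(trajectory))  # prints 2
```

Validating the completion before sending anything to the robot matters in practice, since an LLM can stop early or emit stray tokens.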

Norman Di Palo • 2 years ago

Our experiments demonstrate that: 1) KAT surpasses SOTA methods like Diffusion Policies (DP); 2) the KAT representation can boost DP performance as well, improving on the end-to-end version. KAT is also robust to distractors, as we show in the third execution in the video. 👇🧵

Norman Di Palo • 2 years ago

A fascinating result is that KAT's performance improves as the underlying LLMs improve: GPT-4 Turbo outperforms GPT-3.5 Turbo. This suggests that the improvement and scaling of LLMs can lead to free improvements in robotics, using LLMs as imitation learning machines. 👇🧵

Norman Di Palo • 2 years ago

In-context learning also means *zero training time*, and the ability to deploy and test skills immediately, or store them as text and retrieve them later. Our entire pipeline relies on pre-trained Foundation Models and can be deployed immediately. 👇🧵

Norman Di Palo • 2 years ago

Therefore, we show that one of the most interesting uses of LLMs in robotics is *without any natural language*. Surprising! 💡 Supervised by @Ed__Johns. Paper: Website with videos:

XXXin • 1 year ago

Does this mean the user gives no prompt, and the robot determines which task to perform based on the initial scene configuration? What comes next?

Norman Di Palo • 1 year ago

This method was trained on single tasks in isolation. Extending it to receive language commands has been done, e.g., here

ऋषिक तिवारी (Rishik Tiwari) • 2 years ago

Really interesting. Lately I have been trying to do something similar, but with an egocentric cam and affordance maps (inspired by VoxPoser). While KAT may work for top-down RGB-D, ego views will be difficult, as the view gets obstructed after grasping the object.

Norman Di Palo • 2 years ago

Interesting! I worked quite a bit with wrist-cams and they offer advantages and disadvantages. I am currently working on some wrist-cam based extensions.
