✨ Introducing Keypoint Action Tokens. 🤖 We translate visual observations and robot actions into a "language" that off-the-shelf LLMs can ingest and output. This transforms LLMs into *in-context, low-level imitation learning machines*. 🚀 Let me explain. 👇🧵
23,094 views • 2 years ago • via X (Twitter)
11 comments

We represent visual inputs as keypoint tokens, and robot poses as action tokens. Keypoints are extracted via DINO, projected into 3D, and represented as text. The gripper pose is represented as a triplet of 3D points in the same Euclidean space as the keypoints. There's more👇🧵
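
A minimal sketch of what "represented as text" could look like. The millimetre scaling, the rounding to integers, the 5 cm offset, and the helper names `points_to_text` and `gripper_triplet` are assumptions for illustration, not the exact encoding used in the paper.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def points_to_text(points: Sequence[Point], scale: float = 1000.0) -> str:
    """Serialise 3D points (assumed metres) as space-separated integers.

    The scale factor (metres to millimetres) is an assumption; any fixed
    scaling that keeps the numbers short would serve the same purpose.
    """
    return " ".join(
        str(int(round(coord * scale))) for point in points for coord in point
    )


def gripper_triplet(
    position: Point, x_axis: Point, y_axis: Point, offset: float = 0.05
) -> List[Point]:
    """Describe a gripper pose as three 3D points.

    The first point is the gripper position; the other two are shifted by
    `offset` along two of the gripper's axes, so together the three points
    pin down both position and orientation. The offset is hypothetical.
    """
    px, py, pz = position
    return [
        (px, py, pz),
        (px + offset * x_axis[0], py + offset * x_axis[1], pz + offset * x_axis[2]),
        (px + offset * y_axis[0], py + offset * y_axis[1], pz + offset * y_axis[2]),
    ]


# Scene keypoints and the gripper pose end up in the same textual "space".
keypoint_tokens = points_to_text([(0.1, -0.25, 0.5)])
action_tokens = points_to_text(
    gripper_triplet((0.0, 0.0, 0.1), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
)
```

Using three points instead of a position plus rotation keeps actions in the same coordinate format as the keypoints, which is the property the thread highlights.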

Recording a demonstration means storing keypoint tokens from the scene and the action tokens of the trajectory. The numbers are represented as text (tokens), and can therefore go into an LLM prompt. This is where things get interesting... 👇🧵
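
As a rough illustration of how stored demonstrations could go into a prompt: the `Input:`/`Output:` labels and the `build_prompt` helper below are hypothetical, chosen only to show the few-shot pattern, and are not the prompt template from the paper.

```python
from typing import List, Tuple

# One demonstration: (keypoint tokens, action tokens), both already text.
Demo = Tuple[str, str]


def build_prompt(demos: List[Demo], new_scene: str) -> str:
    """Lay out demonstrations as input/output pairs, then the new scene.

    The final block leaves the output empty so that a text-completion model
    would continue with the action tokens for the new scene.
    """
    blocks = [f"Input: {keypoints}\nOutput: {actions}" for keypoints, actions in demos]
    blocks.append(f"Input: {new_scene}\nOutput:")
    return "\n\n".join(blocks)


prompt = build_prompt(
    demos=[("100 -250 500", "0 0 100 50 0 100 0 50 100")],
    new_scene="120 -240 500",
)
```

Because a demonstration is just a pair of strings, saving a skill amounts to writing text to disk and loading it back later.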

With the Keypoint Action Tokens "language", we unlock the pattern extraction and completion abilities of LLMs, from language to robotics. Given keypoint tokens from a new scene, we demonstrate that LLMs can output an effective trajectory of action tokens! 👇🧵
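
The reverse direction, turning the LLM's text completion back into a trajectory, might look like the sketch below. It assumes each action is nine integers (three 3D points) and that a truncated trailing action should be dropped; both the layout and the `parse_action_tokens` name are assumptions.

```python
import re
from typing import List, Tuple

IntPoint = Tuple[int, ...]
PoseTriplet = Tuple[IntPoint, IntPoint, IntPoint]


def parse_action_tokens(completion: str) -> List[PoseTriplet]:
    """Parse a completion into a list of gripper poses (three 3D points each).

    Any trailing numbers that do not fill a complete nine-number pose are
    ignored, which guards against a completion cut off mid-action.
    """
    numbers = [int(token) for token in re.findall(r"-?\d+", completion)]
    usable = len(numbers) - len(numbers) % 9
    poses: List[PoseTriplet] = []
    for start in range(0, usable, 9):
        chunk = numbers[start:start + 9]
        poses.append((tuple(chunk[0:3]), tuple(chunk[3:6]), tuple(chunk[6:9])))
    return poses


trajectory = parse_action_tokens("0 0 100 50 0 100 0 50 100 1 2")
```

Each parsed triplet would still need to be converted back into a gripper position and orientation before being sent to the robot.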

Our experiments demonstrate that: 1) KAT surpasses SOTA methods like Diffusion Policies (DP); 2) the KAT representation can boost DP performance as well, improving on the end-to-end version. KAT is also robust to distractors, as we show in the third execution of the video. 👇🧵

A fascinating result is that KAT's performance improves as the underlying LLMs improve: GPT-4 Turbo outperforms GPT-3.5 Turbo. This suggests that the improvement and scaling of LLMs can lead to free improvements in robotics, using LLMs as imitation learning machines. 👇🧵

In-context learning also means *zero training time*, and the ability to deploy and test skills immediately, or store them as text and retrieve them later. Our entire pipeline relies on pre-trained Foundation Models and can be deployed immediately. 👇🧵

Therefore, we show that one of the most interesting uses of LLMs in robotics is *without any natural language*. Surprising! 💡 Supervised by @Ed__Johns. Paper: Website with videos:

Does it mean the user gives no prompt but the robot will determine which task to perform based on the initial scene configuration? What comes next?

This method was trained on single tasks in isolation. Extending it to receive language commands has been done, e.g. here

Really interesting. I have lately been trying to do something similar, but with an egocentric cam and affordance maps (inspired by VoxPoser). While KAT may work for top-down RGBD, ego views will be difficult, as the view gets obstructed after grasping the object.

Interesting! I worked quite a bit with wrist-cams and they offer advantages and disadvantages. I am currently working on some wrist-cam based extensions.
