Robot AI brains, aka Vision-Language-Action (VLA) models, cannot adapt to new tasks as easily as LLMs like Gemini, ChatGPT, or Grok. LLMs can adapt quickly with their in-context learning (ICL) capabilities. But can we inject ICL abilities into a pre-trained VLA like pi0? Yes! Introducing RICL (Retraining for In-Context Learning),...

52,158 views • 9 months ago • via X (Twitter)

Related Videos

A team tested Pi0, Pi0 Fast, Gr00t, and ACT on real robot arms in manufacturing tasks. (🔖 Bookmark this for later!)

The task was precise: place thin rectangular frames from a messy stack into a holder. The team fine-tuned each model on 100 real trajectories and compared training time, inference speed, motion quality, and success rates.

⬇️ Here’s a breakdown of what they found

Pi0 (Original)
✅ Strongest overall performance in precise pick-and-place
✅ High success rate even in edge cases
✅ Longest training time (~11 hours, ~$30 per run)
✅ Inference time of 80 ms causes short pauses between actions
Despite delays, it handles complex scenarios well… solid for high-precision tasks, but slow to train.

Gr00t
✅ Trains fast (~2 hours, ~$5 per run)
✅ Performs almost as well as Pi0 on large-object tasks
✅ Struggles with fine precision; random movement in some trials
✅ More training didn’t fix jitter or random offsets
Best suited for tasks where exact precision isn’t critical. Not ready for manufacturing-grade accuracy without more tuning.

Pi0 Fast
✅ Promised faster training, but results were underwhelming
✅ Training at 6 hours still showed low success rates
✅ Inference was slower than expected
✅ Not reliable for generalizing even slightly new tasks
Currently too unstable for real-world deployment. Doesn’t live up to the “Fast” name yet.

ACT (Baseline)
✅ 200MB model: lightweight, but limited
✅ Struggles with stacked objects or ambiguous scenes
✅ Success rates around 70% in best-case setups
✅ Can’t match newer models on precision or generalization
Still a solid baseline, but clearly a generation behind in robustness.

🚨 Extra Notes
All newer models share a common issue:
• Inference takes longer than a frame (80 ms vs 33 ms), so robots “pause” between chunks.
• This results in jittery movements, but not a dealbreaker unless tasks are time-sensitive.
Language-conditioned tasks also fell short: after training on two labeled tasks, the model couldn’t generalize to a third unseen combination using only text prompts.

✅ The good news? These models adapt well to new robot arms with quick fine-tuning.
❌ The bad news? There’s still no plug-and-play solution for improving performance after deployment.

Reinforcement learning or DAgger-style data collection during real-world operation may be the next big step, something many teams in robotics are actively working on.
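
To put the 80 ms vs 33 ms note in numbers, here is a rough sketch of the timing arithmetic. The two latencies come from the post. The chunk length of 50 actions and the assumption that the robot waits idle while the model computes (blocking inference, one action per frame) are hypothetical and are not reported by the team.

```python
import math


def frames_spanned(inference_ms: float, frame_ms: float) -> int:
    """Whole control frames that elapse during one inference call."""
    return math.ceil(inference_ms / frame_ms)


def idle_fraction(inference_ms: float, frame_ms: float, chunk_len: int) -> float:
    """Share of each cycle spent waiting on the model.

    Assumes blocking inference followed by chunk_len actions,
    executed one per frame (an illustrative model, not the team's setup).
    """
    cycle_ms = inference_ms + chunk_len * frame_ms
    return inference_ms / cycle_ms


INFERENCE_MS = 80.0  # from the post
FRAME_MS = 33.0      # from the post (roughly 30 Hz)
CHUNK_LEN = 50       # hypothetical chunk length, not from the post

if __name__ == "__main__":
    print("frames spanned per inference:", frames_spanned(INFERENCE_MS, FRAME_MS))
    print("idle fraction per cycle:", round(idle_fraction(INFERENCE_MS, FRAME_MS, CHUNK_LEN), 3))
```

Under these assumptions each inference call spans three frames, which matches the short pause between chunks described above. Shorter chunks make the pause a larger share of each cycle.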

Ilir Aliu - eu/acc

21,703 views • 1 year ago

3D-LLM: Injecting the 3D World into Large Language Models paper page:

Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning. Powerful as these models can be, they are not grounded in the 3D physical world, which involves richer concepts such as spatial relationships, affordances, physics, layout, and so on. In this work, we propose to inject the 3D world into large language models and introduce a whole new family of 3D-LLMs. Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks, including captioning, dense captioning, 3D question answering, task decomposition, 3D grounding, 3D-assisted dialog, navigation, and so on. Using three types of prompting mechanisms that we design, we are able to collect over 300k 3D-language data covering these tasks. To efficiently train 3D-LLMs, we first utilize a 3D feature extractor that obtains 3D features from rendered multi-view images. Then, we use 2D VLMs as our backbones to train our 3D-LLMs. By introducing a 3D localization mechanism, 3D-LLMs can better capture 3D spatial information. Experiments on ScanQA show that our model outperforms state-of-the-art baselines by a large margin (e.g., the BLEU-1 score surpasses state-of-the-art score by 9%). Furthermore, experiments on our held-in datasets for 3D captioning, task composition, and 3D-assisted dialogue show that our model outperforms 2D VLMs. Qualitative examples also show that our model could perform more tasks beyond the scope of existing LLMs and VLMs.

AK

249,259 views • 2 years ago