Robot AI brains, aka Vision-Language-Action (VLA) models, cannot adapt to new tasks as easily as LLMs like Gemini, ChatGPT, or Grok. LLMs can adapt quickly thanks to their in-context learning (ICL) capabilities. But can we inject ICL abilities into a pre-trained VLA like pi0? Yes! Introducing RICL (Retraining for In-Context Learning),...

Kaustubh Sridhar

1,529 subscribers

52,158 views • 11 months ago • via X (Twitter)

Science & Technology, Education

Related videos

Excited to finally share Generative Value Learning (GVL), my Google DeepMind project on extracting universal value functions from long-context VLMs via in-context learning! We discovered a simple method to generate zero-shot and few-shot values for 300+ robot tasks and 50+ datasets using SOTA VLMs like Gemini (Try out the demo on our website on your robot video today!) I worked a lot on leveraging foundation models as guidance for robots in my PhD, and to me, this result forges a new frontier in how we can use foundation models for robot learning, given its broad applicability independent of embodiment and task types. Quite excited about how we can build on this work as a community!

Jason Ma

98,090 views • 1 year ago

The robot is learning several novel tasks instantly, after just ONE demonstration each... Instant Policy makes it possible: no extra training, no weight updates, just pure in-context learning. It just got accepted at ICLR 2025, and it’s changing how robots learn. With just a single demo, a robot can pick up a new task and start performing it right away.
Why this is a big deal:
✅ Learns tasks instantly with just one or a few demonstrations
✅ Improves over time as more demonstrations are given
✅ Uses simulation-based training with “pseudo-demonstrations” for scalability
✅ Can transfer skills across different robots and even follow language-defined tasks
It brings in-context learning to robotics, opening up new possibilities for flexible, real-world automation. You can try it yourself: code and weights are available at • • • •
Thank you to Edward Johns, Director of the Robot Learning Lab at Imperial College for sharing their work! 🙏

Ilir Aliu

46,285 views • 1 year ago

Robot Learning needs 4D world models! We introduce TesserAct, a 4D embodied world model that can simulate how agents interact with the 3D world over time! We achieve this by simply extending a pre-trained 2D video generation model to jointly predict RGB, depth, and surface normals. It enables:
1️⃣ Much better policy learning in the wild
2️⃣ Temporal + spatial coherence in 4D dynamic prediction
3️⃣ Novel view synthesis for embodied scenes
Code: Paper Link: Project page:

Chuang Gan

43,265 views • 1 year ago

Introducing Reinforcement-Learned Teachers (RLTs): Transforming how we teach LLMs to reason with reinforcement learning (RL). Blog: Paper: Traditional RL focuses on “learning to solve” challenging problems with expensive LLMs and constitutes a key step in making student AI systems ultimately acquire reasoning capabilities via distillation and cold-starting. Enter our RLTs—a new class of models prompted with not only a problem’s question but also its solution, and directly trained to generate clear, step-by-step “explanations” to teach their students. Remarkably, an RLT with only 7B parameters produces superior results when distilling and cold-starting students in competitive and graduate-level reasoning tasks than orders-of-magnitude larger LLMs. RLTs are as effective even when distilling 32B students, much larger than the teacher itself—unlocking a new standard for efficiency in developing reasoning language models with RL. Code:

Sakana AI

179,276 views • 1 year ago

A team tested Pi0, Pi0 Fast, Gr00t, and ACT on real robot arms in manufacturing tasks. (🔖 Bookmark this for later!) The task was precise: place thin rectangular frames from a messy stack into a holder. The team fine-tuned each model on 100 real trajectories and compared training time, inference speed, motion quality, and success rates.
⬇️ Here’s a breakdown of what they found
Pi0 (Original)
✅ Strongest overall performance in precise pick-and-place
✅ High success rate even in edge cases
✅ Longest training time (~11 hours, ~$30 per run)
✅ Inference time of 80 ms causes short pauses between actions
Despite delays, it handles complex scenarios well… solid for high-precision tasks, but slow to train.
Gr00t
✅ Trains fast (~2 hours, ~$5 per run)
✅ Performs almost as well as Pi0 on large-object tasks
✅ Struggles with fine precision; random movement in some trials
✅ More training didn’t fix jitter or random offsets
Best suited for tasks where exact precision isn’t critical. Not ready for manufacturing-grade accuracy without more tuning.
Pi0 Fast
✅ Promised faster training, but results were underwhelming
✅ Training at 6 hours still showed low success rates
✅ Inference was slower than expected
✅ Not reliable for generalizing even slightly new tasks
Currently too unstable for real-world deployment. Doesn’t live up to the “Fast” name yet.
ACT (Baseline)
✅ 200MB model, lightweight but limited
✅ Struggles with stacked objects or ambiguous scenes
✅ Success rates around 70% in best-case setups
✅ Can’t match newer models on precision or generalization
Still a solid baseline, but clearly a generation behind in robustness.
🚨 Extra Notes
All newer models share a common issue:
• Inference takes longer than a frame (80 ms vs 33 ms), so robots “pause” between chunks.
• This results in jittery movements, but not a dealbreaker unless tasks are time-sensitive.
Language-conditioned tasks also fell short: after training on two labeled tasks, the model couldn’t generalize to a third unseen combination using only text prompts.
✅ The good news? These models adapt well to new robot arms with quick fine-tuning.
❌ The bad news? There’s still no plug-and-play solution for improving performance after deployment. Reinforcement learning or DAgger-style data collection during real-world operation may be the next big step, something many teams in robotics are actively working on.

Ilir Aliu

21,844 views • 1 year ago
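
The latency note in the benchmark post above (80 ms inference against a 33 ms frame) can be made concrete with a little arithmetic. The sketch below is only an illustration under stated assumptions: it assumes the robot executes a full action chunk and then blocks until the next inference finishes, and the chunk length of 10 actions is a hypothetical value that does not come from the post.

```python
import math


def frames_missed(inference_ms: float, frame_ms: float) -> int:
    """Number of control frames that elapse while one inference call runs."""
    return math.ceil(inference_ms / frame_ms)


def idle_fraction(inference_ms: float, frame_ms: float, chunk_len: int) -> float:
    """Share of wall-clock time spent waiting, assuming the robot executes
    chunk_len actions (one per frame) and then blocks for one inference."""
    exec_ms = chunk_len * frame_ms
    return inference_ms / (inference_ms + exec_ms)


# 80 ms and 33 ms are from the post; chunk_len=10 is an assumed example value.
print(frames_missed(80, 33))                 # 3
print(round(idle_fraction(80, 33, 10), 3))   # 0.195
```

Under these assumptions each inference call spans three frame periods, and roughly a fifth of the time is spent paused. Longer chunks or overlapping inference with execution would shrink that share.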

3D-LLM: Injecting the 3D World into Large Language Models paper page: Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning. Powerful as these models can be, they are not grounded in the 3D physical world, which involves richer concepts such as spatial relationships, affordances, physics, layout, and so on. In this work, we propose to inject the 3D world into large language models and introduce a whole new family of 3D-LLMs. Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks, including captioning, dense captioning, 3D question answering, task decomposition, 3D grounding, 3D-assisted dialog, navigation, and so on. Using three types of prompting mechanisms that we design, we are able to collect over 300k 3D-language data covering these tasks. To efficiently train 3D-LLMs, we first utilize a 3D feature extractor that obtains 3D features from rendered multi-view images. Then, we use 2D VLMs as our backbones to train our 3D-LLMs. By introducing a 3D localization mechanism, 3D-LLMs can better capture 3D spatial information. Experiments on ScanQA show that our model outperforms state-of-the-art baselines by a large margin (e.g., the BLEU-1 score surpasses state-of-the-art score by 9%). Furthermore, experiments on our held-in datasets for 3D captioning, task composition, and 3D-assisted dialogue show that our model outperforms 2D VLMs. Qualitative examples also show that our model could perform more tasks beyond the scope of existing LLMs and VLMs.

AK

249,708 views • 3 years ago

Training robot foundation models faces two key hurdles: how to get enough data to train an effective model, and how to make sure that new skills can be acquired quickly. The team at Rhoda AI believes that the answer is training Direct Video Action models from web data. Web data is plentiful, to the point where Rhoda can train their base model on hundreds of years of video data. And then, with the addition of robot data, they can quickly adapt it to new tasks with as little as 20 hours of in-domain data, performing complex, multi-step manipulation tasks with their purpose-built video foundation model. Tongzhou Mu 🤖🦾🦿, Eric Chan, and Changan Chen joined us to talk more about their approach. Watch Episode #79 of RoboPapers, with Michael Cho - Rbt/Acc, Chris Paxton, and Jiafei Duan, to learn more!

RoboPapers

24,475 views • 2 months ago

Our vision is for AI that uses world models to adapt in new and dynamic environments and efficiently learn new skills. We’re sharing V-JEPA 2, a new world model with state-of-the-art performance in visual understanding and prediction. V-JEPA 2 is a 1.2 billion-parameter model, trained on video, that can enable zero-shot planning in robots—allowing them to plan and execute tasks in unfamiliar environments. Learn more about V-JEPA 2 ➡️ As we continue working toward our goal of achieving advanced machine intelligence (AMI), we’re also releasing three new benchmarks for evaluating how well existing models can reason about the physical world from video. Learn more and download the new benchmarks ➡️

AI at Meta

309,942 views • 1 year ago

Tesla Optimus can arrange batteries in their factories, ours can do skincare (on Yuzhe Qin)! We open-source Bunny-VisionPro, a teleoperation system for bimanual hand manipulation. Users can control the robot hands in real time using VisionPro, flexible like a bunny. 🐇 We also have kitchen tasks, playing Rubik's Cube, and dynamic motion tasks. Imitation learning policies are trained on sweeping with a broom, serving a drink, and wiping glasses. Check our website for more details: The project is led by Runyu Ding, Yuzhe Qin, and Jiyue Zhu

Xiaolong Wang

90,902 views • 2 years ago

The next evolution: VLA+ models
Just yesterday Microsoft Research released Rho-alpha (ρα), their first robotics model, built on the Phi family. While most Vision-Language-Action (VLA) models stop at vision and language, Rho-alpha adds:
▪️ Tactile sensing to feel objects during manipulation
▪️ Online learning that lets it improve from human corrections (via teleoperation, 3D mouse or other tools) in real time, even after deployment.
Both additions make adaptability central rather than incidental. Microsoft calls it a VLA+ model, positioning it as an extension beyond what current VLA systems support.
➡️ Today Rho-alpha can control dual-arm robot setups to perform tasks such as:
• Manipulating the BusyBox following natural-language instructions
• Plug insertion
• Toolbox packing and object arrangement with bimanual coordination
But to understand why this "plus" matters, we need to understand what came before. Here, we'll take you through the entire landscape of VLA models – Gemini Robotics, π0, SmolVLA, Helix, ACoT-VLA and others:

Turing Post

62,362 views • 5 months ago

Excited to present the LLM-Council skill. Initial idea by Karpathy. I just packaged it as a skill. You can easily spin up a council of LLMs or agents via Fireworks AI. Watch how the new GLM-5 model "deliberates" on other LLMs' thoughts on the big question, "Can LLMs reason?" Things worth paying attention to: New open models like GLM-5 have surprisingly improved on complex reasoning and long-running agentic tasks. The AskUserQuestion tool in Claude Code came in handy to select the council and chairperson. As Andrej Karpathy puts it, it's a really interesting way to get different perspectives from LLMs, which can lead to better decision-making on whatever task you are working on. You can use it for other agentic coding use cases, like evaluation, tool building, designing, and research.

elvis

39,452 views • 5 months ago

LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models paper page: github: Recent advancements in text-to-image generation with diffusion models have yielded remarkable results synthesizing highly realistic and diverse images. However, these models still encounter difficulties when generating images from prompts that demand spatial or common sense reasoning. We propose to equip diffusion models with enhanced reasoning capabilities by using off-the-shelf pretrained large language models (LLMs) in a novel two-stage generation process. First, we adapt an LLM to be a text-guided layout generator through in-context learning. When provided with an image prompt, an LLM outputs a scene layout in the form of bounding boxes along with corresponding individual descriptions. Second, we steer a diffusion model with a novel controller to generate images conditioned on the layout. Both stages utilize frozen pretrained models without any LLM or diffusion model parameter optimization. We validate the superiority of our design by demonstrating its ability to outperform the base diffusion model in accurately generating images according to prompts that necessitate both language and spatial reasoning. Additionally, our method naturally allows dialog-based scene specification and is able to handle prompts in a language that is not well-supported by the underlying diffusion model.

AK

83,657 views • 3 years ago

MotionGPT: Human Motion as a Foreign Language paper page: Though the advancement of pre-trained large language models unfolds, the exploration of building a unified model for language and other multi-modal data, such as motion, remains challenging and untouched so far. Fortunately, human motion displays a semantic coupling akin to human language, often perceived as a form of body language. By fusing language data with large-scale motion models, motion-language pre-training that can enhance the performance of motion-related tasks becomes feasible. Driven by this insight, we propose MotionGPT, a unified, versatile, and user-friendly motion-language model to handle multiple motion-relevant tasks. Specifically, we employ the discrete vector quantization for human motion and transfer 3D motion into motion tokens, similar to the generation process of word tokens. Building upon this "motion vocabulary", we perform language modeling on both motion and text in a unified manner, treating human motion as a specific language. Moreover, inspired by prompt learning, we pre-train MotionGPT with a mixture of motion-language data and fine-tune it on prompt-based question-and-answer tasks. Extensive experiments demonstrate that MotionGPT achieves state-of-the-art performances on multiple motion tasks including text-driven motion generation, motion captioning, motion prediction, and motion in-between.

AK

125,319 views • 3 years ago

Richard Sutton, father of reinforcement learning, doesn’t think LLMs are bitter-lesson-pilled. My steel man of Richard’s position: we need some new architecture to enable continual (on-the-job) learning. And if we have continual learning, we don't need a special training phase - the agent just learns on-the-fly - like all humans, and indeed, like all animals. This new paradigm will render our current approach with LLMs obsolete. I did my best to represent the view that LLMs will function as the foundation on which this experiential learning can happen. Some sparks flew.
0:00:00 – Are LLMs a dead-end?
0:13:51 – Do humans do imitation learning?
0:23:57 – The Era of Experience
0:34:25 – Current architectures generalize poorly out of distribution
0:42:17 – Surprises in the AI field
0:47:28 – Will The Bitter Lesson still apply after AGI?
0:54:35 – Succession to AI

Dwarkesh Patel

3,081,359 views • 9 months ago

Scaling vision-language-action (VLA) models to high-DoF dexterous hands has long been a "holy grail" challenge due to the high-dimensional action space and data scarcity. As a wrap up of the year 2025, we are releasing GR-Dexter, a holistic hardware-model-data framework for generalist manipulation on a bimanual dexterous-hand robot. This is the first VLA system to achieve: ✅ High-DoF Control: Managing a 56-DoF bimanual system (21-DoF per hand). ✅ Long-Horizon Tasks with tool use: Vacuuming, bread serving with tongs, and table decluttering. ✅ Open-World Generalization: Robust performance with unseen objects and abstract instructions. Project page: ArXiv:

Xiao Ma

93,811 views • 6 months ago

Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale blog: Large-scale generative models such as GPT and DALL-E have revolutionized natural language processing and computer vision research. These models not only generate high fidelity text or image outputs, but are also generalists which can solve tasks not explicitly taught. In contrast, speech generative models are still primitive in terms of scale and task generalization. In this paper, we present Voicebox, the most versatile text-guided generative model for speech at scale. Voicebox is a non-autoregressive flow-matching model trained to infill speech, given audio context and text, trained on over 50K hours of speech that are neither filtered nor enhanced. Similar to GPT, Voicebox can perform many different tasks through in-context learning, but is more flexible as it can also condition on future context. Voicebox can be used for mono or cross-lingual zero-shot text-to-speech synthesis, noise removal, content editing, style conversion, and diverse sample generation. In particular, Voicebox outperforms the state-of-the-art zero-shot TTS model VALL-E on both intelligibility (5.9% vs 1.9% word error rates) and audio similarity (0.580 vs 0.681) while being up to 20 times faster.

AK

429,307 views • 3 years ago

The World Model as NEO's Cognitive Core 1X has revealed a major AI development where the NEO humanoid can translate any natural language prompt into robotic action. It demonstrates this capability even for novel tasks, objects, and environments not found in its robot dataset. - the 1X World Model is trained on internet-scale human interaction videos and fine-tuned with robot data to ground its understanding in physics and in NEO's embodiment - from a simple voice or text prompt, the world model generates a visualization of future actions - a built-in inverse dynamics model then translates these into precise motor movements for NEO

The Humanoid Hub

68,453 views • 6 months ago

In my experience, robot 'generalists' are often jacks of all trades but masters of none. In training across multiple tasks and environments, robot policies fail to generalize robustly and effectively to each particular test setting. What if, at test time, we non-parametrically *retrieved* “relevant” data from the training set and used it to significantly improve few-shot imitation learning, making it robust to various test-time scenes? Notably, we are *not* collecting lots of new data, just training more on sub-components of the same training data! Now, we’re certainly not the first to suggest retrieval, but in our new work, STRAP, we show how retrieving relevant *sub-trajectories* from offline datasets can significantly increase data reuse across tasks, when paired with an appropriate metric space. A 🧵 (1/7)

Abhishek Gupta

12,045 views • 1 year ago

Can vision-language-action (VLA) models generalize to diverse OOD tasks and align with customized objectives? 🤔 🚀 We introduce GRAPE, a plug-and-play algorithm to generalize robot policies via preference alignment. GRAPE unfolds three benefits to boost the generalizability of VLAs: 👉1. GRAPE aligns VLAs on a trajectory level and endows the model with the ability for global decision-making, instead of merely cloning behavior; 👉2. GRAPE implicitly models reward from both successful and failed trials to boost generalizability to diverse tasks; 👉3. GRAPE adopts a scalable preference synthesis algorithm to rank trajectories with preferences that align with arbitrary objectives. Our experiments on a diverse array of real-world and simulated robotic tasks reveal: 1⃣GRAPE enhances the performance of state-of-the-art VLA models, increasing success rates on in-domain and unseen manipulation tasks by 51.79% and 60.36%; 2⃣GRAPE is versatile to be aligned with diverse objectives and reduce collision rates by 44.31% or rollout length by 11.15% when aligning towards safer or more efficient manipulation policy, respectively. Check out our full project for more details: 🔥 Paper: 🔥 Project: 🔥 Code:

Huaxiu Yao

19,988 views • 1 year ago

Ever wondered if we can model motion as a language? Can we tokenize this new language? Is it useful? Turns out, tremendously! 🚀 In our latest #NeurIPS2024 paper on QueST: Self-Supervised Skill Abstractions for Learning Continuous Control, we find that action tokenization matters a lot! We can learn skill encodings by representing temporal action abstractions with a discrete codebook. This enables 2 things: 1. Better Behaviour Cloning: we can better assimilate multi-task data (>9%) over best paper. This is currently best in class BC method! 2. Generalization of this language to represent new tasks in 5-shot transfer to longer horizon tasks! Check out the thread by Atharva Mete for more details. And check out more details at: Joint work with Atharva Mete Albert Wilcox Haotian Xue Yongxin Chen Georgia Tech School of Interactive Computing Machine Learning at Georgia Tech Robotics@GT NVIDIA Robotics

Animesh Garg

26,218 views • 1 year ago