Loading video...

Video Failed to Load

There was a problem loading this video. This could be due to a temporary network issue or the video might be unavailable.

Can vision-language-action (VLA) models generalize to diverse OOD tasks and align with customized objectives? 🤔 🚀 We introduce GRAPE, a plug-and-play algorithm to generalize robot policies via preference alignment. GRAPE unfolds three benefits to boost the generalizability of VLAs: 👉1. GRAPE aligns VLAs on a trajectory level and endows... the model with the ability for global decision-making, instead of merely cloning behavior; 👉2. GRAPE implicitly models reward from both successful and failed trials to boost generalizability to diverse tasks; 👉3. GRAPE adopts a scalable preference synthesis algorithm to rank trajectories with preferences that align with arbitrary objectives. Our experiments on a diverse array of real-world and simulated robotic tasks reveal: 1⃣GRAPE enhances the performance of state-of-the-art VLA models, increasing success rates on in-domain and unseen manipulation tasks by 51.79% and 60.36%; 2⃣GRAPE is versatile to be aligned with diverse objectives and reduce collision rates by 44.31% or rollout length by 11.15% when aligning towards safer or more efficient manipulation policy, respectively. Check out our full project for more details: 🔥 Paper: 🔥 Project: 🔥 Code:show more

Huaxiu Yao

4,814 subscribers

19,988 views • 1 year ago •via X (Twitter)

Science & Technology Health & Wellness Education

Anya Rossi• Live Now

Private livecam show

7 Comments

Huaxiu Yao1 year ago

[2/N] Detailed Method 1️⃣ Trajectory-wise Preference Optimization: GRAPE scales up step-wise VLAs and trains with a trajectory-wise objective, aligning policies globally by learning from both successes and failures. 2️⃣Customized Preference Synthesis: GRAPE breaks down complex tasks into stages, guided by spatiotemporal constraints from VL models. Flexibly aligns for arbitrary objectives, such as safety, efficiency, or task success. 3️⃣ Iterative Online Alignment: GRAPE refines the alignment process through iterative cycles of 1) online sample collection, 2) synthetic preference ranking, and 3) trajectory-wise preference optimization.

Huaxiu Yao1 year ago

[3/N] Empirical Takeaway 1: Stronger generalizability on a wide array of OOD tasks. 1️⃣ Real-world OOD tasks GRAPE crushes OpenVLA-SFT in generalization: - Visual (new visual environments) 🌆: +20.7% - Subject (unseen objects) 🔍: +27.5% - Action (unseen actions)🏃: +10.0% - Semantic (unseen prompts)🧠: +5.0% - Language grounding (objects in unseen spatial positions)🌍: +26.7% 2️⃣ Simulation OOD tasks In Simpler-Env, GRAPE shines: - Subject (unseen objects) 🔍: +8.0% - Physical (unseen object sizes/shapes) 🏗️: +12.3% - Semantic (unseen prompts)🧠: +19.0%

Huaxiu Yao1 year ago

[4/N] Empirical Takeaway 2: Versatility to align towards customized alignment objectives. GRAPE excels at aligning robot policies with diverse natural language goals: ✅ Task completion ✅ Safety ✅ Cost-efficiency Results: - 🚧 Safer policies: -44.31% collisions - ⏳ Efficient policies: -11.15% rollout lengths

Huaxiu Yao1 year ago

[5/N] Nice work, @ZijianZhangNLP , Kyle Zheng, and nice collab. w/ @ZRChen_AISafety , @jang_yoel , @Yi_Li_UW , @chaoqi_w , @dingmyu , @fox_dieter17849

ZijianZhangNLP1 year ago

Cool work！Thank Prof. Yao and our nice collab!

zihao zheng1 year ago

what is the required training resources？

Daniel Butler1 year ago

GRAPE seems like a promising leap for VLA models in robotics! The trajectory-level preference alignment and reward modeling are particularly intriguing for enabling safer, more efficient, and task-diverse applications. How scalable is GRAPE to real-world multi-agent environments?

Related Videos

Vision-language models perform diverse tasks via in-context learning. Time for robots to do the same! Introducing In-Context Robot Transformer (ICRT): a robot policy that learns new tasks by prompting with robot trajectories, without any fine-tuning. [1/N]

Vision-language models perform diverse tasks via in-context learning. Time for robots to do the same! Introducing In-Context Robot Transformer (ICRT): a robot policy that learns new tasks by prompting with robot trajectories, without any fine-tuning. [1/N]

Max Fu

40,435 views • 1 year ago

The Clone hand is a compact, biomimetic robotic hand designed to mimic human anatomy and perform a diverse array of manipulation tasks. Credit: Clone Robotics

The Clone hand is a compact, biomimetic robotic hand designed to mimic human anatomy and perform a diverse array of manipulation tasks. Credit: Clone Robotics

Tech Burrito

13,799 views • 1 year ago

Real-world robot data is expensive and slow to collect, creating a major challenge for humanoid development. 🤖 The NVIDIA GR00T N1.6 open vision language action model is pre-trained on a diverse mix of data, including thousands of hours of Stanford Vision and Learning Lab’s BEHAVIOR simulation data, which covers long-horizon everyday manipulation tasks. This diverse training is the key to robust cross-embodiment performance and real-world adaptability. 🌍 Read the blog 🔗

Real-world robot data is expensive and slow to collect, creating a major challenge for humanoid development. 🤖 The NVIDIA GR00T N1.6 open vision language action model is pre-trained on a diverse mix of data, including thousands of hours of Stanford Vision and Learning Lab’s BEHAVIOR simulation data, which covers long-horizon everyday manipulation tasks. This diverse training is the key to robust cross-embodiment performance and real-world adaptability. 🌍 Read the blog 🔗

NVIDIA Robotics

13,421 views • 5 months ago

We are releasing 4M-21 with a permissive license, including its source code and trained models. It's a pretty effective multimodal model that solves 10s of tasks & modalities. See the demo code, sample results, and the tokenizers of diverse modalities on the website. IMO, the multitask learning aspect of multimodal models has really taken a step forward. We can train a single model on many diverse tasks with ~SOTA accuracy. But a long way to go in terms of transfer/emergence. 🌐 ⌨️ Joint work w/ EPFL Apple.

We are releasing 4M-21 with a permissive license, including its source code and trained models. It's a pretty effective multimodal model that solves 10s of tasks & modalities. See the demo code, sample results, and the tokenizers of diverse modalities on the website. IMO, the multitask learning aspect of multimodal models has really taken a step forward. We can train a single model on many diverse tasks with ~SOTA accuracy. But a long way to go in terms of transfer/emergence. 🌐 ⌨️ Joint work w/ EPFL Apple.

Amir Zamir

69,241 views • 2 years ago

🚀🚀🚀 Ever wondered what it takes for robots to handle real-world household tasks? long-horizon execution, deformable object dexterity, and unseen object generalization — meet GR-3, ByteDance Seed’s new Vision-Language-Action (VLA) model! GR-3 is a generalizable Vision-Language-Action (VLA) model with strong capabilities in complex long-horizon tasks. It understands unseen abstract concepts, manipulates deformable objects robustly, and adapts to novel settings with minimal human data. ✨ Generalization: Generalizes well to unseen objects, environments, and even instructions with abstract concepts. ✨ Long-Horizon Manipulation: Completes long-horizon tasks with strong instruction-following capabilities. ✨ Deformable Object Manipulation: Manipulate deformable objects robustly. Project Page: Arxiv: #ByteDance #ByteDanceSeed #GR3 #VLA #Robotics #FoundationModels

🚀🚀🚀 Ever wondered what it takes for robots to handle real-world household tasks? long-horizon execution, deformable object dexterity, and unseen object generalization — meet GR-3, ByteDance Seed’s new Vision-Language-Action (VLA) model! GR-3 is a generalizable Vision-Language-Action (VLA) model with strong capabilities in complex long-horizon tasks. It understands unseen abstract concepts, manipulates deformable objects robustly, and adapts to novel settings with minimal human data. ✨ Generalization: Generalizes well to unseen objects, environments, and even instructions with abstract concepts. ✨ Long-Horizon Manipulation: Completes long-horizon tasks with strong instruction-following capabilities. ✨ Deformable Object Manipulation: Manipulate deformable objects robustly. Project Page: Arxiv: #ByteDance #ByteDanceSeed #GR3 #VLA #Robotics #FoundationModels

Xiao Ma

46,260 views • 11 months ago

Big thanks to AK for highlighting our work! LEO marks our pioneering step towards building an embodied generalist agent that can really comprehend the 3D world! 🚀Leveraging LLMs, we train LEO with real and synthetic 3D data across a diverse spectrum of tasks. It's thrilling to see LEO surpass current state-of-the-art SOTA methods in most benchmarked tasks, all under a single, unified model. 🔥 #Generalist_Agent

Big thanks to AK for highlighting our work! LEO marks our pioneering step towards building an embodied generalist agent that can really comprehend the 3D world! 🚀Leveraging LLMs, we train LEO with real and synthetic 3D data across a diverse spectrum of tasks. It's thrilling to see LEO surpass current state-of-the-art SOTA methods in most benchmarked tasks, all under a single, unified model. 🔥 #Generalist_Agent

Siyuan Huang

22,710 views • 2 years ago

NVIDIA researchers present SONIC, a generalist humanoid controller: It scales motion tracking on a single policy to achieve natural, robust whole-body movement. The scalable foundation avoids manual reward engineering and features a universal token space and kinematic planner to seamlessly integrate diverse inputs, including VR/video teleoperation and VLA foundation models, for real-world tasks.

NVIDIA researchers present SONIC, a generalist humanoid controller: It scales motion tracking on a single policy to achieve natural, robust whole-body movement. The scalable foundation avoids manual reward engineering and features a universal token space and kinematic planner to seamlessly integrate diverse inputs, including VR/video teleoperation and VLA foundation models, for real-world tasks.

The Humanoid Hub

31,445 views • 7 months ago

🧐Applying world models to improve real-world policy on challenging manipulation tasks used to be considered out of reach. 😌After sustained effort, we’re now seeing encouraging progress. 🚀Thrilled to introduce RISE: Self-Improving Robot Policy with Compositional World Model RISE is, to our knowledge, the first work to use a world model as an effective learning environment for challenging real-world manipulation, enabling policy improvement on tasks that demand high dynamics, dexterity, and precision. Incredible teamwork with Kunyang Lin Hongyang Li Yue Xiangyu Hao Zhao Chonghao Sima

🧐Applying world models to improve real-world policy on challenging manipulation tasks used to be considered out of reach. 😌After sustained effort, we’re now seeing encouraging progress. 🚀Thrilled to introduce RISE: Self-Improving Robot Policy with Compositional World Model RISE is, to our knowledge, the first work to use a world model as an effective learning environment for challenging real-world manipulation, enabling policy improvement on tasks that demand high dynamics, dexterity, and precision. Incredible teamwork with Kunyang Lin Hongyang Li Yue Xiangyu Hao Zhao Chonghao Sima

Jiazhi Yang

75,955 views • 4 months ago

Not the flashiest demos, but what’s under the hood represents a foundational shift for general-purpose robotics. World models are the next-gen foundation of Physical AI, not the VLM backbones found in typical VLAs. DreamZero is a 14B-parameter World Action Model (WAM) by NVIDIA that treats robotics as a joint video-and-action prediction task. Unlike traditional Vision-Language-Action (VLA) models that map images directly to motor commands, DreamZero leverages a pretrained video diffusion backbone to predict future world states and actions simultaneously. - achieves 2× better zero-shot generalization to unseen tasks and environments compared to state-of-the-art VLAs. - learns effectively from heterogeneous, non-repetitive data (500 hours), breaking the need for thousands of repeated demonstrations. - adapts to new robot embodiments with just 30 minutes of play data. - enables 7Hz closed-loop control via system optimizations and "DreamZero-Flash," making high-capacity diffusion models viable for real-time use.

Not the flashiest demos, but what’s under the hood represents a foundational shift for general-purpose robotics. World models are the next-gen foundation of Physical AI, not the VLM backbones found in typical VLAs. DreamZero is a 14B-parameter World Action Model (WAM) by NVIDIA that treats robotics as a joint video-and-action prediction task. Unlike traditional Vision-Language-Action (VLA) models that map images directly to motor commands, DreamZero leverages a pretrained video diffusion backbone to predict future world states and actions simultaneously. - achieves 2× better zero-shot generalization to unseen tasks and environments compared to state-of-the-art VLAs. - learns effectively from heterogeneous, non-repetitive data (500 hours), breaking the need for thousands of repeated demonstrations. - adapts to new robot embodiments with just 30 minutes of play data. - enables 7Hz closed-loop control via system optimizations and "DreamZero-Flash," making high-capacity diffusion models viable for real-time use.

The Humanoid Hub

35,204 views • 4 months ago

We’re redefining what’s possible with AI. With the release of our latest model, Command A, optimized for real-world agentic and multilingual tasks, we’re demonstrating our commitment to bringing enterprises AI that goes beyond the ordinary, and offers security & efficiency. Our team has developed highly capable and efficient models that can be run on just 2 GPUs. Check out our tech report to learn more:

We’re redefining what’s possible with AI. With the release of our latest model, Command A, optimized for real-world agentic and multilingual tasks, we’re demonstrating our commitment to bringing enterprises AI that goes beyond the ordinary, and offers security & efficiency. Our team has developed highly capable and efficient models that can be run on just 2 GPUs. Check out our tech report to learn more:

Cohere

16,413 views • 1 year ago

Robot AI brains, aka Vision-Language-Action models, cannot adapt to new tasks as easily as LLMs like Gemini, ChatGPT, or Grok. LLMs can adapt quickly with their in-context learning (ICL) capabilities. But can we inject ICL abilities into a pre-trained VLA like pi0? Yes! Introducing RICL (Retraining for In-Context Learning), our Conference on Robot Learning (CoRL) 2025 paper. Our RICL-pi0 model can adapt to unseen objects, novel motions, and new scenes with just ICL and RAG (retrieval-augmented generation). RICL-pi0 also boosts performance on the long-tail of tasks. A quick 1 minute video summary:

Robot AI brains, aka Vision-Language-Action models, cannot adapt to new tasks as easily as LLMs like Gemini, ChatGPT, or Grok. LLMs can adapt quickly with their in-context learning (ICL) capabilities. But can we inject ICL abilities into a pre-trained VLA like pi0? Yes! Introducing RICL (Retraining for In-Context Learning), our Conference on Robot Learning (CoRL) 2025 paper. Our RICL-pi0 model can adapt to unseen objects, novel motions, and new scenes with just ICL and RAG (retrieval-augmented generation). RICL-pi0 also boosts performance on the long-tail of tasks. A quick 1 minute video summary:

Kaustubh Sridhar

52,158 views • 10 months ago

Scaling vision-language-action (VLA) models to high-DoF dexterous hands has long been a "holy grail" challenge due to the high-dimensional action space and data scarcity. As a wrap up of the year 2025, we are releasing GR-Dexter, a holistic hardware-model-data framework for generalist manipulation on a bimanual dexterous-hand robot. This is the first VLA system to achieve: ✅ High-DoF Control: Managing a 56-DoF bimanual system (21-DoF per hand). ✅ Long-Horizon Tasks with tool use: Vacuuming, bread serving with tongs, and table decluttering. ✅ Open-World Generalization: Robust performance with unseen objects and abstract instructions. Project page: ArXiv:

Scaling vision-language-action (VLA) models to high-DoF dexterous hands has long been a "holy grail" challenge due to the high-dimensional action space and data scarcity. As a wrap up of the year 2025, we are releasing GR-Dexter, a holistic hardware-model-data framework for generalist manipulation on a bimanual dexterous-hand robot. This is the first VLA system to achieve: ✅ High-DoF Control: Managing a 56-DoF bimanual system (21-DoF per hand). ✅ Long-Horizon Tasks with tool use: Vacuuming, bread serving with tongs, and table decluttering. ✅ Open-World Generalization: Robust performance with unseen objects and abstract instructions. Project page: ArXiv:

Xiao Ma

93,692 views • 5 months ago

VLAExplain — Interpreting Vision-Language-Action (VLA) Models VLAExplain is an interpretability toolkit designed to help users visually understand the inner workings of Vision-Language-Action (VLA) models. Currently, attention analysis is supported for both the pi05 and unifolm-vla models. For details, please check pi05 and UnifoLM-VLA readme files respectively. Demo of pi05 in action:

VLAExplain — Interpreting Vision-Language-Action (VLA) Models VLAExplain is an interpretability toolkit designed to help users visually understand the inner workings of Vision-Language-Action (VLA) models. Currently, attention analysis is supported for both the pi05 and unifolm-vla models. For details, please check pi05 and UnifoLM-VLA readme files respectively. Demo of pi05 in action:

Ryohei Sasaki@engineer

12,774 views • 2 months ago

3D-LLM: Injecting the 3D World into Large Language Models paper page: Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning. Powerful as these models can be, they are not grounded in the 3D physical world, which involves richer concepts such as spatial relationships, affordances, physics, layout, and so on. In this work, we propose to inject the 3D world into large language models and introduce a whole new family of 3D-LLMs. Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks, including captioning, dense captioning, 3D question answering, task decomposition, 3D grounding, 3D-assisted dialog, navigation, and so on. Using three types of prompting mechanisms that we design, we are able to collect over 300k 3D-language data covering these tasks. To efficiently train 3D-LLMs, we first utilize a 3D feature extractor that obtains 3D features from rendered multi- view images. Then, we use 2D VLMs as our backbones to train our 3D-LLMs. By introducing a 3D localization mechanism, 3D-LLMs can better capture 3D spatial information. Experiments on ScanQA show that our model outperforms state-of-the-art baselines by a large margin (e.g., the BLEU-1 score surpasses state-of-the-art score by 9%). Furthermore, experiments on our held-in datasets for 3D captioning, task composition, and 3D-assisted dialogue show that our model outperforms 2D VLMs. Qualitative examples also show that our model could perform more tasks beyond the scope of existing LLMs and VLMs.

3D-LLM: Injecting the 3D World into Large Language Models paper page: Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning. Powerful as these models can be, they are not grounded in the 3D physical world, which involves richer concepts such as spatial relationships, affordances, physics, layout, and so on. In this work, we propose to inject the 3D world into large language models and introduce a whole new family of 3D-LLMs. Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks, including captioning, dense captioning, 3D question answering, task decomposition, 3D grounding, 3D-assisted dialog, navigation, and so on. Using three types of prompting mechanisms that we design, we are able to collect over 300k 3D-language data covering these tasks. To efficiently train 3D-LLMs, we first utilize a 3D feature extractor that obtains 3D features from rendered multi- view images. Then, we use 2D VLMs as our backbones to train our 3D-LLMs. By introducing a 3D localization mechanism, 3D-LLMs can better capture 3D spatial information. Experiments on ScanQA show that our model outperforms state-of-the-art baselines by a large margin (e.g., the BLEU-1 score surpasses state-of-the-art score by 9%). Furthermore, experiments on our held-in datasets for 3D captioning, task composition, and 3D-assisted dialogue show that our model outperforms 2D VLMs. Qualitative examples also show that our model could perform more tasks beyond the scope of existing LLMs and VLMs.

AK

249,572 views • 2 years ago

Meta presents Sapiens Foundation for Human Vision Models discuss: We present Sapiens, a family of models for four fundamental human-centric vision tasks - 2D pose estimation, body-part segmentation, depth estimation, and surface normal prediction. Our models natively support 1K high-resolution inference and are extremely easy to adapt for individual tasks by simply fine-tuning models pretrained on over 300 million in-the-wild human images. We observe that, given the same computational budget, self-supervised pretraining on a curated dataset of human images significantly boosts the performance for a diverse set of human-centric tasks. The resulting models exhibit remarkable generalization to in-the-wild data, even when labeled data is scarce or entirely synthetic. Our simple model design also brings scalability - model performance across tasks improves as we scale the number of parameters from 0.3 to 2 billion. Sapiens consistently surpasses existing baselines across various human-centric benchmarks. We achieve significant improvements over the prior state-of-the-art on Humans-5K (pose) by 7.6 mAP, Humans-2K (part-seg) by 17.1 mIoU, Hi4D (depth) by 22.4% relative RMSE, and THuman2 (normal) by 53.5% relative angular error.

Meta presents Sapiens Foundation for Human Vision Models discuss: We present Sapiens, a family of models for four fundamental human-centric vision tasks - 2D pose estimation, body-part segmentation, depth estimation, and surface normal prediction. Our models natively support 1K high-resolution inference and are extremely easy to adapt for individual tasks by simply fine-tuning models pretrained on over 300 million in-the-wild human images. We observe that, given the same computational budget, self-supervised pretraining on a curated dataset of human images significantly boosts the performance for a diverse set of human-centric tasks. The resulting models exhibit remarkable generalization to in-the-wild data, even when labeled data is scarce or entirely synthetic. Our simple model design also brings scalability - model performance across tasks improves as we scale the number of parameters from 0.3 to 2 billion. Sapiens consistently surpasses existing baselines across various human-centric benchmarks. We achieve significant improvements over the prior state-of-the-art on Humans-5K (pose) by 7.6 mAP, Humans-2K (part-seg) by 17.1 mIoU, Hi4D (depth) by 22.4% relative RMSE, and THuman2 (normal) by 53.5% relative angular error.

AK

151,511 views • 1 year ago

For large-scale robotic deployment🤖 in the real-world 🌏, robots must adapt to changes in environment and objects. Ever questioned the generalizability of your robot's manipulation policy? Put it to the test with The Colosseum 🏛️. Check out our project:

For large-scale robotic deployment🤖 in the real-world 🌏, robots must adapt to changes in environment and objects. Ever questioned the generalizability of your robot's manipulation policy? Put it to the test with The Colosseum 🏛️. Check out our project:

Jiafei Duan

36,617 views • 2 years ago

Tired of teleoperating your robots? We built a way to scale robot datasets without teleop, dynamic simulation, or even robot hardware. Just one smartphone scan + one human hand demo video → thousands of diverse robot trajectories. Trainable by diffusion policy and VLA models as-is. Introducing: Real2Render2Real 👉

Tired of teleoperating your robots? We built a way to scale robot datasets without teleop, dynamic simulation, or even robot hardware. Just one smartphone scan + one human hand demo video → thousands of diverse robot trajectories. Trainable by diffusion policy and VLA models as-is. Introducing: Real2Render2Real 👉

Max Fu

69,310 views • 1 year ago

Reasoning models tailored for diverse AI tasks. 🚀 Meet Amazon Nova 2 foundation models, supporting fast, cost-effective reasoning to multimodal capabilities. Power versatile tasks like AI agents, code-generation, and Conversational AI. Choose the perfect match for your workload.

Reasoning models tailored for diverse AI tasks. 🚀 Meet Amazon Nova 2 foundation models, supporting fast, cost-effective reasoning to multimodal capabilities. Power versatile tasks like AI agents, code-generation, and Conversational AI. Choose the perfect match for your workload.

Amazon Web Services

2,074,714 views • 6 months ago

Exciting progress on Vision-Language-Action models from a collaboration between San Francisco-based Physical Intelligence (π) and China’s AGIBOT: (π)’s single model can autonomously perform diverse tasks on the AGIBOT G1 robot, using both humanoid hands and two-finger grippers.

Exciting progress on Vision-Language-Action models from a collaboration between San Francisco-based Physical Intelligence (π) and China’s AGIBOT: (π)’s single model can autonomously perform diverse tasks on the AGIBOT G1 robot, using both humanoid hands and two-finger grippers.

The Humanoid Hub

32,575 views • 1 year ago

MotionGPT: Human Motion as a Foreign Language paper page: Though the advancement of pre-trained large language models unfolds, the exploration of building a unified model for language and other multi-modal data, such as motion, remains challenging and untouched so far. Fortunately, human motion displays a semantic coupling akin to human language, often perceived as a form of body language. By fusing language data with large-scale motion models, motion-language pre-training that can enhance the performance of motion-related tasks becomes feasible. Driven by this insight, we propose MotionGPT, a unified, versatile, and user-friendly motion-language model to handle multiple motion-relevant tasks. Specifically, we employ the discrete vector quantization for human motion and transfer 3D motion into motion tokens, similar to the generation process of word tokens. Building upon this "motion vocabulary", we perform language modeling on both motion and text in a unified manner, treating human motion as a specific language. Moreover, inspired by prompt learning, we pre-train MotionGPT with a mixture of motion-language data and fine-tune it on prompt-based question-and-answer tasks. Extensive experiments demonstrate that MotionGPT achieves state-of-the-art performances on multiple motion tasks including text-driven motion generation, motion captioning, motion prediction, and motion in-between.

MotionGPT: Human Motion as a Foreign Language paper page: Though the advancement of pre-trained large language models unfolds, the exploration of building a unified model for language and other multi-modal data, such as motion, remains challenging and untouched so far. Fortunately, human motion displays a semantic coupling akin to human language, often perceived as a form of body language. By fusing language data with large-scale motion models, motion-language pre-training that can enhance the performance of motion-related tasks becomes feasible. Driven by this insight, we propose MotionGPT, a unified, versatile, and user-friendly motion-language model to handle multiple motion-relevant tasks. Specifically, we employ the discrete vector quantization for human motion and transfer 3D motion into motion tokens, similar to the generation process of word tokens. Building upon this "motion vocabulary", we perform language modeling on both motion and text in a unified manner, treating human motion as a specific language. Moreover, inspired by prompt learning, we pre-train MotionGPT with a mixture of motion-language data and fine-tune it on prompt-based question-and-answer tasks. Extensive experiments demonstrate that MotionGPT achieves state-of-the-art performances on multiple motion tasks including text-driven motion generation, motion captioning, motion prediction, and motion in-between.

AK

125,319 views • 3 years ago