Traditional tokenization methods for robotic actions struggle with high-frequency, dexterous tasks due to redundancy and inefficiency. Inspired by JPEG compression, Physical Intelligence has developed a compressed action representation that accelerates VLA model training 5x.
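As a rough illustration of what a JPEG-style action representation could look like, the sketch below applies a discrete cosine transform (DCT) to a chunk of actions and rounds the coefficients to integers, which is the core step in JPEG. This is only a sketch under stated assumptions, not Physical Intelligence's implementation: the function names, the `scale` parameter, and the toy trajectory are all hypothetical, and the description above does not say how the actual representation is computed.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Return the orthonormal DCT-II matrix of size n x n."""
    k = np.arange(n)[:, None]  # frequency index (rows)
    t = np.arange(n)[None, :]  # time index (columns)
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * t + 1) * k / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)  # the constant row uses a different normalization
    return m


def compress_chunk(actions: np.ndarray, scale: float = 10.0) -> np.ndarray:
    """Hypothetical helper: DCT along time, then round to integers.

    actions has shape (T, D): T timesteps, D action dimensions.
    Smooth trajectories put most of their energy in a few low
    frequencies, so many rounded coefficients come out as zero.
    """
    coeffs = dct_matrix(actions.shape[0]) @ actions
    return np.round(coeffs * scale).astype(np.int64)


def decompress_chunk(q: np.ndarray, scale: float = 10.0) -> np.ndarray:
    """Invert compress_chunk up to rounding error (the matrix is orthonormal)."""
    return dct_matrix(q.shape[0]).T @ (q / scale)


if __name__ == "__main__":
    # Toy high-frequency control signal: 50 timesteps, 2 action dimensions.
    t = np.linspace(0.0, 1.0, 50)
    actions = np.stack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)], axis=1)
    q = compress_chunk(actions)
    recon = decompress_chunk(q)
    print("nonzero coefficients:", np.count_nonzero(q), "of", q.size)
    print("RMS reconstruction error:", np.sqrt(np.mean((recon - actions) ** 2)))
```

In this sketch a larger `scale` keeps more detail at the cost of more nonzero coefficients. A real tokenizer would still need a further step to turn the sparse integers into a token sequence, which is left out here.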

The Humanoid Hub

108,094 subscribers

26,198 views • 1 year ago • via X (Twitter)

7 Comments

The Humanoid Hub • 1 year ago

More details in this thread:

UserInterface • 2 years ago

Unveiling the Future of Prompt Engineering for Better AI Interactions #tech

navuud • 1 year ago

How does one build a laundry robot like this at home 😁

Brian Bellia • 1 year ago

So, does this mean the end of teleoperation as a means of training humanoids? I hope it spells the end of teleoperation, period. At this stage, it should be autonomous or nothing - except in rare cases like Optimus catching a ball.

The Humanoid Hub • 1 year ago

The DROID dataset they are using for training is generated with teleop

Disarm.AGI.UBI • 1 year ago

Give'm some practice. They will get better.

Rethynk AI • 1 year ago

Really like the way it put the 2nd on the top of the first. After intellectual capital, machines are getting better understanding of physical environment.

Related Videos

This dexterous robotic hand can see, feel, and solve a Rubik’s cube. BrainCo's Revo 3 dexterous robotic hand, a Chinese-developed system with integrated vision, full-palm tactile sensors, and high-precision actuators for complex manipulation tasks.

Science girl

19,923 views • 2 months ago

Scaling vision-language-action (VLA) models to high-DoF dexterous hands has long been a "holy grail" challenge due to the high-dimensional action space and data scarcity. As a wrap up of the year 2025, we are releasing GR-Dexter, a holistic hardware-model-data framework for generalist manipulation on a bimanual dexterous-hand robot. This is the first VLA system to achieve:
✅ High-DoF Control: Managing a 56-DoF bimanual system (21-DoF per hand).
✅ Long-Horizon Tasks with tool use: Vacuuming, bread serving with tongs, and table decluttering.
✅ Open-World Generalization: Robust performance with unseen objects and abstract instructions.
Project page: ArXiv:

Xiao Ma

93,692 views • 5 months ago

Anyone skeptical about AI’s ability to perform diverse physical tasks should watch this. Physical Intelligence's π₀ model in action: 18 minutes of bimanual robots autonomously handling complex, dexterous chores.

The Humanoid Hub

210,404 views • 1 year ago

📢 First contact between a frontier model and robots! Gemini Robotics is a SOTA generalist Vision-Language-Action model bringing frontier model intelligence to the physical world. It's an extremely capable model enabling dexterous, steerable, and general robot control. 🧵⬇️

Ted Xiao

152,423 views • 1 year ago

🚀 First step to unlocking Generalist Robots! Introducing 🤖LAPA🤖, a new SOTA open-sourced 7B VLA pretrained without using action labels. 💪SOTA VLA trained with Open X (outperforming OpenVLA on cross and multi embodiment) 😯LAPA enables learning from human videos, unlocking potential for robotic foundation model ❗Over 30x pretraining efficiency for VLA training 🤗Code and checkpoints are all open-sourced!

Seonghyeon Ye

33,239 views • 1 year ago

Really excited to share what I've been working on with my colleagues at Physical Intelligence! We've developed a prototype robotic foundation model that can fold laundry, assemble a box, bus a table, and many other things. We've written a paper and blog post about it. 🧵👇

Sergey Levine

114,931 views • 1 year ago

3D-VLA: A 3D Vision-Language-Action Generative World Model. Recent vision-language-action (VLA) models rely on 2D inputs, lacking integration with the broader realm of the 3D physical world. Furthermore, they perform action prediction by learning a direct mapping from

AK

84,940 views • 2 years ago

Some great videos from the Gemini VLA release. These kinds of dexterous tasks require tons of coordination and interact with materials that are very hard to model in simulation. Good use of e2e learning and yet more evidence this is the right direction for robotics

Chris Paxton

30,949 views • 1 year ago

We developed RECAP at Physical Intelligence to apply RL and interventions to π0.6, achieving high success rates and throughput on several challenging tasks! Watching these policies operate successfully for hours gives an appreciation for what the method can do

Michael Equi

14,241 views • 7 months ago

Thrilled to announce my company Scout AI has officially emerged from stealth. At Scout, we’re building the robotic foundation model for defense, bringing Silicon Valley physical intelligence to the U.S. military. Several updates to share:
→ We raised an oversubscribed $15M seed round led by Align Ventures and Booz Allen Ventures
→ We’re unveiling Fury – the first Vision-Language-Action (VLA) foundation model purpose built for defense robotics
→ We have been selected for two DoD contracts to deploy Fury
Giddy up 🫡🇺🇸

Colby Adcock

131,509 views • 1 year ago

An interactive world model developed by NVIDIA in collaboration with academic partners.
- DreamDojo turns egocentric human video data into physical intelligence.
- Human data is more scalable than robotics data but lacks action labels.
- To solve this, a dedicated action model extracts latent actions by identifying physics and motion deltas between frames.
Training
- A massive 44k hours of video data are used for pre-training.
- Post-training on small-scale robot datasets maps human physics to specific robot embodiments.
- An additional distillation stage converts the model into an autoregressive, few-step diffusion model, enabling real-time, action-controllable simulation.
Primary Use Cases
- Live Teleoperation: Controlling a robot inside a world simulation in real-time.
- Model-based Planning: Previewing and curating the best actions for improved success.
- Policy Evaluation: Testing robot policies in realistic, out-of-distribution scenarios.
Everything that's open-sourced: weights, code, post-training dataset, eval set, and details to reproduce.

The Humanoid Hub

11,575 views • 4 months ago

JARVIS-VLA just dropped on Hugging Face. Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse. The authors obtain VLA models in Minecraft that can follow human instructions on over 1k different atomic tasks, including crafting, smelting, cooking, mining, and killing. Experiments demonstrate that post-training on non-trajectory tasks leads to a significant 40% improvement over the best agent baseline on a diverse set of atomic tasks. Furthermore, they demonstrate that the approach surpasses traditional imitation learning-based policies in Minecraft, achieving state-of-the-art performance.

AK

60,162 views • 1 year ago

Sim2Real RL for Vision-Based Dexterous Manipulation on Humanoids. TLDR - we train a humanoid robot with two multifingered hands to perform a range of dexterous manipulation tasks, with robust generalization and high performance, without human demonstration :D

Toru

49,561 views • 1 year ago

NIMHANS has shown that modern systems of healthcare can successfully incorporate traditional methods such as Yoga to alleviate both mental and physical distress.

President of India

30,737 views • 1 year ago

We just open-sourced G0 Plus VLA model & launched "Pick Up Anything" demo. See our robot perform diverse real-world tasks through pure language. No specialized training needed. That's zero-shot embodied intelligence. #VLA #Robotics #OpenSource 🔗Try now：

Galaxea Dynamics

98,822 views • 5 months ago

A major physical painting that was initiated with A.I was acquired today by the forward thinking and innovative Blondie. I recently posted a WIP and Blondie, without hesitation, commented ‘dibs’. She didn’t know yet that the WIP was 100% A.I. and she didn’t know it also inspired a physical. The A.I. work was made by training a MidJourney Model on some of my works. The image that emerged inspired me to create a version with the traditional medium of oil paint on canvas. I am super thrilled to announce that the visionary collector Blondie acquired “Drive-by Diagnosis”, 48” X 60”, for 9.45 eth. 🙏🙏🙏 LFG

hafftka

24,740 views • 2 years ago

Microsoft just dropped VITRA-VLA, a new Vision-Language-Action model for robotics on Hugging Face. It learns dexterous manipulation from over 1 million real-life human hand activity videos.

DailyPapers

19,092 views • 6 months ago

Today, we're joined by Sergey Levine, associate professor at UC Berkeley EECS and co-founder of Physical Intelligence to discuss π0 (pi-zero), a general-purpose robotic foundation model. We dig into the model architecture, which pairs a vision language model (VLM) with a diffusion-based action expert, and the model training "recipe," emphasizing the roles of pre-training and post-training with a diverse mixture of real-world data to ensure robust and intelligent robot learning. We review the data collection approach, which uses human operators and teleoperation rigs, the potential of synthetic data and reinforcement learning in enhancing robotic capabilities, and much more. We also introduce the team’s new FAST tokenizer, which opens the door to a fully Transformer-based model and significant improvements in learning and generalization. Finally, we cover the open-sourcing of π0 and future directions for their research.
🎧 / 🎥 Listen or watch the full episode on our page:
📖 CHAPTERS
00:00 - Introduction
2:14 - Physical Intelligence
3:47 - Key challenges in robotic learning
6:13 - Reinforcement learning in π0 and robotic foundation models
8:36 - π0 VLM model architecture
15:33 - π0 model recipe
18:39 - Pre-training dataset
22:47 - Post-training
24:23 - Laundry folding demo
31:32 - Scaling laws on π0 model
34:57 - FAST
40:26 - Open sourcing π0
43:37 - Other robot types
46:27 - Future directions

The TWIML AI Podcast

19,942 views • 1 year ago