Microsoft just dropped VITRA-VLA, a new Vision-Language-Action model for robotics on Hugging Face. It learns dexterous manipulation from over 1 million real-life human hand activity videos.

DailyPapers

20,162 subscribers

19,181 views • 7 months ago • via X (Twitter)

Related videos

Microsoft just dropped MineWorld on Hugging Face: a Real-Time and Open-Source Interactive World Model on Minecraft

AK

95,035 views • 1 year ago

Xiaomi-Robotics-1 just dropped on Hugging Face 🔥 A robot foundation model trained on 100,000 hours of real-world manipulation. They turned it loose in a real apartment: folding laundry, loading the washer, doing the dishes, packing a suitcase. Fully autonomous.

Victor M

129,867 views • 11 days ago

How to learn dexterous manipulation for any robot hand from a single human demonstration? Check out DexMachina, our new RL algorithm that learns long-horizon, bimanual dexterous policies for a variety of dexterous hands, articulated objects, and complex motions.

Mandi Zhao

120,954 views • 1 year ago

The next evolution: VLA+ models
Just yesterday Microsoft Research released Rho-alpha (ρα) – their first robotics model, built on the Phi family. While most Vision-Language-Action (VLA) models stop at vision and language, Rho-alpha adds:
▪️ Tactile sensing to feel objects during manipulation
▪️ Online learning that lets it improve from human corrections (via teleoperation, 3D mouse or other tools) in real-time even after deployment.
Both these sides make adaptability central rather than incidental. Microsoft calls it a VLA+ model, positioning it as an extension beyond what current VLA systems support.
➡️ Today Rho-alpha can control dual-arm robot setups to perform tasks such as:
• Manipulating the BusyBox following natural-language instructions
• Plug insertion
• Toolbox packing and object arrangement with bimanual coordination
But to understand why this "plus" matters, we need to understand what came before. Here, we'll take you through the entire landscape of VLA models – Gemini Robotics, π0, SmolVLA, Helix, ACoT-VLA and others:

Turing Post

62,362 views • 6 months ago

Vision-Language Foundation model should go to 3D for robotics!🤖 CoRL23 Oral: GNFactor learns Generalizable Neural Feature Fields for language conditioned manipulation on diverse scenes. It unifies 3D➕Stable Diffusion features using generalizable NeRFs.

Xiaolong Wang

56,268 views • 2 years ago

Another day, another humanoid robot from China. AGIBOT introduces GO-1, a generalist foundation model that integrates a vision-language model with a latent planner for enhanced long-horizon and dexterous manipulation.

Chubby♨️

23,513 views • 1 year ago

🚀🚀🚀 Ever wondered what it takes for robots to handle real-world household tasks? long-horizon execution, deformable object dexterity, and unseen object generalization — meet GR-3, ByteDance Seed’s new Vision-Language-Action (VLA) model! GR-3 is a generalizable Vision-Language-Action (VLA) model with strong capabilities in complex long-horizon tasks. It understands unseen abstract concepts, manipulates deformable objects robustly, and adapts to novel settings with minimal human data. ✨ Generalization: Generalizes well to unseen objects, environments, and even instructions with abstract concepts. ✨ Long-Horizon Manipulation: Completes long-horizon tasks with strong instruction-following capabilities. ✨ Deformable Object Manipulation: Manipulate deformable objects robustly. Project Page: Arxiv: #ByteDance #ByteDanceSeed #GR3 #VLA #Robotics #FoundationModels

Xiao Ma

46,323 views • 1 year ago

Pi0 from Physical Intelligence is one of the best generalist Vision-Language-Action (VLA) models for robotics (and predecessor of Pi0.5). 🧵 Here's how it works + how you can fine-tune it on your own robot for a simple task:

Ilia

32,956 views • 10 months ago

VLA-JEPA just dropped in LeRobot 🤖 What makes this model special is that it does not just learn what action to take from a given observation, it also leverages a JEPA world model to learn action-relevant dynamics. During training, the VLA leverages V-JEPA2 by conditioning its predictor. This clever trick adds a world modeling objective to the training, which also allows pretraining on human videos. At inference, the world model is dropped entirely, keeping only a standard VLA architecture: Qwen backbone and action head. The demo here was only fine-tuned on 13 examples, showing great pretraining capability and running in real time on NVIDIA Robotics DGX Spark! VLA-JEPA is the first world model to be ported to LeRobot, and I feel like it won't be the last 🚀 Thomas Wolf clem 🤗

LeRobot

319,295 views • 1 month ago

Today, we announced RT-2: a first-of-its-kind vision-language-action model to control robots. 🤖 It learns from both web and robotics data and translates this knowledge into generalised instructions. Find out more:

Google DeepMind

537,833 views • 3 years ago

Scaling vision-language-action (VLA) models to high-DoF dexterous hands has long been a "holy grail" challenge due to the high-dimensional action space and data scarcity. As a wrap up of the year 2025, we are releasing GR-Dexter, a holistic hardware-model-data framework for generalist manipulation on a bimanual dexterous-hand robot. This is the first VLA system to achieve: ✅ High-DoF Control: Managing a 56-DoF bimanual system (21-DoF per hand). ✅ Long-Horizon Tasks with tool use: Vacuuming, bread serving with tongs, and table decluttering. ✅ Open-World Generalization: Robust performance with unseen objects and abstract instructions. Project page: ArXiv:

Xiao Ma

93,908 views • 7 months ago

Why LLMs are a dead end for human-level intelligence, and especially for Physical AI / Robotics. The next leap isn’t bigger language models. It’s World Models. I just dropped a full 1-hour presentation from Shanghai: “World Models: the ChatGPT moment for robotics?” → Why LLMs hit a wall → Why action-conditioned world models planning in latent space are the real path → Live World Forge demo with LeWorldModel + Hugging Face LeRobot Watch here. The future of intelligence is embodied, not just chatty.

abdel

37,570 views • 1 month ago

Robotics just hit a dexterity milestone 🤯! Sharpa Robotics demonstrated autonomous dual-hand apple peeling using its new MoDE-VLA system. Using human-like dexterous hands with tactile sensing, the robot can feel contact and adjust its grip while rotating and peeling the apple. While the system achieved 30% success, the 73% peel completion rate shows robots are starting to master the fine, contact-rich manipulation needed for real world tasks.

SciTech Era

39,306 views • 4 months ago

Humans learn and improve from failures. Similarly, foundation models adapt based on human feedback. Can we leverage this failure understanding to enhance robotics systems that use foundation models? Introducing AHA—a vision-language model for detecting and reasoning over failures in robotic manipulation. Project page: 🧵Thread👇 Aha!

Jiafei Duan

48,777 views • 1 year ago

🤖 What if a robot could perform a new task just from a natural language command, with zero demonstrations? Our new work, NovaFlow, makes it possible! We use a pre-trained video generative model to create a video of the task, then translate it into a plan for real-world robot execution. 1/6 #Robotics #AI #ZeroShot #Manipulation

Hongyu Li

105,628 views • 9 months ago

UK-based startup 'Humanoid' announced KinetIQ, an AI framework with a Vision-Language-Action (VLA) model at its core. It uses a four-layer architecture: fleet orchestration, task decomposition, VLA, and RL for whole-body control. It works on both bipedal and wheeled robots.

The Humanoid Hub

21,954 views • 5 months ago

🔥 #ICRA2026 Best Paper Finalist The era of "robot VLA = single-arm gripper" is ending. Introducing Dexora — the first open-source Vision-Language-Action system for dual-arm, dual-hand, 36-DoF dexterous manipulation. 🦾 Dual Arms 🖐️ Dual Hands 🎯 36 DoF Control 🌍 Open Source Trained on: • 100K simulated trajectories • 10K real-world demonstrations Dexora achieves: ✓ 90%+ success on basic manipulation ✓ Strong dexterous manipulation performance ✓ Cross-embodiment generalization Our key hypothesis: Train on the hardest embodiment. Transfer to simpler robots later. Instead of scaling up gripper policies, we train directly in the most expressive action space and project downward to simpler embodiments. This may be a practical path toward universal robot controllers. 🎥 Demos: 📄 Paper:

Hao Zhao

17,048 views • 1 month ago

JARVIS-VLA just dropped on Hugging Face. "Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse": the authors obtain VLA models in Minecraft that can follow human instructions on over 1k different atomic tasks, including crafting, smelting, cooking, mining, and killing. Experiments demonstrate that post-training on non-trajectory tasks leads to a significant 40% improvement over the best agent baseline on a diverse set of atomic tasks. Furthermore, they demonstrate that the approach surpasses traditional imitation learning-based policies in Minecraft, achieving state-of-the-art performance.

AK

60,243 views • 1 year ago

The first open-source unified world model for scalable robot manipulation: 5B-parameter open-source unified video-action world model that combines policy and world modeling to generate robot actions, predict future visuals, and evaluate task progress from observations, language, and state. The model is trained on 27.3K hours of heterogeneous data (17.8K real-robot teleop, 6.5K UMI demos, 3K egocentric human videos), enabling it to perform complex manipulation tasks like faucet connecting, bag packing, and toolbox storing as shown in demo videos. The approach supports test-time action refinement and points toward deployment-driven continuous improvement via fleet data. Thanks for sharing, Jianlan Luo (Jianlan Luo)! 📌 Resource links for τ0-WM: • Project page: • GitHub (code): • Hugging Face (model weights): • Paper (PDF): ——- Weekly robotics and AI insights. Subscribe free:

Ilir Aliu

28,239 views • 1 month ago

🚀 First step to unlocking Generalist Robots! Introducing 🤖LAPA🤖, a new SOTA open-sourced 7B VLA pretrained without using action labels. 💪SOTA VLA trained with Open X (outperforming OpenVLA on cross and multi embodiment) 😯LAPA enables learning from human videos, unlocking potential for robotic foundation model ❗Over 30x pretraining efficiency for VLA training 🤗Code and checkpoints are all open-sourced!

Seonghyeon Ye

33,289 views • 1 year ago