🚀 First step to unlocking Generalist Robots! Introducing 🤖LAPA🤖, a new SOTA open-sourced 7B VLA pretrained without using action labels. 💪SOTA VLA trained with Open X (outperforming OpenVLA on cross and multi embodiment) 😯LAPA enables learning from human videos, unlocking potential for robotic foundation model ❗Over 30x pretraining efficiency...

Seonghyeon Ye

1,694 subscribers

33,239 views • 1 year ago • via X (Twitter)

9 comments

Seonghyeon Ye • 1 year ago

LAPA consists of two stages: 1) Latent Action Quantization and 2) Latent Pretraining. The first stage learns quantized actions from visual deltas between frames. In the second stage, a pretrained VLM (LWM) is trained to predict those quantized latent actions. During finetuning, we map the latent actions to real actions.
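The stages described above can be illustrated with a toy sketch. This is not the released LAPA code: the codebook, the 2-D "frame features", and the `LATENT_TO_REAL` lookup are all hypothetical stand-ins, and the real model learns its codebook and its latent-to-real mapping rather than using fixed tables.

```python
# Toy sketch of the stages named above; not the authors' implementation.
from math import dist

# Hypothetical codebook of latent actions (learned in practice, fixed here).
CODEBOOK = [
    (0.0, 0.0),   # latent 0: no visible change
    (1.0, 0.0),   # latent 1: shift right
    (-1.0, 0.0),  # latent 2: shift left
    (0.0, 1.0),   # latent 3: shift up
]

def visual_delta(frame_t, frame_next):
    """Difference between two (already encoded) frame feature vectors."""
    return tuple(b - a for a, b in zip(frame_t, frame_next))

def quantize(delta, codebook=CODEBOOK):
    """Stage 1: index of the codebook entry nearest to the visual delta."""
    return min(range(len(codebook)), key=lambda i: dist(delta, codebook[i]))

def label_video(frames):
    """Pseudo-label consecutive frame pairs with latent action indices.
    Stage 2 would train a VLM to predict these indices; that is omitted here."""
    return [quantize(visual_delta(a, b)) for a, b in zip(frames, frames[1:])]

# Finetuning stage, simplified to a hypothetical lookup table. In practice the
# mapping is learned from a small set of action-labeled robot trajectories.
LATENT_TO_REAL = {0: "hold", 1: "move_right", 2: "move_left", 3: "move_up"}

frames = [(0.0, 0.0), (0.9, 0.1), (1.0, 1.2), (1.0, 1.2)]
latents = label_video(frames)
real_actions = [LATENT_TO_REAL[z] for z in latents]
```

The point of the sketch is that no action labels are needed to produce `latents`: only consecutive frames are used, which is why unlabeled videos can serve as pretraining data.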

Seonghyeon Ye • 1 year ago

LAPA beats OpenVLA across cross- and multi-embodiment tasks, all without using action labels during pretraining! 🚀 A step forward in robust, generalizable robot learning.

Seonghyeon Ye • 1 year ago

We can build LAPA from 220K human videos, where action labels do not exist and the embodiment gap is huge. Still, LAPA (Human Videos) outperforms OpenVLA (Bridge) 😯

Seonghyeon Ye • 1 year ago

What do latent actions mean? Latent actions correspond to 🌟semantic 🌟actions across different robot embodiments. Interestingly, despite different robot embodiments, the same latent action maps to similar movements. This suggests latent actions form a ‘shared’ representation space, much like images or language.

Seonghyeon Ye • 1 year ago

We also do closed-loop rollout of LAPA for analysis (rollout vs ground truth). Given ‘pick up the broccoli from pot,’ it successfully picks up the object, which then disappears. This highlights LAPA’s potential as an emerging 'world model' with impressive predictive abilities! 🌍
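A closed-loop rollout of this kind can be sketched as follows. Everything here is a hypothetical stand-in: a "frame" is reduced to a single integer (gripper-to-object distance), `predict_latent` plays the role of the policy, and `decode_next_frame` plays the role of a frame decoder. The assumption is only that each generated frame is fed back as the next input.

```python
# Toy closed-loop rollout; both models below are hypothetical stand-ins.

def predict_latent(frame: int, instruction: str) -> int:
    """Stand-in policy: emit latent 1 ("approach") until the distance is 0."""
    return 1 if frame > 0 else 0

def decode_next_frame(frame: int, latent: int) -> int:
    """Stand-in decoder: a frame is just the gripper-to-object distance."""
    return frame - latent

def closed_loop_rollout(first_frame: int, instruction: str, steps: int) -> list:
    frames = [first_frame]
    for _ in range(steps):
        latent = predict_latent(frames[-1], instruction)
        # The generated frame, not a real observation, is fed back in.
        frames.append(decode_next_frame(frames[-1], latent))
    return frames

rollout = closed_loop_rollout(3, "pick up the broccoli from pot", steps=5)
ground_truth = [3, 2, 1, 0, 0, 0]  # made-up reference trajectory
mismatches = sum(r != g for r, g in zip(rollout, ground_truth))
```

Because errors compound when predictions are fed back, comparing `rollout` against a ground-truth trajectory step by step is a stricter check than scoring single-step predictions.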

Seonghyeon Ye • 1 year ago

Everything is open-sourced! Code 💻: Hugging Face 🤗: Website 🌐:

Seonghyeon Ye • 1 year ago

Co-led with @jang_yoel. With wonderful collaborators: Byeongguk Jeon, @joocjun, @jw2yang4ai, Baolin Peng, @AjayMandlekar, Reuben Tan, Yu-Wei Chao, @billyuchenlin, Lars Liden. And advisors: @kimin_le2, @JianfengGao0217, @LukeZettlemoyer, Dieter Fox, @seo_minjoon, from @kaist_ai, @UW, @Microsoft, @nvidia, @allen_ai

Lingxuan Wu • 1 year ago

A firm step towards scaling up VLA models! Excellent job!

Jiazhi Yang • 1 year ago

Incredible job! I view it as the first practical evidence that unsupervised latent action could work well at such a large scale.

Related videos

✨ Introducing 𝐎𝐩𝐞𝐧𝐕𝐋𝐀 — an open-source vision-language-action model for robotics! 👐 - SOTA generalist policy - 7B params - outperforms Octo, RT-2-X on zero-shot evals 🦾 - trained on 970k episodes from OpenX dataset 🤖 - fully open: model/code/data all online 🤗 🧵👇

Moo Jin Kim

226,991 views • 2 years ago

Excited to introduce 𝐋𝐀𝐏𝐀: the first unsupervised pretraining method for Vision-Language-Action models. Outperforms SOTA models trained with ground-truth actions. 30x more efficient than conventional VLA pretraining. 📝: 🧵 1/9

Joel Jang

46,018 views • 1 year ago

🆕 Introducing JAT, the first open-source multi-modal, multi-task multi-domain agent! 🤖 A step toward open generalist agents! 🚀 📰 Blog:

Quentin Gallouédec

73,212 views • 2 years ago

VLA-JEPA just dropped in LeRobot 🤖 What makes this model special is that it does not just learn what action to take from a given observation, it also leverages a JEPA world model to learn action-relevant dynamics. During training, the VLA leverages V-JEPA2 by conditioning its predictor. This clever trick adds a world modeling objective to the training, which also allows pretraining on human videos. At inference, the world model is dropped entirely, keeping only a standard VLA architecture: Qwen backbone and action head. The demo here was only fine-tuned on 13 examples, showing great pretraining capability and running in real time on NVIDIA Robotics DGX Spark! VLA-JEPA is the first world model to be ported to LeRobot, and I feel like it won't be the last 🚀 Thomas Wolf clem 🤗

LeRobot

302,428 views • 26 days ago

1/5 🚀 Thrilled to open-source OSCAR 🤖 — an action-conditioned world model for robotics, led by the visiting student in my group Zhuoyuan Wu! It generalizes across different robot embodiments with precise action controllability. All trained on a single GH200 GPU, and outperforms existing open-sourced baselines, which have larger model capacity and need more compute. Everything is public, including training data. 📄 Paper: 🌐 Project: 💻 Code: 🤗 Robot data: 🤗 Human data: 🤗 Weights: #Robotics #WorldModels #AI #OpenSource

Jun Gao

103,427 views • 21 days ago

We just open-sourced G0 Plus VLA model & launched "Pick Up Anything" demo. See our robot perform diverse real-world tasks through pure language. No specialized training needed. That's zero-shot embodied intelligence. #VLA #Robotics #OpenSource 🔗Try now:

Galaxea Dynamics

98,822 views • 5 months ago

🤖🤖🤖 Following RoboVerse, we introduce another work focused on Robotic Tactile Simulation - Taccel Simulator. Taccel is a high-performance simulation platform for vision-based tactile sensors and robots. 🚀🚀🚀 Boosted by Nvidia Warp, we optimize Taccel with highly parallelized simulations and support 900fps simulation with 4k+ parallel training envs. 🤝🤝🤝 Taccel is designed with user-friendly APIs and is easy to use. We open-sourced all the code and documentation. Feel free to try! Project: Preprint: Code:

Siyuan Huang

10,658 views • 1 year ago

🚀 1/7 We are thrilled to launch LLM360 — pushing the frontier of open-source & transparent LLMs! Starting with Amber (7B) & CrystalCoder (7B), we are releasing brand new pre-trained LLMs with all training code, data, and up to 360 model checkpoints. 🔗

LLM360

329,456 views • 2 years ago

🤖Humanoid robots are not exclusively for big companies. Here is what a few people can do. 👉Introducing Mobile-TeleVision, built upon our previously open-sourced immersive teleop system: Open-TeleVision.

Xuxin Cheng

90,850 views • 1 year ago

🤯ByteDance just open-sourced UI-TARS: 2 SOTA models (7B & 72B) + a PC/macOS app to control your computer with VLMs. And they are not messing around: beating GPT-4o and Claude, SOTA across 10 benchmarks. Will you be installing this on your PC?

Alex Volkov

69,738 views • 1 year ago

VLAs offer an avenue for generalist robot policies; however, naively following the action predictions leads to brittle or unsafe behaviours. We introduce VLAPS, which integrates model-based search with pre-trained VLA policies to improve performance without additional training.

Glen Berseth

13,374 views • 10 months ago

Microsoft just dropped VITRA-VLA, a new Vision-Language-Action model for robotics on Hugging Face. It learns dexterous manipulation from over 1 million real-life human hand activity videos.

DailyPapers

19,092 views • 6 months ago

We are open-sourcing Wall-OSS-0.5. Pretrain Once, Act Anywhere. Wall-OSS-0.5 is a VLA model for real-world robotic manipulation, exploring whether pretraining alone can produce robot capabilities directly testable on physical hardware before task-specific fine-tuning. Key technical highlights: • Gradient-bridged co-training • Vision-Aligned RVQ Action Tokenizer • Action-Space Supervision • DMuon distributed optimizer In zero-shot real-robot evaluation, the pretrained checkpoint achieved task-progress scores above 80 on multiple tasks, including Block Sorting, Fruit Sorting, Ring Stacking, and Rope Tightening. Paper, code, blog, and uncut videos:

X Square Robot

24,313 views • 1 month ago

Introducing TraceVLA: a fully open-source Vision-Language-Action model reimagining spatial-temporal awareness: ✨ 3.5x gains on real robots, SOTA in simulation 💡 Fine-tunes on just 150K trajectories ⚡ Compact 4B model = 7B performance

Yongyuan Liang

39,500 views • 1 year ago

🤖 New paper: MobileVLA-R1 A unified VLA system that brings real reasoning + continuous control to quadruped robots. CoT dataset, 2-stage training, real-world deployment. 📄paper & code & demo:

Hao Tang (hiring postdocs)

27,232 views • 7 months ago

UK-based startup 'Humanoid' announced KinetIQ, an AI framework with a Vision-Language-Action (VLA) model at its core. It uses a four-layer architecture: fleet orchestration, task decomposition, VLA, and RL for whole-body control. It works on both bipedal and wheeled robots.

The Humanoid Hub

21,954 views • 4 months ago

🚀 We open-sourced LongLive — interactive, real-time long-video generation. 👥Generates video in real time as users enter text prompts. ⚡️20.7 FPS on a single H100,⏱️up to 240s per clip. 🎬Fine-tunes SOTA short-video models (e.g., Wan) into long-video generators. 🌍One step closer to World Models. All code for training & inference, model weights, demo page, and videos released! Paper: Code: Model: Demo Page: Introduction Video:

Yukang Chen

11,835 views • 9 months ago

NVIDIA has open-sourced SONIC, a humanoid behavior foundation model that gives robots a core set of motor skills learned from large-scale human motion data.

The Humanoid Hub

33,542 views • 4 months ago

We’ve teamed up with X Square Robot to integrate WALL-OSS, a powerful new VLA foundation model into LeRobot!

LeRobot

21,673 views • 6 months ago

🎤🎤 Excited to introduce COME-robot🤖🤖, Closed-Loop Open-Vocabulary Mobile Manipulation with GPT-4V. It is the first closed-loop framework utilizing the vision-language foundation model for open-ended reasoning and adaptive planning in real-world scenarios. COME-robot demonstrates a significant improvement in task success rate (~25%) compared to SOTA methods. Project: Arxiv:

Siyuan Huang

22,291 views • 2 years ago