Robots might learn better from video than from language! 📼 Most Vision-Language-Action (VLA) models learn what to do from text, but still struggle with how things move in the real world. That makes them data-hungry and slow to train. mimic video takes a different route. Instead of grounding robot...

49,920 views • 5 months ago • via X (Twitter)

Related videos

🚨 BREAKING: Microsoft's first robotics foundation model! 🤯
Microsoft just announced Rho-alpha (ρα), their first robotics model derived from the Phi series of vision-language models. Rho-alpha translates natural language commands into control signals for robotic systems performing bimanual manipulation tasks. Commands like "push the green button with the right gripper," "pull out the red wire," "flip the top switch on," or "turn the knob to position 5" get executed directly by dual-arm robots.
What makes this different from standard vision-language-action (VLA) models is the additional modalities. Rho-alpha is a VLA+ model that adds tactile sensing to the perceptual mix, with plans to incorporate force feedback. On the learning side, the model is designed to continually improve during deployment by learning from human feedback.
The training approach combines trajectories from physical demonstrations and simulated tasks with web-scale visual question answering data. Since teleoperation data is scarce and expensive, Microsoft is using NVIDIA Isaac Sim on Azure to generate physically accurate synthetic datasets via reinforcement learning. These simulated trajectories get combined with commercial and open physical demonstration datasets.
The model is currently under evaluation on dual-arm setups and humanoid robots. Microsoft is opening an Early Access Program for organizations interested in evaluating Rho-alpha.
Robots that can adapt to dynamic situations and human preferences are more useful in real environments and more trusted by the people operating them.

Lukas Ziegler

60,805 views • 4 months ago

Excited to announce GR00T N1, the world’s first open foundation model for humanoid robots! We are on a mission to democratize Physical AI. The power of general robot brain, in the palm of your hand - with only 2B parameters, N1 learns from the most diverse physical action dataset ever compiled and punches above its weight:
- Real humanoid teleoperation data.
- Large-scale simulation data: we are open-sourcing 300K+ trajectories!
- Neural trajectories: we apply SOTA video generation models to “hallucinate” new synthetic data that features accurate physics in pixels. Using Jensen’s words, “systematically infinite data”!
- Latent actions: we develop novel algorithms to extract action tokens from in-the-wild human videos and neural generated videos.
GR00T N1 is a single end-to-end neural net, from photons to actions:
- Vision-Language Model (System 2) that interprets the physical world through vision and language instructions, enabling robots to reason about their environment and instructions, and plan the right actions.
- Diffusion Transformer (System 1) that “renders” smooth and precise motor actions at 120 Hz, executing the latent plan made by System 2.
We deploy N1 on GR1 robot, 1X Neo robot, and a large collection of simulation benchmarks. N1 achieves up to +30% boost in diverse manipulation tasks for household and industrial settings. While humanoid robots are the main focus of N1, our model also supports cross-embodiment. We finetune it to work on the $110 HuggingFace LeRobot SO100 robot arm! Open robot brain runs on open hardware. Sounds just right.
Let’s solve robotics, together, one token at a time. Links to our Whitepaper, Github repo, HuggingFace model, and open dataset page in the thread: 🧵

Jim Fan

465,670 views • 1 year ago

I don’t know if we live in a Matrix, but I know for sure that robots will spend most of their lives in simulation. Let machines train machines. I’m excited to introduce DexMimicGen, a massive-scale synthetic data generator that enables a humanoid robot to learn complex skills from only a handful of human demonstrations. Yes, as few as 5!
DexMimicGen addresses the biggest pain point in robotics: where do we get data? Unlike with LLMs, where vast amounts of texts are readily available, you cannot simply download motor control signals from the internet. So researchers teleoperate the robots to collect motion data via XR headsets. They have to repeat the same skill over and over and over again, because neural nets are data hungry. This is a very slow and uncomfortable process.
At NVIDIA, we believe the majority of high-quality tokens for robot foundation models will come from simulation. What DexMimicGen does is to trade GPU compute time for human time. It takes one motion trajectory from human, and multiplies into 1000s of new trajectories. A robot brain trained on this augmented dataset will generalize far better in the real world.
Think of DexMimicGen as a learning signal amplifier. It maps a small dataset to a large (de facto infinite) dataset, using physics simulation in the loop. In this way, we free humans from babysitting the bots all day.
The future of robot data is generative. The future of the entire robot learning pipeline will also be generative. 🧵

Jim Fan

165,215 views • 1 year ago