Announcing DreamDojo: our open-source, interactive world model that takes robot motor controls and generates the future in pixels. No engine, no meshes, no hand-authored dynamics. It's Simulation 2.0. Time for robotics to take the bitter lesson pill.

Real-world robot learning is bottlenecked by time, wear, safety, and resets. If we want Physical AI to move at pretraining speed, we need a simulator that adapts to pretraining scale with as little human engineering as possible.

Our key insights: (1) human egocentric videos are a scalable source of first-person physics; (2) latent actions make them "robot-readable" across different hardware; (3) real-time inference unlocks live teleop, policy eval, and test-time planning inside a dream.

We pre-train on 44K hours of human videos: cheap, abundant, and collected with zero robot-in-the-loop. Humans have already explored the combinatorics: we grasp, pour, fold, assemble, fail, and retry, across cluttered scenes, shifting viewpoints, changing light, and hour-long task chains, at a scale no robot fleet could match. The missing piece: these videos have no action labels. So we introduce latent actions: a unified representation inferred directly from videos that captures "what changed between world states" without knowing the underlying hardware. This lets us train on any first-person video as if it came with motor commands attached. As a result, DreamDojo generalizes zero-shot to objects and environments never seen in any robot training set, because humans saw them first.

Next, we post-train onto each robot to fit its specific hardware. Think of it as separating "how the world looks and behaves" from "how this particular robot actuates." The base model follows the general physical rules, then "snaps onto" the robot's unique mechanics. It's kind of like loading a new character and scene assets into Unreal Engine, except that it is done through gradient descent and generalizes far beyond the post-training dataset.
A world simulator is only useful if it runs fast enough to close the loop. We train a real-time version of DreamDojo that runs at 10 FPS, stable for over a minute of continuous rollout. This unlocks exciting possibilities:

- Live teleoperation inside a dream. Connect a VR controller, stream actions into DreamDojo, and teleop a virtual robot in real time. We demo this on Unitree G1 with a PICO headset and one RTX 5090.
- Policy evaluation. You can benchmark a policy checkpoint in DreamDojo instead of the real world. The simulated success rates strongly correlate with real-world results - accurate enough to rank checkpoints without burning a single motor.
- Model-based planning. Sample multiple action proposals → simulate them all in parallel → pick the best future. Gains +17% real-world success out of the box on a fruit packing task.

We open-source everything!! Weights, code, post-training dataset, eval set, and whitepaper with tons of details to reproduce. DreamDojo is based on NVIDIA Cosmos, which is open-weight too. 2026 is the year of World Models for physical AI. We want you to build with us. Happy scaling! Links in thread:
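The model-based planning loop mentioned in the post (sample several action proposals, roll each one out in the world model, keep the best future) can be sketched roughly as below. This is only an illustration under stated assumptions, not DreamDojo's code: `toy_world_model`, `score`, and `plan` are hypothetical names, the state is a single number, and the toy dynamics stand in for a learned video model.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple


def toy_world_model(state: float, actions: Sequence[float]) -> float:
    """Hypothetical stand-in for a learned world model: each action shifts a 1-D state."""
    for action in actions:
        state += action
    return state


def score(final_state: float, goal: float) -> float:
    """Higher is better: negative distance between the predicted state and the goal."""
    return -abs(final_state - goal)


def plan(
    state: float,
    goal: float,
    world_model: Callable[[float, Sequence[float]], float],
    num_proposals: int = 16,
    horizon: int = 5,
    rng: Optional[random.Random] = None,
) -> Tuple[List[float], float]:
    """Sample action sequences, simulate each one, and return the best with its score."""
    if rng is None:
        rng = random.Random(0)
    # Each proposal is a sequence of `horizon` random actions.
    proposals = [
        [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        for _ in range(num_proposals)
    ]
    # A real system would batch these rollouts in parallel; a loop keeps the sketch simple.
    scored = [(score(world_model(state, p), goal), p) for p in proposals]
    best_score, best_actions = max(scored, key=lambda pair: pair[0])
    return best_actions, best_score


best_actions, best_score = plan(0.0, 2.0, toy_world_model, num_proposals=32)
```

With a fixed random seed, increasing `num_proposals` can only keep or improve the best score, which is the basic reason sampling more candidate futures helps.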

Jim Fan

505,011 subscribers

225,590 views • 5 months ago • via X (Twitter)

Education • Science and Technology

Related videos

An interactive world model developed by NVIDIA in collaboration with academic partners.

- DreamDojo turns egocentric human video data into physical intelligence.
- Human data is more scalable than robotics data but lacks action labels.
- To solve this, a dedicated action model extracts latent actions by identifying physics and motion deltas between frames.

Training

- A massive 44k hours of video data are used for pre-training.
- Post-training on small-scale robot datasets maps human physics to specific robot embodiments.
- An additional distillation stage converts the model into an autoregressive, few-step diffusion model, enabling real-time, action-controllable simulation.

Primary Use Cases

- Live Teleoperation: Controlling a robot inside a world simulation in real-time.
- Model-based Planning: Previewing and curating the best actions for improved success.
- Policy Evaluation: Testing robot policies in realistic, out-of-distribution scenarios.

Everything that's open-sourced: weights, code, post-training dataset, eval set, and details to reproduce.

The Humanoid Hub

11,575 views • 5 months ago

What happens when robot world models learn from human experience at scale? 🤔 DreamDojo from NVIDIA Research is a generalist robot world model pretrained on 44K hours of egocentric human videos and then post-trained on robot data to generalize across new objects and environments. After distillation, it runs at 10 FPS for live teleoperation, policy evaluation, and model-based planning. Read the ICML paper to learn more 📄

NVIDIA Robotics

21,506 views • 16 days ago

Tired of teleoperation? One human video → 1,000s of robot demos. (📍GitHub ) Scaling Robot Data Without Dynamics Simulation or Robot Hardware Real2Render2Real (R2R2R) is a new way to scale robot data without physics simulation or hardware. You take a phone scan + a single monocular human demo. It tracks the motion, renders photorealistic scenes, and generates diverse, robot-agnostic trajectories ready for training. > No teleop, no sim, no robot, just a phone and a video > Train VLA models and diffusion policies directly on the output > Supports multiple robot embodiments with kinematic consistency > 1000s of demos in 1/27 the time of real-world collection Thank you, Max Fu, for sharing!! Project: Paper: Code coming soon: It shows that with the right pipeline, you can scale robot learning data without touching a robot. One of the most interesting directions in scalable robotics today. —— Weekly robotics and AI insights. Subscribe free:

Ilir Aliu

42,804 views • 5 months ago

Testing robot policies on hardware is slow, expensive and hard to scale. World models offer a promising path to accelerating robot policy development. We're sharing new research from the Runway Robotics team, in which we simulated 8 robot policies inside our General World Model and found 0.95 correlation with real-world results. Those early results point to world model simulation as a practical substitute for hardware evaluation, comparing favorably to existing real-to-sim approaches. Learn more at the link below.
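To make the idea of "correlation with real-world results" concrete, here is a small sketch of how simulated and real success rates for a set of policies could be compared. It is an assumption-laden illustration: the post does not say which correlation measure was used (Pearson is assumed here), `pearson` and `ranking` are hypothetical helper names, and the success rates are made-up placeholder numbers, not results from any paper.

```python
from math import sqrt
from typing import List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(var_x * var_y)


def ranking(scores: Sequence[float]) -> List[int]:
    """Policy indices ordered from best to worst score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Placeholder success rates for five hypothetical policy checkpoints (NOT real data).
sim_success = [0.20, 0.35, 0.50, 0.65, 0.80]
real_success = [0.15, 0.40, 0.45, 0.70, 0.75]

r = pearson(sim_success, real_success)
same_order = ranking(sim_success) == ranking(real_success)
```

A high `r` says the simulated numbers track the real ones; `same_order` checks the more practical question of whether the simulator would rank the checkpoints the same way hardware does.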

Runway

13,464 views • 4 months ago

Excited to announce GR00T N1, the world’s first open foundation model for humanoid robots! We are on a mission to democratize Physical AI. The power of a general robot brain, in the palm of your hand - with only 2B parameters, N1 learns from the most diverse physical action dataset ever compiled and punches above its weight:

- Real humanoid teleoperation data.
- Large-scale simulation data: we are open-sourcing 300K+ trajectories!
- Neural trajectories: we apply SOTA video generation models to “hallucinate” new synthetic data that features accurate physics in pixels. Using Jensen’s words, “systematically infinite data”!
- Latent actions: we develop novel algorithms to extract action tokens from in-the-wild human videos and neural generated videos.

GR00T N1 is a single end-to-end neural net, from photons to actions:

- Vision-Language Model (System 2) that interprets the physical world through vision and language instructions, enabling robots to reason about their environment and instructions, and plan the right actions.
- Diffusion Transformer (System 1) that “renders” smooth and precise motor actions at 120 Hz, executing the latent plan made by System 2.

We deploy N1 on the GR1 robot, the 1X Neo robot, and a large collection of simulation benchmarks. N1 achieves up to +30% boost in diverse manipulation tasks for household and industrial settings. While humanoid robots are the main focus of N1, our model also supports cross-embodiment. We finetune it to work on the $110 HuggingFace LeRobot SO100 robot arm! Open robot brain runs on open hardware. Sounds just right. Let’s solve robotics, together, one token at a time. Links to our Whitepaper, Github repo, HuggingFace model, and open dataset page in the thread: 🧵

Jim Fan

466,148 views • 1 year ago

This is THE moment of Physical AI! We are officially announcing Cosmos 3: Omnimodal World Models for Physical AI 🚀 - Cosmos 3 is an omnimodal world model: within a unified architecture, it can understand and generate language, images, video, audio, and actions. - It is not just a VLM, not just a video generator, not just an audio-visual generative model, and not just a physics simulator / world-action model. It can understand images and videos, generate images, videos, and audio, simulate future worlds, predict actions, and generate robot policies—enabling models to truly begin to “touch the world.” - Cosmos 3 is the #1 open-weight reasoner / T2I / I2V / robot policy across many benchmarks. Huge thanks to every teammate who fought side by side on this journey—from architecture, data, training, infra, serving, and evaluation to post-training. Every part of this project carries an incredible amount of hard work. This was my first time leading a project as Tech Lead, and I feel truly fortunate. The future of Physical AI needs models that can not only “see” and “describe” the world, but also “imagine,” “simulate,” and “act”—and eventually close the loop with the real world. I hope Cosmos 3 can become an important starting point for this direction, and I’m excited to push Physical AI into its next stage together with the open-source community. Welcome to the era of Physical AI. HuggingFace: Project Website: Code:

Max Zhaoshuo Li 李赵硕

1,078,049 views • 1 month ago

📢 Announcing one of the most exciting works from us this year on **scalable robot policy evaluation through real-to-sim transfer**, moving toward a scalable evaluation engine with structured world models that capture the appearance, geometry, and dynamics of environments involving deformable objects. 🤖 Evaluation remains one of the biggest bottlenecks in building general-purpose robots. Today, robots are still evaluated only in the real world, which is **orders of magnitude slower** than the development of language agents. We propose a new framework where simulation performance **strongly correlates** with the real world (r > 0.9), even for deformable objects. The key difference from existing work lies in the correlation between simulation and reality: if a robot model performs better in the digital world, does it also perform better in the real world? This question has long made people hesitant about simulation-based evaluation — especially for deformable objects. We are changing that. Our pipeline achieves effective real-to-sim transfer, establishing **state-of-the-art correlation** between simulation and reality for deformable object manipulation. It provides a **scalable and reproducible evaluation engine** for robot learning. 🌐

Yunzhu Li

39,900 views • 8 months ago

Introducing Open-TeleVision: with Fully Autonomous policy video👇. We can carry out a long-horizon task, inserting 12 cans nonstop without any interruptions. We offer: 🤖 Highly precise and smooth bimanual manipulation. 📺 Active egocentric vision (with a moving neck) feedback. It is achieved by imitation learning from teleoperation: We propose a VR-based REAL-TIME teleoperation system that streams the stereo video observation from the robot camera to the VR device. The robot neck moves as the human head moves, the robot hands move as the human hands move, offering the operator an intuitive experience as the human herself becomes the robot. The devil is in the details, and in how to implement things right: ✅How to perform IK/retargeting for smooth and precise control. ✅How to do all this and also stream stereo video with no latency, all in real time. We released our code here: Active head hardware design: 1/n

Xiaolong Wang

25,572 views • 2 years ago

Robot policies must be both reliable and highly capable to be useful; the best way to achieve this level of performance is with reinforcement learning. However, for reinforcement learning you are usually stuck between two difficult options: reinforcement in the real world is often risky and expensive, while reinforcement learning in a traditional simulator takes a lot of engineering work and has a persistent sim-to-real gap. What if instead you could train your robot purely in a world model? RISE by Jiazhi Yang et al. uses a compositional world model to predict the future and evaluate progress. This allows for a self-improving pipeline, which learns a world model from real data and then learns how the robot should perform different tasks. This pipeline results in a data-driven way to improve policy performance from real data but without real-world reinforcement learning. Watch Episode #86 of RoboPapers, with Chris Paxton and Jiafei Duan, to learn more!

RoboPapers

38,334 views • 1 month ago

Modern AI is confined to the digital world. At Skild AI, we are building towards AGI for the real world, unconstrained by robot type or task — a single, omni-bodied brain. Today, we are sharing our journey, starting with early milestones, with more to come in the weeks ahead. Our Mission: Artificial General Intelligence grounded in the physical world. We believe AGI that can truly understand and reason in the real world can only be built through grounding in the physical world. Our Vision: Any robot, Any task, One brain. We tackle robotics in its full generality – building a continually improving, omni-bodied brain that can control any hardware for any task. Who are we? A passionate group of scientists & engineers driven by our shared vision. We have been researching AI and robotics for more than a decade. Our team includes pioneers of self-supervised learning, curiosity-driven exploration, end-to-end sim2real for visual locomotion, dexterous manipulation, learning from human videos, robot parkour, and many more. Many of these works have won awards at top-tier AI and Robotics conferences. Our team has also built production-ready systems at Anduril, Tesla, Nvidia, Meta, Kitty Hawk, Google, Everyday Robotics, and Amazon. Join us in our mission to build the robot brains of tomorrow.

Skild AI

382,615 views • 11 months ago

The World Model as NEO's Cognitive Core 1X has revealed a major AI development where the NEO humanoid can translate any natural language prompt into robotic action. It demonstrates this capability even for novel tasks, objects, and environments not found in its robot dataset. - the 1X World Model is trained on internet-scale human interaction videos and fine-tuned with robot data to ground its understanding in physics and in NEO's embodiment - from a simple voice or text prompt, the world model generates a visualization of future actions - a built-in inverse dynamics model then translates these into precise motor movements for NEO

The Humanoid Hub

68,453 views • 6 months ago

We RL'ed humanoid robots to Cristiano Ronaldo, LeBron James, and Kobe Bryant! These are neural nets running on real hardware at our GEAR lab. Most robot demos you see online speed videos up. We actually *slow them down* so you can enjoy the fluid motions. I'm excited to announce "ASAP", a "real2sim2real" model that masters extremely smooth and dynamic motions for humanoid whole body control. We pretrain the robot in simulation first, but there is a notorious "sim2real" gap: it's very difficult for hand-engineered physics equations to match real world dynamics. Our fix is simple: just deploy a pretrained policy on real hardware, collect data, and replay the motion in sim. The replay will obviously have many errors, but that gives a rich signal to compensate for the physics discrepancy. Use another neural net to learn the delta. Basically, we "patch up" a traditional physics engine, so that the robot can experience almost the real world at scale in GPUs. The future is hybrid simulation: combine the power of classical sim engines refined over decades and the uncanny ability of modern NNs to capture a messy world.
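The "learn the delta" idea from the post (replay real transitions in the simulator, then fit a correction to the mismatch) can be illustrated with a toy example. This is a sketch under heavy assumptions and not the ASAP method itself: the dynamics are one-dimensional, the "real" system is a made-up function with extra drag, and a one-parameter least-squares fit is swapped in for the neural net. All function names are hypothetical.

```python
from typing import List, Sequence, Tuple

DT = 0.1  # integration step shared by the toy simulator and the toy "real" system

Transition = Tuple[float, float, float]  # (state, control, next state)


def sim_step(x: float, u: float) -> float:
    """Idealized simulator: the state simply integrates the control."""
    return x + u * DT


def real_step(x: float, u: float) -> float:
    """Made-up "real" dynamics with a drag term that the simulator does not model."""
    return x + u * DT - 0.3 * x * DT


def collect_real_data(x0: float, controls: Sequence[float]) -> List[Transition]:
    """Roll out the "real" system and record each transition."""
    data: List[Transition] = []
    x = x0
    for u in controls:
        x_next = real_step(x, u)
        data.append((x, u, x_next))
        x = x_next
    return data


def fit_delta_gain(data: Sequence[Transition]) -> float:
    """Least-squares fit of k in: real_next - sim_next ≈ k * x (replaying data in sim)."""
    numerator = sum(x * (x_next - sim_step(x, u)) for x, u, x_next in data)
    denominator = sum(x * x for x, _, _ in data)
    if denominator == 0:
        raise ValueError("need at least one transition with a non-zero state")
    return numerator / denominator


def patched_sim_step(x: float, u: float, k: float) -> float:
    """Simulator plus the learned correction term."""
    return sim_step(x, u) + k * x


data = collect_real_data(1.0, [0.5, -0.2, 0.1, 0.0, 0.3])
k = fit_delta_gain(data)
```

After fitting, `patched_sim_step` reproduces the toy "real" dynamics on states it was not fitted on, which is the point of patching a classical simulator with a learned residual instead of replacing it.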

Jim Fan

566,544 views • 1 year ago

I don’t know if we live in a Matrix, but I know for sure that robots will spend most of their lives in simulation. Let machines train machines. I’m excited to introduce DexMimicGen, a massive-scale synthetic data generator that enables a humanoid robot to learn complex skills from only a handful of human demonstrations. Yes, as few as 5! DexMimicGen addresses the biggest pain point in robotics: where do we get data? Unlike with LLMs, where vast amounts of texts are readily available, you cannot simply download motor control signals from the internet. So researchers teleoperate the robots to collect motion data via XR headsets. They have to repeat the same skill over and over and over again, because neural nets are data hungry. This is a very slow and uncomfortable process. At NVIDIA, we believe the majority of high-quality tokens for robot foundation models will come from simulation. What DexMimicGen does is to trade GPU compute time for human time. It takes one motion trajectory from human, and multiplies into 1000s of new trajectories. A robot brain trained on this augmented dataset will generalize far better in the real world. Think of DexMimicGen as a learning signal amplifier. It maps a small dataset to a large (de facto infinite) dataset, using physics simulation in the loop. In this way, we free humans from babysitting the bots all day. The future of robot data is generative. The future of the entire robot learning pipeline will also be generative. 🧵

Jim Fan

165,246 views • 1 year ago

What if robots could learn real-world tasks from your perspective… without ever touching a robot? This is a system that trains robot policies using nothing but human-first, egocentric video data from smart glasses. No robots, no teleop, no sensors, just humans doing real tasks in the real world. Why it matters ✅ Learns robot policies from 20 minutes of human video; zero robot demos ✅ Generalizes to new objects, views, and even robot morphologies ✅ Uses 3D points for interpretable, spatially grounded learning ✅ Deploys directly to real-world robots with strong zero-shot success Thank you, Vincent Liu, for sharing!!! Learn more here: 🔗 Paper: 🌐 Website: 📍 BOOKMARK FOR LATER

Ilir Aliu - eu/acc

10,509 views • 1 year ago

Demonstrating our latest localization system in action. At bootup, the robot has no prior position estimate, but within seconds it autonomously localizes itself using a fusion of three algorithms. Unlike our previous version, this updated system incorporates vision, allowing the robot to adapt to real-world changes (e.g., moved furniture) by recognizing previously seen environments. In this demo, we repeatedly reset navigation and reposition the robot to random locations, showing robust, repeatable localization. The robot then executes a full patrol, following a planned path (visualized in RViz) with real-time path tracking. Because our software is hardware-agnostic, it brings the same reliable performance to any robot it runs on.

OpenMind

23,835 views • 2 months ago

Introducing WorldForge: testable world-model workflows for physical AI systems. You can think of it, loosely, as “LangChain for world models”. The problem is that “world model” has become an overloaded label. Depending on context, it can mean a video generator, a cost model, a robot policy, a JEPA-style latent predictor, etc. These share almost nothing: they have different inputs, runtimes, and failure modes. I built WorldForge to stop pretending they're interchangeable. Front-door demo: a real Hugging Face LeRobot diffusion_pusht policy combined with a LeWorldModel (LeWM) checkpoint by Lucas Maes for scoring. Both run locally on my MacBook in the demo video. LeWM is extremely efficient (~15M params), can plan up to 48× faster, and runs on commodity hardware. WorldForge wires the loop: policy → candidates → score → select. Replay happens in a local TUI today, but the same loop could drive a real robot in the world. Would love feedback from people working on world models, physical AI, robotics, ML infra, and adjacent tooling. Fully open source. Contributions very welcome. Plan in the dream, replay in the real world.

abdel

17,001 views • 2 months ago
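
The "policy → candidates → score → select" loop described in the WorldForge post can be sketched in a few lines. This is a minimal illustration only: it does not use WorldForge, LeRobot, or LeWM, and every name in it (`propose_candidates`, `select_best`, `plan_step`, `toy_policy`, `toy_score`) is hypothetical. The toy policy and scorer stand in for a diffusion policy and a world-model checkpoint.

```python
import random
from typing import Callable, List, Tuple

Action = Tuple[float, float]
Observation = Tuple[float, float]


def propose_candidates(policy: Callable[[Observation, random.Random], Action],
                       observation: Observation,
                       num_candidates: int,
                       rng: random.Random) -> List[Action]:
    """Sample several candidate actions from a stochastic policy."""
    return [policy(observation, rng) for _ in range(num_candidates)]


def select_best(candidates: List[Action],
                score: Callable[[Action], float]) -> Action:
    """Return the candidate with the highest score."""
    return max(candidates, key=score)


def plan_step(policy, score_fn, observation, num_candidates=8, seed=0) -> Action:
    """One pass of policy -> candidates -> score -> select."""
    rng = random.Random(seed)
    candidates = propose_candidates(policy, observation, num_candidates, rng)
    return select_best(candidates, lambda a: score_fn(observation, a))


# Toy stand-ins (assumptions, not real models).
GOAL = (1.0, 1.0)


def toy_policy(observation: Observation, rng: random.Random) -> Action:
    """Propose a random move near the current position."""
    x, y = observation
    return (x + rng.uniform(-1.0, 1.0), y + rng.uniform(-1.0, 1.0))


def toy_score(observation: Observation, action: Action) -> float:
    """Higher is better: negative squared distance from the action to GOAL."""
    return -((action[0] - GOAL[0]) ** 2 + (action[1] - GOAL[1]) ** 2)


if __name__ == "__main__":
    best = plan_step(toy_policy, toy_score, (0.0, 0.0), num_candidates=16)
    print("selected action:", best)
```

Keeping the policy and the scorer as plain callables is the point of the sketch: either side can be swapped for a different model without touching the selection loop.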

It’s long been a dream of roboticists to be able to teach a robot in simulation so as to skip the long and expensive process of collecting large amounts of real-world training data. However, building simulations for robot tasks is extremely hard. Ideally, we could go from real data to a useful simulation. This is exactly what Guangqi Jiang and his co-authors do. They use 3D Gaussian splatting to reconstruct scenes, which lets them create interactive environments that, when combined with a physics engine, allow for training robot policies that show zero-shot sim-to-real transfer (i.e., using no real-world demonstrations). To learn more, watch Episode 56 of Robopapers with Michael Cho - Rbt/Acc and Chris Paxton now!

RoboPapers

20,434 views • 7 months ago

NVIDIA just announced EgoScale 🤖🧠 NVIDIA Research has uncovered a log-linear scaling law for robot dexterity by pretraining VLA models on over 20,000 hours of egocentric human video. This massive dataset is 20 times larger than previous efforts and proves that robot intelligence follows a predictable path: the more human data, the lower the loss. The secret is a simple recipe combining large-scale human pretraining with a small amount of aligned human-robot mid-training to bridge the gap. In testing, this method boosted the average success rate by 54% on a 22-DoF robotic hand compared to policies built without pretraining. EgoScale also enables one-shot task adaptation and works across different hardware, suggesting that human motion is a universal motor prior for robots. Website: Paper: Source: NVIDIA Research #Robot #Humanoid #Robotics #AI #EmbodiedAI #PhysicalAI #NVIDIA #EgoScale #GR00T

RoboHub🤖

43,752 views • 4 months ago
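
Read literally, a "log-linear scaling law" means the loss falls linearly in the logarithm of the amount of human video. One hedged way to write that down is shown here; the symbols $L$, $D$, $\alpha$, and $\beta$ are assumptions for illustration, and the post gives neither the fitted form nor any coefficients.

```latex
% Assumed form: L = loss, D = hours of egocentric human video,
% \alpha and \beta > 0 are fitted constants (not given in the post).
L(D) \approx \alpha - \beta \log D
\quad\Longrightarrow\quad
L(2D) - L(D) \approx -\beta \log 2
```

Under this form, every doubling of the data lowers the loss by the same fixed amount, whatever the starting dataset size.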

Exciting updates on Project GR00T! We discovered a systematic way to scale up robot data, tackling the most painful pain point in robotics. The idea is simple: a human collects demonstrations on a real robot, and we multiply that data 1000x or more in simulation. Let’s break it down:
1. We use Apple Vision Pro (yes!!) to give the human operator first-person control of the humanoid. Vision Pro parses the human hand pose and retargets the motion to the robot hand, all in real time. From the human’s point of view, they are immersed in another body, like in Avatar. Teleoperation is slow and time-consuming, but we can afford to collect a small amount of data.
2. We use RoboCasa, a generative simulation framework, to multiply the demonstration data by varying the visual appearance and layout of the environment. In Jensen’s keynote video below, the humanoid is now placing the cup in hundreds of kitchens with a huge diversity of textures, furniture, and object placement. We only have 1 physical kitchen at the GEAR Lab in NVIDIA HQ, but we can conjure up infinite ones in simulation.
3. Finally, we apply MimicGen, a technique to multiply the above data even more by varying the motion of the robot. MimicGen generates a vast number of new action trajectories based on the original human data, and filters out failed ones (e.g. those that drop the cup) to form a much larger dataset.
To sum up, given 1 human trajectory with Vision Pro -> RoboCasa produces N (varying visuals) -> MimicGen further augments to NxM (varying motions). This is the way to trade compute for expensive human data by GPU-accelerated simulation. A while ago, I mentioned that teleoperation is fundamentally not scalable, because we are always limited by 24 hrs/robot/day in the world of atoms. Our new GR00T synthetic data pipeline breaks this barrier in the world of bits. Scaling has been so much fun for LLMs, and it's finally our turn to have fun in robotics!
We are building tools to enable everyone in the ecosystem to scale up with us. Links in thread:

Jim Fan

364,380 views • 2 years ago
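
The "1 trajectory -> N visual variants -> NxM motion variants, then filter failures" arithmetic from the GR00T post can be sketched as a toy. This is not RoboCasa or MimicGen code: the `Trajectory` type and the functions `vary_visuals`, `vary_motion`, `is_success`, and `multiply_demo` are hypothetical, and a simple distance check stands in for the physics-simulation success filter.

```python
import random
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class Trajectory:
    scene_id: int                 # which visual/layout variant this runs in
    waypoints: Tuple[float, ...]  # 1-D positions standing in for robot motion


def vary_visuals(demo: Trajectory, n: int) -> List[Trajectory]:
    """Place the same motion in n different scenes (visual variation)."""
    return [replace(demo, scene_id=i) for i in range(n)]


def vary_motion(traj: Trajectory, m: int, rng: random.Random,
                noise: float = 0.05) -> List[Trajectory]:
    """Create m perturbed copies of the motion (motion variation)."""
    return [
        replace(traj, waypoints=tuple(w + rng.gauss(0.0, noise)
                                      for w in traj.waypoints))
        for _ in range(m)
    ]


def is_success(traj: Trajectory, goal: float, tolerance: float) -> bool:
    """Toy success check: the final waypoint must end near the goal."""
    return abs(traj.waypoints[-1] - goal) <= tolerance


def multiply_demo(demo: Trajectory, n: int, m: int, goal: float,
                  tolerance: float, seed: int = 0) -> List[Trajectory]:
    """Expand one demo into up to n * m trajectories, dropping failures."""
    rng = random.Random(seed)
    generated = [variant
                 for scene in vary_visuals(demo, n)
                 for variant in vary_motion(scene, m, rng)]
    return [t for t in generated if is_success(t, goal, tolerance)]


if __name__ == "__main__":
    demo = Trajectory(scene_id=0, waypoints=(0.0, 0.5, 1.0))
    kept = multiply_demo(demo, n=100, m=10, goal=1.0, tolerance=0.05)
    print(f"kept {len(kept)} of {100 * 10} generated trajectories")
```

The filter is what keeps the multiplication honest: the dataset grows to at most NxM, and only the variants that still complete the task are kept.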

Tesla's Chief Designer Franz on Optimus: "It's going to be a huge piece of Tesla, a huge piece in our world, as we look into the future, the ability to take menial tasks and things that you don't necessarily want to do and have them uploaded to somebody, a human, that can do that. We've been designing the robot that really looks at how humans work in the world. This world is designed around humans, and so I think having a robot that can do the things that humans can do can really start to take those menial tasks away and enrich our lives, make our lives better; we get to spend more time doing the things that we love. It's been incredibly fast paced, but the progress is, like, unbelievable in the short amount of time that the team has been working on it, and we've been growing the team and building the project at the same time. I think we're breaking ground in this space every day. The team is really energized and fired up. Elon Musk is incredibly involved in this. It's a great project to be a part of, and seeing where the future is gonna go and how it's going to enrich our lives is super exciting."

DogeDesigner

202,945 views • 9 months ago