Cosmos Policy turns a pretrained video diffusion model into a robot controller. Instead of redesigning the architecture, it injects robot state, actions, and values directly as latent frames inside the video model.

Robots Digest 🤖

5,086 subscribers

22,933 views • 6 months ago • via X (Twitter)
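The idea of carrying robot state, actions, and values as latent frames can be illustrated with a small sketch. This is not the Cosmos Policy implementation: the function names (`vector_to_latent_frame`, `latent_frame_to_vector`, `build_sequence`), the tiling scheme, and the latent shape `(T, C, H, W)` are all assumptions made for illustration. It only shows how a low-dimensional vector could be written into, and read back from, a frame-shaped tensor appended to the image latents.

```python
import numpy as np

def vector_to_latent_frame(vec, frame_shape):
    """Tile a 1-D vector until it fills one latent frame of shape (C, H, W).

    Hypothetical encoding chosen for this sketch, not the published method.
    """
    vec = np.asarray(vec, dtype=np.float32).reshape(-1)
    size = int(np.prod(frame_shape))
    if vec.size > size:
        raise ValueError("vector does not fit into one latent frame")
    reps = -(-size // vec.size)  # ceiling division
    return np.tile(vec, reps)[:size].reshape(frame_shape)

def latent_frame_to_vector(frame, dim):
    """Recover a dim-sized vector by averaging every complete copy in the frame."""
    flat = np.asarray(frame).reshape(-1)
    full = (flat.size // dim) * dim  # drop a trailing partial copy, if any
    return flat[:full].reshape(-1, dim).mean(axis=0)

def build_sequence(image_latents, state, action, value):
    """Append state, action, and value frames after the image latent frames."""
    frame_shape = image_latents.shape[1:]
    extra = [vector_to_latent_frame(v, frame_shape) for v in (state, action, [value])]
    return np.concatenate([image_latents, np.stack(extra)], axis=0)

# Toy usage: 2 image latent frames of shape (4, 3, 3), plus 3 injected frames.
image_latents = np.zeros((2, 4, 3, 3), dtype=np.float32)
sequence = build_sequence(image_latents, [0.1, 0.2, 0.3], [1.0, -1.0], 0.5)
decoded_action = latent_frame_to_vector(sequence[3], dim=2)
```

Because the extra frames have the same shape as the image latents, a sequence model that consumes latent frames could process them without any architectural change, which is the property the post highlights.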

Related Videos

We release Cosmos Policy 💫: a state-of-the-art robot policy built on a video diffusion model backbone. - policy + world model + value function — in 1 model - no architectural changes to the base video model - SOTA in LIBERO (98.5%), RoboCasa (67.1%), & ALOHA tasks (93.6%) 🧵👇

Moo Jin Kim

149,398 views • 6 months ago

An interactive world model developed by NVIDIA in collaboration with academic partners.
- DreamDojo turns egocentric human video data into physical intelligence.
- Human data is more scalable than robotics data but lacks action labels.
- To solve this, a dedicated action model extracts latent actions by identifying physics and motion deltas between frames.
Training
- A massive 44k hours of video data are used for pre-training.
- Post-training on small-scale robot datasets maps human physics to specific robot embodiments.
- An additional distillation stage converts the model into an autoregressive, few-step diffusion model, enabling real-time, action-controllable simulation.
Primary Use Cases
- Live Teleoperation: Controlling a robot inside a world simulation in real-time.
- Model-based Planning: Previewing and curating the best actions for improved success.
- Policy Evaluation: Testing robot policies in realistic, out-of-distribution scenarios.
Everything that's open-sourced: weights, code, post-training dataset, eval set, and details to reproduce.

The Humanoid Hub

11,575 views • 5 months ago

A humanoid robot policy trained solely on synthetic data generated by a world model. Research Scientist Joel Jang presents NVIDIA's DreamGen pipeline: ⦿ Post-train the world model Cosmos-Predict2 with a small set of real teleoperation demos. ⦿ Prompt the world model to generate synthetic video data with verbs and scenarios not used in the world model’s post-training. ⦿ Auto-label synthetic video data with action sequences. ⦿ Train robot policies using only synthetic data. That's it. Deploy zero-shot to a real humanoid robot.

The Humanoid Hub

20,968 views • 1 year ago

World Model meets robot policy! Robbyant's LingBot-VA: unifies video world modeling and robotic policy learning. - A single model generates both future video and the actions to make it real. - Long-term memory enables long-horizon tasks. - Claims significant outperformance over π₀.₅ in real-world tasks. - It's open-source

The Humanoid Hub

17,721 views • 5 months ago

Diffusion has shown great promise for generating robot **actions**, can it act as a **world model** to generate the future conditioned on actions? In our work led by han qi Haocheng Yin and in collaboration with Yilun Du, we show a **controllable** action-conditioned video diffusion model can produce photorealistic and (near) physics-accurate future predictions. This ability strengthens the policy via: - ranking different action proposals and selecting the best, or - **visual** trajectory optimization by optimizing the action proposals using gradient ascent. Learn more about Generative Predictive Control (GPC) at:

Heng Yang

38,428 views • 1 year ago
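The first option in the post above, ranking action proposals with a world model and keeping the best one, can be sketched in a few lines. This is a toy illustration, not the GPC code: `select_best_proposal`, `toy_predict_future`, and `toy_score` are hypothetical names, and the "world model" here is a simple additive point-mass stand-in rather than a video diffusion model.

```python
import numpy as np

def select_best_proposal(state, proposals, predict_future, score):
    """Score each action proposal by its predicted future; return the best one.

    predict_future and score are caller-supplied callables (assumptions of
    this sketch): predict_future(state, actions) -> predicted future,
    score(predicted future) -> float, higher is better.
    """
    scores = [score(predict_future(state, actions)) for actions in proposals]
    best_index = int(np.argmax(scores))
    return proposals[best_index], scores

# Toy stand-in for an action-conditioned world model: each action is a 2-D
# displacement, and the predicted future is the final position.
def toy_predict_future(state, actions):
    return state + np.sum(actions, axis=0)

GOAL = np.array([1.0, 1.0])

def toy_score(predicted_position):
    # Negative distance to the goal, so closer predictions rank higher.
    return -float(np.linalg.norm(predicted_position - GOAL))

state = np.zeros(2)
proposals = [
    np.array([[0.5, 0.5], [0.5, 0.5]]),   # ends exactly at the goal
    np.array([[1.0, 0.0], [0.0, 0.0]]),   # ends one unit away
    np.array([[-1.0, -1.0], [0.0, 0.0]]), # moves away from the goal
]
best, scores = select_best_proposal(state, proposals, toy_predict_future, toy_score)
```

In a real system the proposals would presumably come from a policy and the scoring would operate on predicted video frames; the selection step itself stays the same.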

What happens when robot world models learn from human experience at scale? 🤔 DreamDojo from NVIDIA Research is a generalist robot world model pretrained on 44K hours of egocentric human videos and then post-trained on robot data to generalize across new objects and environments. After distillation, it runs at 10 FPS for live teleoperation, policy evaluation, and model-based planning. Read the ICML paper to learn more 📄

NVIDIA Robotics

22,322 views • 18 days ago

Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models. The authors turn the publicly available, state-of-the-art text-to-image LDM Stable Diffusion into an efficient and expressive text-to-video model with resolution up to 1280 x 2048. abs: project page:

AK

718,760 views • 3 years ago

We introduce W.A.L.T, a diffusion model for photorealistic video generation. Our model is a transformer trained on image and video generation in a shared latent space. 🧵👇

Agrim Gupta

431,168 views • 2 years ago

Today mimic and friends are excited to share mimic-video, a new class of Video-Action Model that elevates video model backbones as first class citizens for robot learning!

Elvis Nava

87,118 views • 7 months ago

🚀 We’re excited to announce LingBot-VA, a new state-of-the-art robot policy model from Robbyant! LingBot-VA is built on a causal, autoregressive video-action world model for generalist robot control. Highlights: (1) First unified autoregressive video-action world model for robot control (2) Low-latency inference with a new asynchronous execution pipeline (3) SOTA on RoboTwin (92.9%, first ever > 90%) and LIBERO (98.5%) (4) +20% over π0.5 on challenging real-world long-horizon & high-precision tasks

Yinghao Xu

46,962 views • 5 months ago

Thrilled to announce Octo 🐙, an open-source robot foundation model! Octo is a sota generalist robot policy based on transformer+diffusion. Most importantly, you can finetune Octo *today* with flexible observation and action spaces on your robot setup!

Oier Mees

44,944 views • 2 years ago

We’re excited to share DiT4DiT, an end-to-end Video-Action Model for robot learning that unifies a video Diffusion Transformer and an action Diffusion Transformer in a single cascaded framework. By leveraging the rich spatiotemporal and physical dynamics learned through video generation, rather than static image-text priors, DiT4DiT achieves state-of-the-art results on LIBERO (98.6%) and RoboCasa GR1 (50.8%) with far less training data, delivering over 10× better sample efficiency and up to 7× faster convergence. Real-world deployment on a humanoid robot further shows robust generalization. We believe this is a step toward making video generation a powerful backbone for robot policy learning. This work builds upon the brilliant foundations laid by Nvidia's GR00T and Cosmos. Project: Paper: Code: Coming soon. In the meantime, you can ask your coding agent to reproduce the method based on GR00T/Cosmos.

Shuo Yang

31,596 views • 4 months ago

This looks like another step toward removing the data bottleneck in robotics. The bottleneck exists because recording demonstrations is easy, but labeling every grasp, hold, move, and release is slow and expensive. Perceptron Egocentric dropped a new robotics video annotation system. You give it raw robot-camera or first-person/egocentric video. It gives back structured labels that look more like training data: where each small manipulation starts and ends, what the subtask is, what each hand is doing, where the hands are frame by frame, and which hand is left or right. So, Input: raw video of a person or robot doing a task. Output: a machine-readable breakdown of the physical actions inside that video. Basically, it is a machine that converts robot video into policy-training supervision. That means timestamps, subtask boundaries, per-hand actions, and left-right hand grounding. So this will turn large piles of robot or egocentric video into supervision for training robot policies.

Rohan Paul

18,391 views • 15 days ago

PaLM-E or GPT-4 can speak in many languages and understand images. What if they could speak robot actions? Introducing RT-2: our new model that uses a VLM (up to 55B params) backbone and fine-tunes it to directly output robot actions!

Karol Hausman

182,789 views • 3 years ago

This system uses a KUKA robot to create intricate, 3D designs in cocktails. The robot "injects" microliter drops of edible liquid into a cocktail, instead of building up objects layer by layer, as with a normal 3D printer. Video Credit: KUKA #engineering #technology #3dprinting #additivemanufacturing ----------------------- Wanna get your company on Wevolver too? Learn how:

Wevolver

353,664 views • 10 months ago

Six months ago, on the first-ever episode of The Humanoid Hub, Eric Jang discussed World Models as something that wasn't being taken seriously as the core of AI systems. Today, 1X introduced a robotics policy that turns that vision into reality, converting video generation from World Models into real-world robot actions.

The Humanoid Hub

30,933 views • 6 months ago

GameFactory: Creating New Games with Generative Interactive Videos. The authors present GameFactory, a generalizable world model that learns from a small-scale dataset of Minecraft game videos. By leveraging the prior knowledge of a pretrained video diffusion model, it can create new games in an open domain.

AK

70,029 views • 1 year ago

Tired of teleoperating your robots? We built a way to scale robot datasets without teleop, dynamic simulation, or even robot hardware. Just one smartphone scan + one human hand demo video → thousands of diverse robot trajectories. Trainable by diffusion policy and VLA models as-is. Introducing: Real2Render2Real 👉

Max Fu

69,371 views • 1 year ago

This is THE moment of Physical AI! We are officially announcing Cosmos 3: Omnimodal World Models for Physical AI 🚀
- Cosmos 3 is an omnimodal world model: within a unified architecture, it can understand and generate language, images, video, audio, and actions.
- It is not just a VLM, not just a video generator, not just an audio-visual generative model, and not just a physics simulator / world-action model. It can understand images and videos, generate images, videos, and audio, simulate future worlds, predict actions, and generate robot policies—enabling models to truly begin to “touch the world.”
- Cosmos 3 is the #1 open-weight reasoner / T2I / I2V / robot policy across many benchmarks.
Huge thanks to every teammate who fought side by side on this journey—from architecture, data, training, infra, serving, and evaluation to post-training. Every part of this project carries an incredible amount of hard work. This was my first time leading a project as Tech Lead, and I feel truly fortunate. The future of Physical AI needs models that can not only “see” and “describe” the world, but also “imagine,” “simulate,” and “act”—and eventually close the loop with the real world. I hope Cosmos 3 can become an important starting point for this direction, and I’m excited to push Physical AI into its next stage together with the open-source community. Welcome to the era of Physical AI. HuggingFace: Project Website: Code:

Max Zhaoshuo Li 李赵硕

1,078,049 views • 1 month ago

Inception Labs just killed the transformer. They released Mercury 2, the world's first "diffusion" reasoning model. It's fast, and it uses a completely new model architecture... just watch this 11 min video to find out more:

David Ondrej

45,692 views • 4 months ago