🤔Want a principled way to RL your diffusion model? Check Data-regularized Reinforcement Learning (DDRL)! Post-train NVIDIA #Cosmos World Foundation models with a million GPU hours! 🤯 Novel formulation ➡️ Theoretically integrates SFT into RL ➡️ Robust to Reward Hacking 🛑 Details: #DDRL #Diffusion #RL #NVIDIA #Cosmos

Haotian Ye

1,084 subscribers

77,923 views • 7 months ago • via X (Twitter)

Related videos

NVIDIA just introduced Cosmos, a platform for world foundation models designed for robotics. ⦿ It features advanced tokenizers, an AI-accelerated data pipeline, and integration with NVIDIA Omniverse. Humanoid makers 1X, Figure, and Agility are among the first to adopt Cosmos. ⦿ Cosmos generates synthetic, physics-based data, accelerating model training and customization. ⦿ It also features a CUDA-accelerated data processing pipeline that enables developers to process, curate, and label 20 million hours of videos in 14 days using the NVIDIA Blackwell platform.

The Humanoid Hub

129,383 views • 1 year ago

Cosmos Policy just dropped for robotics. 🤖 Cutting edge research is turning a world foundation model into a unified robot brain that can see, predict, and act—no extra action heads, no complicated control stack. Read our blog on Hugging Face ➡️ Want to get hands-on with Cosmos (Reason, Predict, Policy, Cookbook)? Join the Cosmos Cookoff, sponsored by Nebius and Milestone Systems ➡️

NVIDIA Robotics

14,344 views • 6 months ago

So we did a bunch of projects with real-world reinforcement learning, but it was often too inefficient to be practical to train tabula rasa. This suggests we need better priors, but acquiring these from on-robot data can often be expensive as well. In our recent work, we show that despite being fundamentally inaccurate, simulation can provide a cheap way to guide real-world RL finetuning to be super efficient! We propose Simulation-Guided Fine-Tuning (SGFT), a simple paradigm for sim2real finetuning that uses simulation to provide reward shaping that accelerates real-world RL finetuning *beyond* just providing an initialization. TLDR: Use value functions from sim to shape rewards for real-world RL, see large sample efficiency improvements 🧵(1/6)

Abhishek Gupta

13,637 views • 1 year ago
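
The TLDR of the SGFT post above (use value functions from simulation to shape rewards for real-world RL) can be illustrated with classic potential-based reward shaping. This is a minimal sketch under stated assumptions, not the authors' implementation: the post does not give the exact shaping formula, and `shaped_reward`, `sim_value`, and `gamma` are hypothetical names chosen for illustration.

```python
from typing import Callable, Hashable

State = Hashable


def shaped_reward(
    reward: float,
    state: State,
    next_state: State,
    sim_value: Callable[[State], float],
    gamma: float = 0.99,
    done: bool = False,
) -> float:
    """Potential-based shaping: r + gamma * V_sim(s') - V_sim(s).

    `sim_value` stands in for a value function learned in simulation
    (an assumption for this sketch). The potential of a terminal state
    is taken to be zero.
    """
    next_potential = 0.0 if done else sim_value(next_state)
    return reward + gamma * next_potential - sim_value(state)


# Toy value table: states closer to the goal have higher simulated value.
toy_values = {0: 0.0, 1: 1.0, 2: 2.0}
bonus = shaped_reward(0.0, 0, 1, toy_values.__getitem__, gamma=0.5)
```

With this form, a real-world transition that moves toward states the simulator rates highly gets a positive bonus even when the task reward is still zero, which is one way a sim value function could speed up real-world learning.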

In case you missed it, we recently launched "Post-training of LLMs," a short course where you'll: ✅ Understand when and why to use post-training methods like Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Online Reinforcement Learning. ✅ Learn the concepts underlying the three post-training methods of SFT, DPO, and Online RL, their common use-cases, and how to curate high-quality data to effectively train a model using each method. ✅ Download a pre-trained model and implement post-training pipelines to turn a base model into an instruct model, change the identity of a chat assistant, and improve a model’s math capabilities. Learn more and enroll for free:

DeepLearning.AI

16,779 views • 1 year ago

The network for machine intelligence
Two years ago, we laid out our vision for a machine learning compute protocol. One that connects every device in the world into an open network for machine intelligence, with no gatekeepers or artificial boundaries. This week, we’ll be sharing some of our early progress, beginning with RL Swarm, a peer-to-peer system for collaborative reinforcement learning over the internet. Next month, we’ll open our Testnet, allowing anyone to contribute to the frontier of open machine intelligence.
Introducing RL Swarm
RL Swarm is a fully open source system for collaborative reinforcement learning over the internet. It is a live demo of our research findings, which show that models training with RL learn faster when they train as a collective swarm than they do on their own. Join our swarm now to see this in practice. You can participate with consumer hardware at home or a powerful GPU in the cloud. You can follow along with the swarm’s progress by following the links below.

gensyn

228,892 views • 1 year ago

I like to think of evals as something active, not passive -- it's a North Star that steers LLMs toward higher intelligence. Evals should drive your RL/SFT/post-training decisions. Internal evals at frontier labs make a huge difference -- and you can see it in how models behave differently (GPT seems better at one-shot tasks, Claude at multi-turn). If you want to learn more about building evals that actually improve your model in post-training, check out our AMD x DeeplearningAI course "Fine-tuning & RL for LLMs: Intro to Post-training" (content is free):

Sharon Zhou

16,836 views • 5 months ago

🚨 RL for LLMs is finally accessible.
Introducing OpenTinker: the first community-driven, open-source framework designed to democratize Reinforcement Learning for LLMs. Inspired by Thinking Machines's amazing Tinker, we realized that the biggest bottleneck in agentic LLM research isn’t the math; it’s the setup. Current RL pipelines are messy. Configuring VeRL for every single experiment is a productivity killer. OpenTinker fixed it.
🛠 How OpenTinker Works: Decoupled Design of Server and Client
- Setup Once, Run Forever: Configure the OpenTinker backend on your GPU cluster once.
- Develop Locally: Define your RL environments directly on your laptop.
- Train on the Cloud: Simply point your local client to the backend. The cluster handles the compute; you handle the science.
📉 The 10x Development Efficiency
Thanks to our elegant architectural decomposition, OpenTinker reduces the time to develop a new RL training pipeline by at least an order of magnitude.
⚡ Turn Idle GPU Compute into Gold
Small labs often have underutilized hardware. OpenTinker turns your idle GPUs into an internal/external API service for:
- RL Training
- SFT
- Inference
🎯 Who needs OpenTinker?
- Researchers tired of infrastructure hell.
- Labs needing to standardize workflows.
- Teams wanting to maximize hardware ROI.
Thanks to my amazing PhD student Siqi Zhu for leading the project. We are building the future of open RL infra. Be the first to build with us. 👇
Start Building with OpenTinker Now 🚀 Repo: 🌐 Blog:
If you believe RL should be accessible to everyone, give us a star, repost this 🔄 post, and let us know what agents you plan to build!

Jiaxuan You

58,224 views • 7 months ago

The power of generative models, now embodied in humanoids.
Announcing DreamControl. After a year-long research effort at General Robotics, we present a scalable framework for whole-body humanoid control that fuses diffusion priors with reinforcement learning to unlock real-world scene interaction.
Diffusion + RL → natural whole-body skills on real robots.
DreamControl enables humanoids to move beyond locomotion demos → performing natural, human-like skills such as:
- Picking & lifting objects
- Opening drawers & doors
- Precise punching, kicking, and jumping
- Bimanual manipulation tasks
Our key innovation: a diffusion prior over human motion that guides RL, eliminating the need for massive teleoperation datasets, and producing motions that look human while transferring to real hardware.
Trained purely in simulation, deployed on the Unitree G1 humanoid, DreamControl policies run in real time, bridging sim-to-real with unprecedented naturalness.
We leverage a novel hybrid edge + cloud infrastructure that runs RL-trained policies on the edge, backed by powerful AI models running in the cloud.
This is the next step in General Robotics’ journey toward general-purpose humanoid assistants that interact, adapt, and assist autonomously.
Paper: Blog: 1/n

Ashish Kapoor

118,133 views • 10 months ago

New Course: Post-training of LLMs
Learn to post-train and customize an LLM in this short course, taught by Banghua Zhu, Assistant Professor at the University of Washington and co-founder of @NexusflowX.
Training an LLM to follow instructions or answer questions has two key stages: pre-training and post-training. In pre-training, it learns to predict the next word or token from large amounts of unlabeled text. In post-training, it learns useful behaviors such as following instructions, tool use, and reasoning.
Post-training transforms a general-purpose token predictor (trained on trillions of unlabeled text tokens) into an assistant that follows instructions and performs specific tasks. Because it is much cheaper than pre-training, it is practical for many more teams to incorporate post-training methods into their workflows than pre-training.
In this course, you’ll learn three common post-training methods, Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Online Reinforcement Learning (RL), and how to use each one effectively. With SFT, you train the model on pairs of input and ideal output responses. With DPO, you provide both a preferred (chosen) and a less preferred (rejected) response and train the model to favor the preferred output. With RL, the model generates an output, receives a reward score based on human or automated feedback, and updates the model to improve performance.
You’ll learn the basic concepts, common use cases, and principles for curating high-quality data for effective training. Through hands-on labs, you’ll download a pre-trained model from Hugging Face and post-train it using SFT, DPO, and RL to see how each technique shapes model behavior.
In detail, you’ll:
- Understand what post-training is, when to use it, and how it differs from pre-training.
- Build an SFT pipeline to turn a base model into an instruct model.
- Explore how DPO reshapes behavior by minimizing contrastive loss, penalizing poor responses and reinforcing preferred ones.
- Implement a DPO pipeline to change the identity of a chat assistant.
- Learn online RL methods such as Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), and how to design reward functions.
- Train a model with GRPO to improve its math capabilities using a verifiable reward.
Post-training is one of the most rapidly developing areas of LLM training. Whether you’re building a high-accuracy context-specific assistant, fine-tuning a model's tone, or improving task-specific accuracy, this course will give you experience with the most important techniques shaping how LLMs are post-trained today.
Please sign up here:

Andrew Ng

125,146 views • 1 year ago
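
The course post above describes DPO as training the model to favor a chosen response over a rejected one by minimizing a contrastive loss. A minimal sketch of the commonly published per-example DPO loss follows. It is not taken from the course materials: the function name `dpo_loss`, its arguments, and the default `beta=0.1` are assumptions for illustration, and the inputs are assumed to be summed log-probabilities of each full response under the trained policy and a frozen reference model.

```python
import math


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Per-example DPO loss: -log sigmoid(beta * (chosen gain - rejected gain)).

    Each gain is how much the policy raised a response's log-probability
    relative to the frozen reference model.
    """
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(x)) equals softplus(-x).
    return softplus(-logits)


# When the policy equals the reference, the loss is log(2); it falls as the
# policy shifts probability toward the chosen response.
baseline = dpo_loss(-1.0, -2.0, -1.0, -2.0)
improved = dpo_loss(-0.5, -3.0, -1.0, -2.0)
```

In a real pipeline the same expression would be averaged over a batch and differentiated with an autodiff framework; plain floats are used here only to keep the sketch dependency-free.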

Explore where physical AI is headed next with the people shaping it. On Feb. 10 at 9 a.m. PT, join Ming‑Yu, Vice President of Research at NVIDIA, and Chubby♨️ from Superintelligence, for a conversation on how NVIDIA Cosmos reframes AI for a world of robots, AVs, and vision-driven systems.
We'll dig into:
✅ Cosmos' origins and designing models for the physical world
✅ The scientific ideas and architectures behind Cosmos and its open models
✅ How open, reasoning-capable physical AI is changing how teams prototype, deploy, and iterate
✅ What a future of AI systems that understand space, time, and causality could unlock
If you’re thinking about real‑world AI, this is the conversation to catch ➡️

NVIDIA AI Developer

26,434 views • 5 months ago

Today's Training Data episode takes us BTS on the infrastructure challenges required to do large RL runs at scale, featuring Federico Cassano (Composer Lead at Cursor) and Dmytro Dzhulgakov (Co-Founder at Fireworks).
The Cursor team trained Composer 2 on Fireworks by starting with a strong base model (Kimi 2.5) and performing large-scale mid-training on code tokens and web data to learn common patterns and libraries, followed by a large-scale Reinforcement Learning run to learn how to navigate the Cursor harness, call tools, and write correct code.
Today's episode dives into the systems and infrastructure challenges of making that large RL run happen, and there were many (!!), from numerical mismatch to global distribution to synchronizing rollouts across asynchronous pipelines to keeping track of expert activation across runs and more. Extremely nerdy in-the-weeds challenges that Federico and Dima were delighted to nerd out on together :)
Beyond RL infra, we also discussed Online vs Simulated rollouts, self-summarization for long-horizon agents, environment design ("the most powerful RL environment is the product itself"), and other technical nuggets.
PS: We filmed this episode before the SpaceX news, while the Cursor team was still compute-constrained. While Cursor now has *all* the flops, the takeaways and hurdles crossed ring true for any serious application-level company that is racing to post-train their own models. I believe that more serious application companies will go the way of Cursor and post-train their own models.
00:00 Introduction
00:53 Why Cursor Trained Composer 2
04:55 Specialization vs Bitter Lesson
06:16 Composer 2 Training Recipe
16:32 Scaling RL Infrastructure Globally
23:32 Floating Point Drift
25:11 MoE Sensitivity Explained
26:25 Router Replay Fix
27:19 Real Time RL Loop
31:49 Long Horizon Agents
34:29 Why RL Everywhere
37:34 LLM as Judge Rewards
39:14 RL in Hard Domains
40:13 Build Your Own Environments
44:34 Closing Thoughts

Sonya Huang 🐥

79,834 views • 2 months ago

Ex-NVIDIA engineer who built Unsloth explained RL, kernels, reasoning, quantization, and agents in 2 hours 42 minutes - better than $5000 fine-tuning bootcamps. pick the base model -> write triton kernels for 2x faster fine-tune -> quantize to 4-bit -> run GRPO/DPO -> ship a reasoning model on your single GPU. That loop is why Unsloth is the default way to fine-tune Llama, Qwen, Gemma, and Phi on hardware you already own. Unsloth + Triton kernels + 4-bit quantization + GRPO/DPO + single-GPU fine-tuning - that's the stack. Watch and save it, then fine-tune your first model tonight.

h100envy

490,734 views • 21 days ago

We’re excited to share DiT4DiT, an end-to-end Video-Action Model for robot learning that unifies a video Diffusion Transformer and an action Diffusion Transformer in a single cascaded framework. By leveraging the rich spatiotemporal and physical dynamics learned through video generation, rather than static image-text priors, DiT4DiT achieves state-of-the-art results on LIBERO (98.6%) and RoboCasa GR1 (50.8%) with far less training data, delivering over 10× better sample efficiency and up to 7× faster convergence. Real-world deployment on a humanoid robot further shows robust generalization. We believe this is a step toward making video generation a powerful backbone for robot policy learning. This work builds upon the brilliant foundations laid by Nvidia's GR00T and Cosmos. Project: Paper: Code: Coming soon. In the meantime, you can ask your coding agent to reproduce the method based on GR00T/Cosmos.

Shuo Yang

31,596 views • 4 months ago

"One of the very confusing things about the models right now: how to reconcile the fact that they are doing so well on evals. And you look at the evals and you go, 'Those are pretty hard evals.' But the economic impact seems to be dramatically behind. There is [a possible] explanation. Back when people were doing pre-training, the question of what data to train on was answered, because that answer was everything. So you don't have to think if it's going to be this data or that data. When people do RL training, they say, 'Okay, we want to have this kind of RL training for this thing and that kind of RL training for that thing.' You say, 'Hey, I would love our model to do really well when we release it. I want the evals to look great. What would be RL training that could help on this task?' If you combine this with generalization of the models actually being inadequate, that has the potential to explain a lot of what we are seeing, this disconnect between eval performance and actual real-world performance"

Dwarkesh Patel

502,249 views • 8 months ago

Today, we're joined by Nikita Rudin, co-founder and CEO of Flexion, to discuss the gap between current robotic capabilities and what’s required to deploy fully autonomous robots in the real world. Nikita explains how reinforcement learning and simulation have driven rapid progress in robot locomotion, and why locomotion is still far from “solved.” We dig into the sim2real gap, and how adding visual inputs introduces noise and significantly complicates sim-to-real transfer. We also explore the debate between end-to-end models and modular approaches, and why separating locomotion, planning, and semantics remains a pragmatic approach today. Nikita also introduces the concept of "real-to-sim", which uses real-world data to refine simulation parameters for higher fidelity training, discusses how reinforcement learning, imitation learning, and teleoperation data are combined to train robust policies for both quadruped and humanoid robots, and introduces Flexion's hierarchical approach that utilizes pre-trained Vision-Language Models (VLMs) for high-level task orchestration with Vision-Language-Action (VLA) models and low-level whole-body trackers. Finally, Nikita shares the behind-the-scenes of humanoid robot demos, his take on reinforcement learning in simulation versus the real world, and the nuances of reward tuning, and offers practical advice for researchers and practitioners looking to get started in robotics today.
🗒️ For the full list of resources for this episode, visit the show notes page:
📖 CHAPTERS
00:00 - Introduction
04:07 - Is robot locomotion solved?
06:04 - Sim-to-real gap
08:58 - Adding semantics to policies
09:42 - Modular vs end-to-end architectures
10:29 - Planner model
12:21 - Adapting RL techniques from quadrupeds to humanoids
15:39 - Behind robot demos
18:09 - Humanoid robots in home environments
22:03 - Training approach
23:56 - VLA models
27:59 - Closing the sim-to-real gap
32:55 - Task orchestration using VLMs
36:38 - Tool use
38:10 - Model hierarchy
43:37 - Simulator versus simulation environment
44:57 - Combining imitation learning and reinforcement learning
46:42 - RL in real world versus RL in simulation
52:58 - Reward tuning and value functions in robotics
56:38 - Predictions
1:00:10 - Humanoids, quadrupeds, and wheeled platforms
1:02:45 - Advice, recommended robot kits, and community pla

The TWIML AI Podcast

22,582 views • 6 months ago

Why AI Progress Suddenly Feels Real - my conversation with Yann Dubois, who co-leads the Post-Training Frontiers team at OpenAI 00:00 - Intro 01:30 - Why recent AI progress feels like a step function 04:13 - Model reliability & the emotional rollercoaster of shipping GPT-5.5 07:33 - How OpenAI structures vertical and horizontal teams 09:49 - Improving model efficiency and test-time compute 12:32 - Yann's journey from Switzerland to OpenAI 15:37 - Reasoning in 2026: Real-world utility vs verifiable rewards 18:34 - GPT-5.5 Thinking vs Pro: Scaling test-time compute 20:09 - How reasoning models become more efficient 23:23 - Pre-training scaling and overcoming the data wall 27:03 - Multimodal data, synthetic data, and embodied AI 31:05 - Demystifying mid-training and post-training 37:21 - Does RL create new capabilities in AI? 38:53 - The challenges and frontier of scaling RL 43:09 - Is building AI models a craft or a strict science 48:21 - How AI models generalize across different domains 54:18 - How reinforcement learning cures AI hallucinations 56:04 - Negative generalization and conflicting instructions 58:05 - Can RL scale to law, medicine, and the broader economy? 1:00:19 - The evaluation bottleneck and Model as a Judge 1:04:21 - Continuous AI progress & continual learning 1:08:49 - Will foundation models eat the agent harness 1:11:23 - Why startups should focus on the last mile of AI

Matt Turck

100,966 views • 2 months ago

Let's reverse engineer Disney's adorable, lifelike robot! I couldn't find a whitepaper, but this is how I think it's trained:
1. The emotional behaviors are curated by Disney animation artists, keyframe by keyframe. But they cannot be "rendered" directly on the robot, because keyframes don't take into account the complex real-world physics.
2. Reinforcement learning (RL) is a great tool for training low-level robot controllers. RL needs a reward function to optimize, and it's typically a task reward (e.g. walk in a straight line as fast as possible). The problem is that RL doesn't know what counts as "natural behavior", and often produces weird-looking body postures that somehow still maximize the reward. This is a human alignment problem, just like with ChatGPT.
3. Enter Adversarial Motion Prior (AMP): a technique that learns the human preference by training a classifier on what we consider "emotional & cute". In the GAN literature, this is called a discriminator. Disney artists are good at creating such a dataset. You can then add AMP as an auxiliary reward in simulation to nudge the robot towards desired behaviors. AMP was developed by Peng et al. 2021 and Escontrela et al. 2022.
4. Add lots of data augmentation to make the controller robust to physical disturbances. In RL, it's called "domain randomization". This is a very powerful technique that bridges the gap between simulator and reality. Previously, OpenAI used domain randomization to train a 5-finger robot hand to manipulate a Rubik's Cube:
An IEEE news article gave hints about the pipeline:
Finally, praying for world peace 🙏. I hope robotics like this will bring more joy to the world.

Jim Fan

314,654 views • 2 years ago
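
Steps 3 and 4 of Jim Fan's post above (a discriminator score used as an auxiliary style reward, plus domain randomization) can be sketched roughly as follows. This is a toy illustration, not Disney's pipeline, which the post itself says is undocumented. The clipped quadratic mapping follows the style-reward form associated with the AMP paper (Peng et al. 2021) as best it can be reconstructed here. The equal weights, the `PhysicsParams` fields, and every range in `RANDOMIZATION_RANGES` are hypothetical values chosen for illustration.

```python
import random
from dataclasses import dataclass


def style_reward(disc_score: float) -> float:
    """Map a discriminator score to a style reward bounded to [0, 1].

    Assumption: a least-squares discriminator that outputs about +1 for
    artist-like motion and about -1 for policy-generated motion.
    """
    return max(0.0, 1.0 - 0.25 * (disc_score - 1.0) ** 2)


def total_reward(task_reward: float, disc_score: float,
                 w_task: float = 0.5, w_style: float = 0.5) -> float:
    """Weighted sum of the task reward and the auxiliary style reward."""
    return w_task * task_reward + w_style * style_reward(disc_score)


@dataclass
class PhysicsParams:
    """Hypothetical simulator parameters that get re-sampled every episode."""
    friction: float
    mass_scale: float
    motor_strength: float
    push_force_n: float


# Hypothetical randomization ranges; real ones depend on the robot and simulator.
RANDOMIZATION_RANGES = {
    "friction": (0.4, 1.2),
    "mass_scale": (0.8, 1.2),
    "motor_strength": (0.9, 1.1),
    "push_force_n": (0.0, 30.0),
}


def sample_physics(rng: random.Random) -> PhysicsParams:
    """Domain randomization: draw each parameter uniformly from its range."""
    return PhysicsParams(**{
        name: rng.uniform(low, high)
        for name, (low, high) in RANDOMIZATION_RANGES.items()
    })


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sketch is repeatable
    for episode in range(3):
        params = sample_physics(rng)
        reward = total_reward(task_reward=0.8, disc_score=0.5)
        print(episode, params, reward)
```

In a real training loop the discriminator score would come from a network trained alongside the policy, and the sampled parameters would be pushed into the simulator before each rollout. Both are left out here to keep the sketch self-contained.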

An interactive world model developed by NVIDIA in collaboration with academic partners.
- DreamDojo turns egocentric human video data into physical intelligence.
- Human data is more scalable than robotics data but lacks action labels.
- To solve this, a dedicated action model extracts latent actions by identifying physics and motion deltas between frames.
Training
- A massive 44k hours of video data are used for pre-training.
- Post-training on small-scale robot datasets maps human physics to specific robot embodiments.
- An additional distillation stage converts the model into an autoregressive, few-step diffusion model, enabling real-time, action-controllable simulation.
Primary Use Cases
- Live Teleoperation: Controlling a robot inside a world simulation in real-time.
- Model-based Planning: Previewing and curating the best actions for improved success.
- Policy Evaluation: Testing robot policies in realistic, out-of-distribution scenarios.
Everything that's open-sourced: weights, code, post-training dataset, eval set, and details to reproduce.

The Humanoid Hub

11,575 views • 5 months ago

Not the flashiest demos, but what's under the hood represents a foundational shift for general-purpose robotics. World models are the next-gen foundation of Physical AI, not the VLM backbones found in typical VLAs. DreamZero is a 14B-parameter World Action Model (WAM) by NVIDIA that treats robotics as a joint video-and-action prediction task. Unlike traditional Vision-Language-Action (VLA) models that map images directly to motor commands, DreamZero leverages a pretrained video diffusion backbone to predict future world states and actions simultaneously.
- achieves 2× better zero-shot generalization to unseen tasks and environments compared to state-of-the-art VLAs.
- learns effectively from heterogeneous, non-repetitive data (500 hours), breaking the need for thousands of repeated demonstrations.
- adapts to new robot embodiments with just 30 minutes of play data.
- enables 7Hz closed-loop control via system optimizations and "DreamZero-Flash," making high-capacity diffusion models viable for real-time use.

The Humanoid Hub

35,204 views • 5 months ago