👇 Introducing DPPO: Diffusion Policy Policy Optimization. DPPO optimizes a pre-trained Diffusion Policy using policy gradients from RL, showing surprising improvements over a variety of baselines across benchmarks and in sim2real transfer.

Allen Ren

2,179 subscribers

78,227 views • 1 year ago •via X (Twitter)

11 Comments

Allen Z. Ren • 1 year ago

DPPO implements Proximal Policy Optimization (PPO) by treating the denoising process as part of a “two-layer MDP”, which makes the policy gradients efficiently computable. We improve performance with better advantage estimation and a modified denoising schedule that balances exploration and stability.
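To make the "two-layer MDP" idea concrete, here is a minimal NumPy sketch under stated assumptions; it is not the authors' implementation. It assumes each denoising step is a Gaussian transition with a known mean and standard deviation, so its log-likelihood is cheap to evaluate, and it applies the standard PPO clipped surrogate to every (environment step, denoising step) pair. The function names, the array shapes, the choice to copy each environment-step advantage to all of its denoising steps, and the reading of "modified denoising schedule" as a floor on the noise std are all illustrative assumptions.

```python
import numpy as np


def floor_noise_std(stds, min_std):
    """Assumed reading of a 'modified denoising schedule': keep every
    denoising step's noise std at or above a floor so sampling still explores."""
    return np.maximum(np.asarray(stds, dtype=float), min_std)


def gaussian_log_prob(x, mean, std):
    """Log-density of x under a diagonal Gaussian, summed over the last axis."""
    z = (x - mean) / std
    return np.sum(-0.5 * z**2 - np.log(std) - 0.5 * np.log(2.0 * np.pi), axis=-1)


def ppo_clip_loss(new_logp, old_logp, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate, returned as a loss to minimize."""
    ratio = np.exp(new_logp - old_logp)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -np.mean(np.minimum(unclipped, clipped))


def two_layer_ppo_loss(next_actions, old_means, new_means, stds,
                       env_advantages, clip_eps=0.2):
    """PPO loss over all (env step t, denoising step k) pairs.

    next_actions, old_means, new_means: arrays of shape (T, K, D), where
    next_actions[t, k] is the sample produced by denoising step k at env step t.
    stds: shape (K,), one noise std per denoising step.
    env_advantages: shape (T,), one advantage per environment step.
    """
    std = np.asarray(stds, dtype=float)[None, :, None]           # (1, K, 1)
    old_logp = gaussian_log_prob(next_actions, old_means, std)   # (T, K)
    new_logp = gaussian_log_prob(next_actions, new_means, std)   # (T, K)
    # Simplification: reuse the env-step advantage for each denoising step.
    adv = np.repeat(env_advantages[:, None], old_logp.shape[1], axis=1)
    return ppo_clip_loss(new_logp.ravel(), old_logp.ravel(), adv.ravel(), clip_eps)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, K, D = 4, 3, 2
    samples = rng.normal(size=(T, K, D))
    stds = floor_noise_std([0.5, 0.05, 0.01], min_std=0.1)
    advantages = np.array([1.0, -1.0, 2.0, 0.0])
    print(two_layer_ppo_loss(samples, samples + 0.1, samples, stds, advantages))
```

In a real training loop the "new" means would come from the current network and gradients would flow through them. The arrays here only show how the per-step likelihoods and the clipped ratio fit together.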

Allen Z. Ren • 1 year ago

DPPO yields marked improvements in training stability and final performance compared to other diffusion-based RL methods and common policy parameterizations such as Gaussian and Gaussian Mixture, across tasks from Gym, D3IL, Robomimic, and Furniture-Bench.

Allen Z. Ren • 1 year ago

Most remarkably, DPPO achieves zero-shot sim2real transfer in state-based, long-horizon assembly tasks, while the Gaussian policy shows a significant sim2real gap. DPPO also succeeds on a challenging pixel-based benchmark (see next), and we are actively working on pixel sim2real.

Allen Z. Ren • 1 year ago

DPPO solves the challenging Square and Transport tasks in Robomimic to >90% success using either state or pixel input and sparse reward. To our knowledge, DPPO is the first RL algorithm to solve Transport to a >50% success rate (from either state or pixel!).

Allen Z. Ren • 1 year ago

In three multi-stage assembly tasks from Furniture-Bench (One-leg, Lamp, and Round-table), DPPO improves the success rate of pre-trained policies from 57% to 97%, 12% to 87%, and 1% to 86%, respectively, learning from only sparse reward.
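For a quick read of those numbers, the short snippet below computes the absolute gain per task in percentage points. The success rates are the ones quoted in the post; the variable names are only illustrative.

```python
# Success rates (%) quoted in the post: (pre-trained, after DPPO fine-tuning).
rates = {"One-leg": (57, 97), "Lamp": (12, 87), "Round-table": (1, 86)}

# Absolute improvement per task, in percentage points.
gains = {task: after - before for task, (before, after) in rates.items()}

for task, gain in gains.items():
    print(f"{task}: +{gain} percentage points")
```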

Allen Z. Ren • 1 year ago

Why does DPPO work so well? DPPO engages in structured, on-manifold exploration: its exploration noise is more structured, which gives wide coverage around the expert data. This improves training efficiency and leads to smooth actions that aid sim2real transfer.

Allen Z. Ren • 1 year ago

DPPO also yields policies that are robust to perturbations in dynamics and the initial state distribution. Such robustness also allows more extensive domain randomization in simulation to facilitate sim2real transfer.
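As a rough illustration of what randomizing dynamics and initial states per episode can look like, here is a small standard-library sketch. The parameter names (`friction_scale`, `mass_scale`, `init_xy_offset`) and the ranges are hypothetical and are not taken from the DPPO code or paper. A simulator wrapper would apply the sampled values before each reset.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class EpisodeConfig:
    """Hypothetical per-episode randomization parameters."""
    friction_scale: float   # multiplier on nominal friction
    mass_scale: float       # multiplier on nominal object mass
    init_xy_offset: tuple   # offset added to the nominal initial (x, y)


def sample_episode_config(rng, dyn_range=0.2, init_range=0.05):
    """Draw dynamics multipliers from [1 - dyn_range, 1 + dyn_range] and an
    initial-position offset from [-init_range, init_range] on each axis."""
    def scale():
        return 1.0 + rng.uniform(-dyn_range, dyn_range)

    return EpisodeConfig(
        friction_scale=scale(),
        mass_scale=scale(),
        init_xy_offset=(
            rng.uniform(-init_range, init_range),
            rng.uniform(-init_range, init_range),
        ),
    )


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(sample_episode_config(rng))
```

A policy that tolerates wider `dyn_range` and `init_range` values can be trained under more aggressive randomization, which is the connection to sim2real drawn in the post.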

Allen Z. Ren • 1 year ago

In the future, we are excited to unlock the potential of DPPO in multitask settings by exploiting diverse expert data manifolds through structured exploration, and by combining RL with test-time guidance (e.g. POCO @LiruiWang1) and other architectures (e.g. Diffusion Forcing @BoyuanChen0).

Allen Z. Ren • 1 year ago

@BoyuanChen0 DPPO comes from a wonderful collaboration with @justinlidard @larsankile @anthonysimeono_ @pulkitology @Majumdar_Ani @Ben_Burchfiel Hongkai Dai + last author @max_simchowitz. Give DPPO a try with our code! Website/code: Paper:

Max Simchowitz • 1 year ago

it was such a pleasure working with @allenzren, and the results were shockingly good! very excited to see what comes next :)

Allen Z. Ren • 1 year ago

My greatest pleasure Max! Loved all the impromptu discussions :) and learned many things from you!
