Introducing DPPO, Diffusion Policy Policy Optimization. DPPO optimizes pre-trained Diffusion Policy using policy gradient from RL, showing surprising improvements over a variety of baselines across benchmarks and sim2real transfer
77,612 views • 1 year ago • via X (Twitter)

DPPO implements Proximal Policy Optimization (PPO) by treating the denoising process as part of a "two-layer MDP", making gradients efficiently computable. We improve performance with better advantage estimation + a modified denoising schedule to balance exploration and stability
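
To make the "two-layer MDP" idea concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes each denoising step is a Gaussian with a known mean and standard deviation, so every step has a tractable log-likelihood and the usual PPO clipped ratio can be formed per (environment step, denoising step) pair. The linear `denoise_mean`, the parameter `theta`, the noise levels, the random rollout data, and the choice to share one advantage across all denoising steps of an environment step are hypothetical placeholders for illustration.

```python
import numpy as np

def gaussian_log_prob(x, mean, std):
    """Log-density of a diagonal Gaussian, summed over the last (action) axis."""
    var = np.asarray(std, dtype=float) ** 2
    return np.sum(-0.5 * ((x - mean) ** 2 / var + np.log(2.0 * np.pi * var)), axis=-1)

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate, negated so that lower is better."""
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -np.mean(np.minimum(unclipped, clipped))

def denoise_mean(x, theta):
    """Hypothetical stand-in for the learned denoiser: a scalar linear map."""
    return theta * x

def denoising_log_probs(x_prev, x_next, theta, std):
    """Log-likelihood of each denoising transition x_prev -> x_next."""
    return gaussian_log_prob(x_next, denoise_mean(x_prev, theta), std)

# Toy batch: T environment steps, K denoising steps each, action dimension D.
rng = np.random.default_rng(0)
T, K, D = 4, 3, 2
x_prev = rng.normal(size=(T, K, D))            # input of each denoising step
x_next = rng.normal(size=(T, K, D))            # stand-in for samples drawn in the rollout
std = np.array([0.5, 0.3, 0.1]).reshape(K, 1)  # assumed per-step noise levels
advantages = rng.normal(size=(T, 1))           # one advantage per env step, shared over K

logp_old = denoising_log_probs(x_prev, x_next, theta=0.9, std=std)
logp_new = denoising_log_probs(x_prev, x_next, theta=0.9, std=std)
loss = ppo_clip_loss(logp_new, logp_old, advantages)
print(loss)
```

With identical old and new parameters every ratio equals 1, so the loss reduces to the negated mean advantage. In an actual training loop the gradient of this loss with respect to the denoiser parameters would come from an autodiff framework rather than NumPy.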

DPPO yields marked improvements in training stability and final performance compared to other diffusion-based RL methods and common policy parameterizations such as Gaussian and Gaussian Mixture, across tasks from Gym, D3IL, Robomimic, and Furniture-Bench

Most remarkably, DPPO achieves zero-shot sim2real transfer in state-based, long-horizon assembly tasks, while the Gaussian policy shows a significant sim2real gap. DPPO also succeeds on a challenging pixel-based benchmark (see next), and we are actively working on pixel sim2real

DPPO solves the challenging Square and Transport tasks in robomimic to >90% success using either state or pixel input and sparse reward. To our knowledge, DPPO is the first RL algorithm to solve Transport to a >50% success rate (from either state or pixel!)

In three multi-stage assembly tasks from Furniture-Bench, One-leg, Lamp, and Round-table, DPPO improves the success rate of pre-trained policies from 57% to 97%, 12% to 87%, and 1% to 86%, respectively, learning from only sparse reward

Why does DPPO work so well? DPPO engages in structured, on-manifold exploration, showing wide coverage around expert data due to more structured exploration noise. This improves training efficiency and leads to smooth actions that aid in sim2real

DPPO also yields policies that are robust to perturbations in dynamics and the initial state distribution. Such robustness also allows more extensive domain randomization in simulation to facilitate sim2real transfer

In the future, we are excited to unlock the potential of DPPO in multitask settings by exploiting diverse expert data manifolds through structured exploration, and to combine RL with test-time guidance (e.g. POCO @LiruiWang1) and other architectures (e.g. Diffusion Forcing @BoyuanChen0)

@BoyuanChen0 DPPO comes from a wonderful collaboration with @justinlidard @larsankile @anthonysimeono_ @pulkitology @Majumdar_Ani @Ben_Burchfiel Hongkai Dai + last author @max_simchowitz. Give DPPO a try with our code! Website/code: Paper:

it was such a pleasure working with @allenzren , and the results were shockingly good! very excited to see what comes next :)

My greatest pleasure Max! Loved all the impromptu discussions :) and learned many things from you!
