👇 Introducing DPPO, Diffusion Policy Policy Optimization. DPPO optimizes pre-trained Diffusion Policy using policy gradient from RL, showing surprising improvements over a variety of baselines across benchmarks and sim2real transfer

77,612 views • 1 year ago • via X (Twitter)

11 Comments

Allen Z. Ren • 1 year ago

DPPO implements Proximal Policy Optimization (PPO) by treating the denoising process as part of a “two-layer MDP”, making gradients efficiently computable. We improve performance with better advantage estimation + modified denoising schedule to balance exploration and stability
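
A rough illustration of the "two-layer MDP" idea, not the authors' implementation: if each denoising step is treated as a Gaussian transition, its log-likelihood has a closed form, and the usual PPO clipped surrogate can be summed over (environment step, denoising step) pairs. Everything named here (`toy_denoiser`, `denoise_chain`, the noise levels, the constant advantages) is a hypothetical stand-in chosen for the sketch.

```python
import math
import random

def gaussian_log_prob(x, mean, std):
    """Log-density of an isotropic Gaussian, summed over action dimensions."""
    return sum(
        -0.5 * ((xi - mi) / std) ** 2 - math.log(std) - 0.5 * math.log(2 * math.pi)
        for xi, mi in zip(x, mean)
    )

def denoise_chain(denoiser_mean, x_start, stds, rng):
    """Roll out one denoising chain, recording each step as an inner-MDP transition."""
    transitions = []
    x = x_start
    for std in stds:
        mean = denoiser_mean(x)
        x_next = [rng.gauss(m, std) for m in mean]  # sample the next, less noisy action
        transitions.append((x, x_next, std))
        x = x_next
    return x, transitions

def ppo_clip_loss(new_logps, old_logps, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate (to be minimized), averaged over all steps."""
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        ratio = math.exp(new_lp - old_lp)
        clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
        total += -min(ratio * adv, clipped * adv)
    return total / len(new_logps)

# Hypothetical toy denoiser: shrinks the noisy action toward zero.
def toy_denoiser(x):
    return [0.9 * xi for xi in x]

rng = random.Random(0)
stds = [0.5, 0.3, 0.1]  # assumed per-step noise levels, purely illustrative
final_action, transitions = denoise_chain(toy_denoiser, [1.0, -1.0], stds, rng)

# Log-probabilities of the sampled denoising steps under the data-collecting policy.
old_logps = [gaussian_log_prob(x_next, toy_denoiser(x), std) for x, x_next, std in transitions]
# Before any update the "new" policy equals the old one, so every ratio is 1.
new_logps = list(old_logps)
advantages = [1.0] * len(transitions)  # assumed: environment-level advantage shared by all steps
loss = ppo_clip_loss(new_logps, old_logps, advantages)
```

In a real setup the denoiser would be a neural network and the loss would be differentiated with respect to its parameters; the sketch only shows why per-step Gaussian likelihoods make the policy-gradient ratio cheap to evaluate.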

Allen Z. Ren • 1 year ago

DPPO yields marked improvements in training stability and final performance compared to other diffusion-based RL methods and common policy parameterizations such as Gaussian and Gaussian Mixture, across tasks from Gym, D3IL, Robomimic, and Furniture-Bench

Allen Z. Ren • 1 year ago

Most remarkably, DPPO achieves zero-shot sim2real transfer in state-based, long-horizon assembly tasks, while the Gaussian policy shows a significant sim2real gap. DPPO also succeeds on a challenging pixel-based benchmark (see next), and we are actively working on pixel sim2real

Allen Z. Ren • 1 year ago

DPPO solves the challenging Square and Transport tasks in robomimic to >90% success using either state or pixel input and sparse reward. To our knowledge, DPPO is the first RL algorithm to solve Transport to >50% success rates (from either state or pixel!)

Allen Z. Ren • 1 year ago

In three multi-stage assembly tasks from Furniture-Bench, One-leg, Lamp, and Round-table, DPPO improves the success rate of pre-trained policies from 57% to 97%, 12% to 87%, and 1% to 86%, respectively, learning from only sparse reward

Allen Z. Ren • 1 year ago

Why does DPPO work so well? DPPO engages in structured, on-manifold exploration, showing wide coverage around expert data due to more structured exploration noise. This improves training efficiency and leads to smooth actions that aid in sim2real

Allen Z. Ren • 1 year ago

DPPO also yields policies that are robust to perturbations in dynamics and the initial state distribution. Such robustness also allows more extensive domain randomization in simulation to facilitate sim2real transfer

Allen Z. Ren • 1 year ago

In future, we are excited to unlock the potential of DPPO in multitask settings by exploiting diverse expert data manifolds through structured exploration + combining RL with test-time guidance (e.g. POCO @LiruiWang1), and other architectures (e.g. Diffusion Forcing @BoyuanChen0)

Allen Z. Ren • 1 year ago

@BoyuanChen0 DPPO comes from a wonderful collaboration with @justinlidard @larsankile @anthonysimeono_ @pulkitology @Majumdar_Ani @Ben_Burchfiel Hongkai Dai + last author @max_simchowitz Give DPPO a try with our code! Website/code: Paper:

Max Simchowitz • 1 year ago

it was such a pleasure working with @allenzren, and the results were shockingly good! very excited to see what comes next :)

Allen Z. Ren • 1 year ago

My greatest pleasure Max! Loved all the impromptu discussions :) and learned many things from you!
