👇 Introducing DPPO: Diffusion Policy Policy Optimization. DPPO optimizes a pre-trained Diffusion Policy using policy gradients from RL, showing surprising improvements over a variety of baselines across benchmarks and in sim2real transfer.

77,612 views • 1 year ago • via X (Twitter)

11 comments

Allen Z. Ren • 1 year ago

DPPO implements Proximal Policy Optimization (PPO) by treating the denoising process as part of a “two-layer MDP”, making gradients efficiently computable. We improve performance with better advantage estimation + modified denoising schedule to balance exploration and stability
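A minimal sketch of why the "two-layer MDP" view makes the gradient computable, assuming (this is not spelled out in the thread) that each denoising step samples from an isotropic Gaussian with a known standard deviation. Under that assumption every denoising transition has a closed-form log-likelihood, so the usual PPO clipped surrogate can be evaluated per denoising step. All function names and parameters below are hypothetical, and reusing one environment-level advantage for every denoising step is a simplification made only for illustration.

```python
import math


def gaussian_log_prob(x, mean, std):
    """Log-density of x under an isotropic Gaussian N(mean, std^2 * I)."""
    return sum(
        -0.5 * ((xi - mi) / std) ** 2 - math.log(std) - 0.5 * math.log(2 * math.pi)
        for xi, mi in zip(x, mean)
    )


def ppo_clipped_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """Standard PPO clipped surrogate for one transition (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


def denoising_chain_objective(chain, means_new, means_old, stds, advantage,
                              clip_eps=0.2):
    """Average the clipped surrogate over the denoising steps of one action.

    chain        : noisy actions [a^K, ..., a^0], from pure noise to final action
    means_new[i] : mean predicted by the current policy for chain[i] -> chain[i+1]
    means_old[i] : mean predicted by the data-collecting policy for the same step
    stds[i]      : assumed (fixed) standard deviation of that denoising step
    advantage    : advantage of the environment step (simplifying assumption:
                   shared unchanged by all denoising steps)
    """
    num_steps = len(chain) - 1
    total = 0.0
    for i in range(num_steps):
        next_action = chain[i + 1]
        logp_new = gaussian_log_prob(next_action, means_new[i], stds[i])
        logp_old = gaussian_log_prob(next_action, means_old[i], stds[i])
        total += ppo_clipped_objective(logp_new, logp_old, advantage, clip_eps)
    return total / num_steps
```

In a real implementation the means would come from the denoising network and the objective would be differentiated with an autodiff library; the point here is only that each inner (denoising) step is an ordinary Gaussian policy step, so the likelihood ratio is cheap to evaluate.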

Allen Z. Ren • 1 year ago

DPPO yields marked improvements in training stability and final performance compared to other diffusion-based RL methods and common policy parameterizations such as Gaussian and Gaussian Mixture, across tasks from Gym, D3IL, Robomimic, and Furniture-Bench

Allen Z. Ren • 1 year ago

Most remarkably, DPPO achieves zero-shot sim2real transfer in state-based, long-horizon assembly tasks, while the Gaussian policy shows a significant sim2real gap. DPPO also succeeds on a challenging pixel-based benchmark (see next), and we are actively working on pixel sim2real

Allen Z. Ren • 1 year ago

DPPO solves the challenging Square and Transport tasks in Robomimic to >90% success using either state or pixel input and sparse reward. To our knowledge, DPPO is the first RL algorithm to solve Transport to a >50% success rate (from either state or pixel!)

Allen Z. Ren • 1 year ago

In three multi-stage assembly tasks from Furniture-Bench (One-leg, Lamp, and Round-table), DPPO improves the success rate of pre-trained policies from 57% to 97%, 12% to 87%, and 1% to 86%, respectively, learning from only sparse reward

Allen Z. Ren • 1 year ago

Why does DPPO work so well? DPPO engages in structured, on-manifold exploration, showing wide coverage around expert data due to more structured exploration noise. This improves training efficiency and leads to smooth actions that aid in sim2real

Allen Z. Ren • 1 year ago

DPPO also yields policies that are robust to perturbations in dynamics and the initial state distribution. Such robustness also allows more extensive domain randomization in simulation to facilitate sim2real transfer

Allen Z. Ren • 1 year ago

In the future, we are excited to unlock the potential of DPPO in multitask settings by exploiting diverse expert data manifolds through structured exploration + combining RL with test-time guidance (e.g. POCO @LiruiWang1), and other architectures (e.g. Diffusion Forcing @BoyuanChen0)

Allen Z. Ren • 1 year ago

@BoyuanChen0 DPPO comes from a wonderful collaboration with @justinlidard, @larsankile, @anthonysimeono_, @pulkitology, @Majumdar_Ani, @Ben_Burchfiel, Hongkai Dai + last author @max_simchowitz. Give DPPO a try with our code! Website/code: Paper:

Max Simchowitz • 1 year ago

it was such a pleasure working with @allenzren, and the results were shockingly good! very excited to see what comes next :)

Allen Z. Ren • 1 year ago

My greatest pleasure Max! Loved all the impromptu discussions :) and learned many things from you!
