If you have a policy that uses diffusion/flow (e.g. diffusion VLA), you can run RL where the actor chooses the noise, which is then denoised by the policy to produce an action. This method, which we call diffusion steering (DSRL), leads to a remarkably efficient RL method! 🧵👇
152,824 views • 1 year ago • via X (Twitter)
9 comments

DSRL trains an actor and Q-function, treating the diffusion noise as the action space. Because samples from the noise prior map to reasonable actions for the policy, DSRL essentially explores "inside" the set of reasonable pre-trained behaviors, making it extremely efficient.
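
To make the idea concrete, here is a minimal sketch of RL in noise space, not the authors' implementation. Everything in it is an assumption for illustration: `frozen_policy` is a hypothetical toy stand-in for a pre-trained flow policy, the task is a one-step problem with a single fixed state, the Q-function is a quadratic fit by ridge regression, and the actor is just a mean vector in noise space that is clipped to stay where the noise prior has reasonable density.

```python
import numpy as np

def frozen_policy(state, noise, steps=10):
    """Hypothetical stand-in for a pre-trained flow policy (never updated).

    Euler-integrates a toy velocity field starting from `noise`, so the
    resulting action is a deterministic function of (state, noise).
    """
    action = np.array(noise, dtype=float)   # the flow starts at the chosen noise
    mode = np.tanh(state)                   # toy "demonstrated" behavior
    for _ in range(steps):
        action = action + 0.7 * (mode - action) / steps
    return action

def q_features(noise):
    """Quadratic features [1, w, w^2] for a toy Q-function over noise."""
    w = np.atleast_2d(noise)
    return np.hstack([np.ones((len(w), 1)), w, w ** 2])

def run_dsrl_sketch(trials=200, seed=0):
    rng = np.random.default_rng(seed)
    state = np.array([0.5, -0.5])           # single fixed state (assumption)
    goal = np.array([0.7, 0.25])            # hypothetical sparse-reward target
    actor_mean = np.zeros(2)                # actor: a mean in NOISE space
    q_weights = np.zeros(5)
    noises, rewards = [], []
    for _ in range(trials):
        # The actor picks the noise; the frozen policy denoises it into an action.
        noise = np.clip(actor_mean + rng.standard_normal(2), -2.0, 2.0)
        action = frozen_policy(state, noise)
        reward = float(np.linalg.norm(action - goal) < 0.3)   # sparse 0/1 reward
        noises.append(noise)
        rewards.append(reward)
        # Critic: ridge regression of reward on noise features.
        phi = q_features(np.array(noises))
        q_weights = np.linalg.solve(phi.T @ phi + 1e-2 * np.eye(5),
                                    phi.T @ np.array(rewards))
        # Actor: one gradient-ascent step on Q with respect to the noise.
        grad = q_weights[1:3] + 2.0 * q_weights[3:5] * actor_mean
        actor_mean = np.clip(actor_mean + 0.2 * grad, -2.0, 2.0)
    return actor_mean, q_weights, float(np.mean(rewards))

if __name__ == "__main__":
    mean, weights, success_rate = run_dsrl_sketch()
    print("steered noise mean:", mean, "success rate:", success_rate)
```

The point of the design is that the policy weights are never touched: only the noise fed into the policy changes, so every explored action is something the pre-trained policy could already have produced.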

DSRL learns essentially in real time, with good results in as little as 50 trials (it's so efficient that a person can literally sit in front of the robot and push a button to assign sparse rewards).

This was a really fun collaboration led by @ajwagenmaker. Project website with paper: To find out more, check out his thread here:

Would a supervised learning version of this work, where the noise distribution is a parameter that is also optimized along with the policy weights?

how long would it take to get that first sparse reward with this method?

Controlling the noise instead of the action itself is a surprisingly effective approach.

This is pretty amazing, and the visualization made everything so easy to understand😅

Will making the initial noise distribution a learnable parameter reduce randomness and thus make the model more prone to overfitting?

😅🥹

