DPO Debate: Is RL needed for RLHF? All things as we cannot settle if DPO or RL is better. At least it is a good exercise.
1. Derivations in the DPO paper. Hint, the authors are good at math
2. cDPO, IPO, and related equations
3. Speculation on potential...
100,027 views • 2 years ago • via X (Twitter)
11 comments

here we go again @abacaj @ericmitchellai @srush_nlp @_lewtun @hamishivi @rajammanabrolu @teortaxesTex @yacineMTB @erhartford @Teknium1 @ethayarajh

If the derivation of Eq 4 is a bit esoteric, there's always the more direct (but tedious) approach: form the Lagrangian L(π)=E[r(y)]-βKL(π||π_ref)+λ(∑π(y)-1) and set dL/dπ(y)=0, solve.
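
For reference, here is a sketch of that Lagrangian route. It assumes a fixed prompt (suppressed in the notation), a discrete set of outputs y, and π_ref(y) > 0 wherever π(y) > 0; the normalizer Z is a symbol introduced here for convenience, not one used in the comment above.

```latex
\begin{align*}
\mathcal{L}(\pi)
  &= \sum_y \pi(y)\, r(y)
   - \beta \sum_y \pi(y) \log\frac{\pi(y)}{\pi_{\mathrm{ref}}(y)}
   + \lambda \Big( \sum_y \pi(y) - 1 \Big) \\[4pt]
\frac{\partial \mathcal{L}}{\partial \pi(y)}
  &= r(y) - \beta \log\frac{\pi(y)}{\pi_{\mathrm{ref}}(y)} - \beta + \lambda = 0 \\[4pt]
\Rightarrow\quad \pi(y)
  &= \pi_{\mathrm{ref}}(y)\, \exp\!\Big(\frac{r(y)}{\beta}\Big)\,
     \exp\!\Big(\frac{\lambda - \beta}{\beta}\Big) \\[4pt]
\sum_y \pi(y) = 1
  \quad\Rightarrow\quad
  \exp\!\Big(\frac{\lambda - \beta}{\beta}\Big)
  &= \frac{1}{Z},
  \qquad Z = \sum_y \pi_{\mathrm{ref}}(y)\, \exp\!\Big(\frac{r(y)}{\beta}\Big) \\[4pt]
\Rightarrow\quad \pi^{*}(y)
  &= \frac{1}{Z}\, \pi_{\mathrm{ref}}(y)\, \exp\!\Big(\frac{r(y)}{\beta}\Big)
\end{align*}
```

The objective is concave in π (expected reward is linear and the KL term is convex), so this stationary point is the maximizer, and since π* is positive wherever π_ref is, the nonnegativity constraints never bind.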

Can you post that derivation too?

Nice summary! BTW, please also take a look at our APO (Adversarial Preference Optimization), which is an adversarial alignment method compatible with DPO, IPO, or PPO 😃

“Language tasks are trees” I finally found it! Second guessing: what about goal-oriented dialogues, having the ability to backspace, etc.

Goal oriented as you describe it prolly works, but I'm not a theory person! Good point!

@ylecun @natolambert have you read the IPO paper carefully? We had chats with the main authors of DPO and they are aware of the limitations of DPO, i.e., the Bradley-Terry assumption and overfitting. DPO is a pioneering work in this field, and in IPO we try to improve on it by alleviating these limitations.

@ylecun I read it a good bit, but definitely not all. Two things stuck out that made me turn away from a complete read:
* It came off as pretty aggressive towards DPO, which is a different style of paper to read
* It still needs a lot of experiments
I'm happy to keep learning more

@ylecun Don’t forget the #GazaGenocide and #IsraelCriminalWar and ongoing atrocities

I am not sure that making quite negative comments about a work (based on softballs like "there are no experiments", which is not true) either helps advance the field or helps people gain a better understanding of the topic.

you're right, followed up via email.
