
DPO Debate: Is RL needed for RLHF? All things as we cannot settle if DPO or RL is better. At least it is a good exercise.
1. Derivations in the DPO paper. Hint, the authors are good at math
2. cDPO, IPO, and related equations
3. Speculation on potential...

100,027 views • 2 years ago • via X (Twitter)

11 comments

Nathan Lambert • 2 years ago

here we go again @abacaj @ericmitchellai @srush_nlp @_lewtun @hamishivi @rajammanabrolu @teortaxesTex @yacineMTB @erhartford @Teknium1 @ethayarajh

Allan Zhou • 2 years ago

If the derivation of Eq 4 is a bit esoteric, there's always the more direct (but tedious) approach: form the Lagrangian L(π)=E[r(y)]-βKL(π||π_ref)+λ(∑π(y)-1) and set dL/dπ(y)=0, solve.
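For readers who want the steps spelled out, here is one way the Lagrangian route described in this comment might go. It is a sketch under stated assumptions, not the commenter's own write-up: the prompt x is suppressed as in the comment, y is taken to range over a finite set, and the normalizer Z is notation introduced here for illustration.

```latex
% Objective plus the normalization constraint (prompt x suppressed)
\mathcal{L}(\pi) = \sum_y \pi(y)\, r(y)
  - \beta \sum_y \pi(y) \log\frac{\pi(y)}{\pi_{\mathrm{ref}}(y)}
  + \lambda\Big(\sum_y \pi(y) - 1\Big)

% Stationarity in each coordinate \pi(y)
\frac{\partial \mathcal{L}}{\partial \pi(y)}
  = r(y) - \beta\Big(\log\frac{\pi(y)}{\pi_{\mathrm{ref}}(y)} + 1\Big) + \lambda = 0

% Solve for \pi(y)
\pi(y) = \pi_{\mathrm{ref}}(y)\, \exp\!\Big(\frac{r(y)}{\beta}\Big)\,
         \exp\!\Big(\frac{\lambda}{\beta} - 1\Big)

% Fix \lambda by requiring \sum_y \pi(y) = 1 (Z is assumed notation)
\exp\!\Big(\frac{\lambda}{\beta} - 1\Big) = \frac{1}{Z},
\qquad Z = \sum_y \pi_{\mathrm{ref}}(y)\, \exp\!\Big(\frac{r(y)}{\beta}\Big)

% Resulting optimal policy
\pi^*(y) = \frac{1}{Z}\, \pi_{\mathrm{ref}}(y)\, \exp\!\Big(\frac{r(y)}{\beta}\Big)
```

Since the KL term is convex in π, the objective is concave, so this stationary point is the maximizer. The solution is strictly positive wherever π_ref is, so the non-negativity constraints on π never bind. Restoring x as a conditioning variable should give the closed form the comment refers to as Eq 4.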

Nathan Lambert • 2 years ago

Can you post that derivation too?

Pengyu Cheng • 2 years ago

Nice summary! BTW, please also take a glance at our APO (Adversarial Preference Optimization), which is an adversarial alignment method compatible with DPO, IPO, and PPO 😃

RanW • 2 years ago

“Language tasks are trees” I finally found it! Second guessing: what about goal-oriented dialogues, having the ability to backspace, etc.

Nathan Lambert • 2 years ago

Goal oriented as you describe it prolly works, but I'm not a theory person! Good point!

Mohammad Azar • 2 years ago

@ylecun @natolambert have you read the IPO paper carefully? We had chats with the main authors of DPO and they are aware of the limitations of DPO, i.e., the Bradley-Terry assumption and overfitting. DPO is a pioneering work in this field, and in IPO we try to improve on it by alleviating these limitations.

Nathan Lambert • 2 years ago

@ylecun I read it a good bit, but definitely not all. Two things stuck out that made me turn away from a complete read:
* It came off as pretty aggressive towards DPO, which is a different style of paper to read
* It still needs a lot of experiments
I'm happy to keep learning more

Snowflect • 2 years ago

@ylecun Don’t forget the #GazaGenocide and #IsraelCriminalWar and ongoing atrocities

Mohammad Azar • 2 years ago

I am not sure that expressing quite negative comments about a work (based on softballs like "there are no experiments", which is not true) either helps the advancement of the field or helps people gain a better understanding of the topic.

Nathan Lambert • 2 years ago

you're right, followed up via email.

Related videos

New Course: Post-training of LLMs

Learn to post-train and customize an LLM in this short course, taught by Banghua Zhu, Assistant Professor at the University of Washington and co-founder of @NexusflowX.

Training an LLM to follow instructions or answer questions has two key stages: pre-training and post-training. In pre-training, it learns to predict the next word or token from large amounts of unlabeled text. In post-training, it learns useful behaviors such as following instructions, tool use, and reasoning.

Post-training transforms a general-purpose token predictor, trained on trillions of unlabeled text tokens, into an assistant that follows instructions and performs specific tasks. Because it is much cheaper than pre-training, it is practical for many more teams to incorporate post-training methods into their workflows than pre-training.

In this course, you'll learn three common post-training methods, Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Online Reinforcement Learning (RL), and how to use each one effectively. With SFT, you train the model on pairs of input and ideal output responses. With DPO, you provide both a preferred (chosen) and a less preferred (rejected) response and train the model to favor the preferred output. With RL, the model generates an output, receives a reward score based on human or automated feedback, and is updated to improve performance.

You'll learn the basic concepts, common use cases, and principles for curating high-quality data for effective training. Through hands-on labs, you'll download a pre-trained model from Hugging Face and post-train it using SFT, DPO, and RL to see how each technique shapes model behavior.

In detail, you'll:
- Understand what post-training is, when to use it, and how it differs from pre-training.
- Build an SFT pipeline to turn a base model into an instruct model.
- Explore how DPO reshapes behavior by minimizing a contrastive loss, penalizing poor responses and reinforcing preferred ones.
- Implement a DPO pipeline to change the identity of a chat assistant.
- Learn online RL methods such as Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), and how to design reward functions.
- Train a model with GRPO to improve its math capabilities using a verifiable reward.

Post-training is one of the most rapidly developing areas of LLM training. Whether you're building a high-accuracy context-specific assistant, fine-tuning a model's tone, or improving task-specific accuracy, this course will give you experience with the most important techniques shaping how LLMs are post-trained today.

Please sign up here:

Andrew Ng

125,146 views • 11 months ago