DPO Debate: Is RL needed for RLHF? All things as we cannot settle if DPO or RL is better. At least it is a good exercise.
1. Derivations in the DPO paper. Hint, the authors are good at math
2. cDPO, IPO, and related equations
3. Speculation on potential...

Nathan Lambert

89,218 subscribers

100,027 views • 2 years ago • via X (Twitter)

11 comments

Nathan Lambert, 2 years ago

here we go again @abacaj @ericmitchellai @srush_nlp @_lewtun @hamishivi @rajammanabrolu @teortaxesTex @yacineMTB @erhartford @Teknium1 @ethayarajh

Allan Zhou, 2 years ago

If the derivation of Eq 4 is a bit esoteric, there's always the more direct (but tedious) approach: form the Lagrangian L(π)=E[r(y)]-βKL(π||π_ref)+λ(∑π(y)-1) and set dL/dπ(y)=0, solve.
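The Lagrangian route described above can be sketched as follows. This is an illustrative derivation, not the commenter's own write-up. It assumes a discrete response space, suppresses the prompt x, assumes π_ref(y) > 0 for every y, and takes "Eq 4" to mean the closed-form optimal policy in the DPO paper. The normalizer Z is notation introduced here.

```latex
\begin{aligned}
\mathcal{L}(\pi) &= \sum_y \pi(y)\, r(y)
  - \beta \sum_y \pi(y) \log\frac{\pi(y)}{\pi_{\mathrm{ref}}(y)}
  + \lambda\Big(\sum_y \pi(y) - 1\Big) \\
\frac{\partial \mathcal{L}}{\partial \pi(y)} &= r(y)
  - \beta\Big(\log\frac{\pi(y)}{\pi_{\mathrm{ref}}(y)} + 1\Big) + \lambda = 0 \\
\Rightarrow\quad \pi(y) &= \pi_{\mathrm{ref}}(y)\,
  \exp\!\Big(\frac{r(y)}{\beta}\Big)\, \exp\!\Big(\frac{\lambda}{\beta} - 1\Big) \\
\sum_y \pi(y) = 1 \;\Rightarrow\; \exp\!\Big(\frac{\lambda}{\beta} - 1\Big) &= \frac{1}{Z},
  \qquad Z = \sum_y \pi_{\mathrm{ref}}(y)\, \exp\!\Big(\frac{r(y)}{\beta}\Big) \\
\pi^{*}(y) &= \frac{1}{Z}\, \pi_{\mathrm{ref}}(y)\, \exp\!\Big(\frac{r(y)}{\beta}\Big)
\end{aligned}
```

For β > 0 the objective is concave in π, so this stationary point is the maximizer, and the non-negativity constraint on π holds automatically because every factor is positive.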

Nathan Lambert, 2 years ago

Can you post that derivation too?

Pengyu Cheng, 2 years ago

Nice summary! BTW, please also take a look at our APO (Adversarial Preference Optimization), which is an adversarial alignment method compatible with DPO, IPO, and PPO 😃

RanW, 2 years ago

“Language tasks are trees” I finally found it! Second guessing: what about goal-oriented dialogues, having the ability to backspace, etc.

Nathan Lambert, 2 years ago

Goal oriented as you describe it prolly works, but I'm not a theory person! Good point!

Mohammad Azar, 2 years ago

@ylecun @natolambert have you read the IPO paper carefully? We had chats with the main authors of DPO and they are aware of the limitations of DPO, i.e., the Bradley-Terry assumption and overfitting. DPO is a pioneering work in this field, and in IPO we try to improve it by alleviating these limitations.

Nathan Lambert, 2 years ago

@ylecun I read it a good bit, but definitely not all. Two things stuck out that made me turn away from a complete read:
* It came off as pretty aggressive towards DPO, which is a different style of paper to read
* It still needs a lot of experiments
I'm happy to keep learning more

Snowflect, 2 years ago

@ylecun Don’t forget the #GazaGenocide and #IsraelCriminalWar and ongoing atrocities

Mohammad Azar, 2 years ago

I am not sure that expressing quite negative comments about a work (based on softballs like "there are no experiments", which is not true) either helps the advancement of the field or helps people gain a better understanding of the topic.

Nathan Lambert, 2 years ago

you're right, followed up via email.
