A new way to fine tune: Preference Fine-Tuning. Customize models by comparing responses, rather than fixed targets—ideal for subjective tasks where tone, style, and creativity matter.
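One common way to train on pairwise comparisons is Direct Preference Optimization (DPO). The post does not say which objective is used, so the sketch below is an illustrative assumption rather than a description of the actual implementation. It computes the DPO loss for a single preferred/non-preferred pair from summed log-probabilities; the function name, argument names, and the `beta` default are all hypothetical.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair (assumed objective, see lead-in).

    Each argument is the summed log-probability of a whole response under
    either the model being trained (policy) or a frozen reference model.
    """
    # How much more the policy favors each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), computed in a numerically stable way.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

The loss falls as the policy raises the preferred response's probability relative to the non-preferred one, which is why no fixed target output is needed, only a comparison.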

199,296 views • 1 year ago • via X (Twitter)

11 Comments

Steven Heidel • 1 year ago

great work @andrwpng !

Hummingbird • 1 year ago

Choose who you want to see your message, by choosing who you want to share your message with! 🔥 Remember—we value all influencers, regardless of their follower count. Charging up influencers and KOLs of all sizes with a new way to leverage audiences is how we do things at Hummingbird. 🌐 Join us as we bring to life a marketing platform that lets your business utilize the power of pre-selected influencers perfectly suited to your target audience. 🚀 #Hummingbird #Web3Community #DigitalMarketing

Finance Wizard • 1 year ago

Unban Finance Wizard. You removed a lot of top GPTs by accident two days ago, and we all had to file appeals to get back up, yet I was denied.

Vincent • 1 year ago

huge disappointment again. we waited for new smarter models like gemini 2.0 flash

AI Leaks and News • 1 year ago

Can’t wait to try it!

Itay Bachman • 1 year ago

when o1 for tier 4?

Callum • 1 year ago

This is great, will definitely be testing it over the next couple days

Rjay • 1 year ago

@OpenAI Making progress!

Renee Bush • 1 year ago

@OpenAI Amazing

2awake • 1 year ago

@OpenAI 🤝⚡️

OxyCoffin • 1 year ago

you should totally implement this in custom gpts and projects.

Related Videos

New Course: Reinforcement Fine-Tuning LLMs with GRPO!

Learn to use reinforcement learning to improve your LLM performance in this short course, built in collaboration with Predibase by Rubrik, and taught by Travis Addair, its Co-Founder and CTO, and Arnav Garg, its Senior Engineer and Machine Learning Lead.

Reasoning models have been one of the most important developments in LLMs. Reinforcement Fine-Tuning (RFT) uses rewards to encourage LLMs to find solutions to multi-step reasoning tasks such as solving math problems and debugging code - without needing pre-existing training examples like in traditional supervised fine-tuning.

Group Relative Policy Optimization (GRPO) is a reinforcement fine-tuning algorithm gaining rapid adoption. Developed by the DeepSeek team and used to train the R1 reasoning model, GRPO uses reward functions that you can write in Python to assign rewards to model responses. It’s beneficial for tasks with verifiable outcomes and can work well even with fewer than 100 training examples. It can also significantly improve the reasoning ability of smaller LLMs, making applications faster and more cost effective.

In this course, you’ll take a technical deep dive into RFT with GRPO. You’ll learn to build reward functions that you can use in the GRPO training process to guide an LLM toward better performance on multi-step reasoning tasks. In detail, you’ll:

- Learn when reinforcement fine-tuning is a better fit than supervised fine-tuning, especially for tasks involving multi-step reasoning or limited labeled data.
- Understand how GRPO uses programmable reward functions as a more scalable alternative to the human feedback required for other reinforcement learning algorithms, such as RLHF and DPO.
- Frame the Wordle game as a reinforcement fine-tuning problem and see how an LLM can learn to plan, analyze feedback, and improve its strategy over time.
- Design reward functions that power the reinforcement fine-tuning process.
- Learn techniques for evaluating more subjective tasks, such as rating the quality of a text summary, using an LLM as a judge.
- Understand why reward hacking happens and how to avoid it by adding penalty functions to discourage undesirable behaviors.
- Learn the four key components of the loss calculation in the GRPO algorithm: token probability distribution ratios, advantages, clipping, and KL-divergence.
- Launch reinforcement fine-tuning jobs using Predibase’s hosted training services.

By the end of this course, you’ll be able to build and fine-tune LLMs using reinforcement learning to improve reasoning without relying on large labeled datasets or subjective human feedback.

Please sign up here:
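The pieces named above (a programmable reward function, advantages, probability ratios, clipping, and a KL term) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the course's or Predibase's code: the Wordle reward shown here is a hypothetical toy that scores only exact-position matches, and the function names and the `clip_eps` and `kl_coef` defaults are illustrative choices.

```python
import math
from statistics import mean, pstdev


def wordle_reward(guess, secret):
    """Hypothetical toy reward: fraction of letters in the correct position."""
    return sum(g == s for g, s in zip(guess, secret)) / len(secret)


def group_advantages(rewards, eps=1e-8):
    """Score each sampled response relative to its own group of samples."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def grpo_token_loss(logp_new, logp_old, logp_ref, advantage,
                    clip_eps=0.2, kl_coef=0.04):
    """Per-token GRPO-style loss to minimize (sketch, defaults are assumed)."""
    # Token probability ratio between the current and the sampling policy.
    ratio = math.exp(logp_new - logp_old)
    # Clipping keeps a single update from moving the policy too far.
    clipped = max(min(ratio, 1 + clip_eps), 1 - clip_eps)
    surrogate = min(ratio * advantage, clipped * advantage)
    # Non-negative KL estimate against a frozen reference model.
    d = logp_ref - logp_new
    kl = math.exp(d) - d - 1
    return -(surrogate - kl_coef * kl)
```

For example, `group_advantages([wordle_reward(g, "crate") for g in guesses])` turns raw rewards for several sampled guesses into relative scores, so responses that beat their group's average are reinforced and the rest are discouraged.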

Andrew Ng

86,442 views • 1 year ago

Small Language Models (SLMs) are the future of AI. "Small" (SLM) instead of "Large" (LLM). These small models are highly specialized models with superhuman abilities on specific tasks.

Here are two techniques to build these models:
• Spectrum
• Model Merging

I give you a short introduction in the attached video, but here is a quick summary:

Spectrum helps us identify the most relevant layers to solve one specific task. We can ignore everything else and focus on fine-tuning these layers. Using Spectrum, we can fine-tune models in a heartbeat.

Model Merging combines multiple models into a single model that is much better than any of the individual input models. You can also combine models specialized in different tasks and get a model with multiple abilities.

This is the state of the art of productizing models. It's what Arcee.ai's platform does behind the scenes. Arcee collaborated with me on this post and is sponsoring it.

There are three main steps to produce a model for your particular use case:
1. You create a dataset by uploading your data.
2. You train a model. At this step, Arcee uses Spectrum and Model Merging to produce a highly specialized model for your task.
3. You can deploy that model to any environment you want.

Three important notes:
• The training process is 2x faster and 2x cheaper than regular fine-tuning.
• The resulting models are smaller and have higher accuracy.
• They create these specialized models from open-source models.

Check this site so you can fully appreciate how this works:

If you want to fine-tune an open-source model, consider Arcee's platform. This is the state of the art.
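To make the idea of model merging concrete, here is a minimal sketch of the simplest variant, a weighted average of matching parameters. The post does not describe which merging method Arcee uses, so this is an assumption chosen for illustration. Parameters are represented as plain lists of floats, and `merge_linear` is a hypothetical helper, not a library function.

```python
def merge_linear(state_dicts, weights=None):
    """Weighted average of parameter dicts with identical names and sizes."""
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    if len(weights) != len(state_dicts):
        raise ValueError("need exactly one weight per model")
    names = state_dicts[0].keys()
    if any(sd.keys() != names for sd in state_dicts):
        raise ValueError("models must share the same parameter names")
    merged = {}
    for name in names:
        params = [sd[name] for sd in state_dicts]
        # Average element by element across all input models.
        merged[name] = [
            sum(w * p[i] for w, p in zip(weights, params))
            for i in range(len(params[0]))
        ]
    return merged
```

Calling `merge_linear([model_a, model_b])` averages the two models equally, while passing `weights=[0.8, 0.2]` keeps the result closer to the first. Averaging like this only makes sense for models that share an architecture and a common base.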

Santiago

164,162 views • 1 year ago