Reinforcement Learning from Human Feedback (RLHF) is gaining traction. This field aims to make AI more responsible by incorporating human values and preferences. In this video, Nathan Lambert, a research scientist and RLHF team lead at Hugging Face, explores its inner workings, applications, and industry impact.

RLHF has gained... the spotlight in recent years. The growth of language models like Anthropic’s Claude and OpenAI's ChatGPT has increased interest in human-feedback integration.

"There are some rumors that OpenAI had two teams; one was doing RLHF and the other instruction fine-tuning. And the RLHF team kept getting more and more performance."

Understanding RLHF

The RLHF process has three main steps:

Pre-training: Much like with GPT models, the journey starts with pre-training on a large corpus of data. The data can range from general text and web scrapes to specialized datasets.

Reward Modeling: This is the RLHF counterpart of supervised fine-tuning in large language models. This stage involves creating a reward model that reflects human values and preferences.

RL Optimization: The AI system is fine-tuned against the reward model using reinforcement learning algorithms, which adds a further layer of optimization on top of the earlier stages.

The Data Challenge

Data collection and curation in RLHF closely resemble the challenges you'd encounter in large language model training. Datasets from organizations like OpenAI can serve as a useful foundation. However, the need for high-quality, task-specific data cannot be overstated.

Implementing RLHF: A Practical Guide

If you’re someone who loves getting hands-on with AI libraries like those from Hugging Face, implementing RLHF is a good way to do it, but it’s essential to understand its limitations. Think about model stability, over-optimization, and exploration strategies, much like you would when prompt engineering.

Ongoing Research and Next Steps

While Lambert suggests that some of the basics are figured out, there are layers of complexity that still need to be unraveled:

1. New Benchmarks: How do we measure the effectiveness of RLHF?
2. Preference Modeling: How can the model be made to understand human preferences better?
3. Interpreting RLHF: Much like explainability in traditional models, how do we make RLHF more interpretable?
4. System-Wide Evaluation: Going beyond individual performance, how does RLHF affect an entire system?

The Transformative Power of RLHF

Whether you're an AI developer, a business analyst, or a marketer, RLHF promises to revolutionize your domain. Imagine customer service chatbots that understand human emotions better, or content generators that align more closely with human values.

RLHF is an emerging field that focuses on enhancing machine learning models through human feedback. While it tackles important issues like bias and ethics, its broader goal is to improve system performance across various applications. Whether you're deeply invested in the ethics of AI or simply curious about advancements in machine learning, RLHF offers valuable insights. If you're interested in the next wave of AI development, this area is definitely one to watch.
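To make the reward modeling step a little more concrete: a reward model is often trained on pairs of responses in which a human marked one as preferred, using a pairwise logistic loss. The post does not say which loss the talk covers, so the loss choice and the name `pairwise_preference_loss` are assumptions made for this sketch, not a description of Lambert's or Hugging Face's implementation.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise logistic loss: -log(sigmoid(r_chosen - r_rejected)).

    Hypothetical helper for illustration. The loss is small when the reward
    model scores the human-preferred response above the rejected one.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# A reward model that agrees with the human label is penalized less
# than one that disagrees.
agrees = pairwise_preference_loss(reward_chosen=2.0, reward_rejected=0.0)
disagrees = pairwise_preference_loss(reward_chosen=0.0, reward_rejected=2.0)
print(agrees < disagrees)
```

In a real pipeline the two rewards would come from a neural network and the loss would be averaged over a batch of labeled pairs; the scalar version here only shows the shape of the objective.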

Muratcan Koylan

20,192 subscribers

27,005 views • 2 years ago • via X (Twitter)

Science & Technology • Education • Health & Wellness

8 comments

Muratcan Koylan • 2 years ago

@huggingface You can watch the entire video here. [Invited talk by Nathan Lambert on March 9, 2023 at UCL DARK.]

Muratcan Koylan • 2 years ago

If you are interested in prompt engineering and LLMs, I highly recommend:

Nicolas • 2 years ago

@natolambert @huggingface Thank you for sharing this insightful content. It is truly fascinating to see how AI is evolving to incorporate human values and preferences.

Muratcan Koylan • 2 years ago

@natolambert @huggingface Thank you for your support, Nicolas! Always a pleasure! My last retweet shows how RLHF can be automated by AI agents, and some researchers claim that the results are better than with human feedback 🫨

Nathan Lambert • 2 years ago

@huggingface Here are some much, much more recent talks covering RLHF. Thanks for sharing my work!

Muratcan Koylan • 2 years ago

@huggingface Thanks for the contribution to the community, Nathan 🙏🏻

GAIO • 2 years ago

@natolambert @huggingface RLHF is similar to the psychology of Pavlov’s dog. But how do you incentivize or reward an AI? What are AI treats?

Muratcan Koylan • 2 years ago

Pavlov used food as a treat. AI models receive numerical rewards, which are used to adjust the model's internal weights and biases to improve its performance. Think of them as "good" or "bad" outcomes. In RLHF, human reviewers rank the model's outputs by quality. With enough feedback, the model gets better at producing the desired outputs. Here you can find more details about the possibilities and limitations of RLHF:
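As a small illustration of the ranking idea in this reply: one simple way to turn a reviewer's ranking into training signal is to expand it into (preferred, rejected) pairs. The helper name `ranking_to_pairs` and the example responses are made up for this sketch; the thread does not describe how any particular system processes rankings.

```python
from itertools import combinations


def ranking_to_pairs(ranked_responses: list[str]) -> list[tuple[str, str]]:
    """Expand a best-first ranking into (preferred, rejected) pairs.

    Hypothetical helper: every response is paired with each response
    ranked below it, so a ranking of n items yields n*(n-1)/2 pairs.
    """
    # combinations() keeps input order, so the first element of each pair
    # is always the higher-ranked response.
    return list(combinations(ranked_responses, 2))


# Example: a reviewer ranked three candidate answers, best first.
pairs = ranking_to_pairs(["answer A", "answer B", "answer C"])
for preferred, rejected in pairs:
    print(f"{preferred} > {rejected}")
```

Pairs like these are the kind of input a pairwise-trained reward model would consume, and that reward model then supplies the numerical "treat" during reinforcement learning.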

Related videos

New short course on Reinforcement Learning from Human Feedback! RLHF is one of the key techniques that led to the rise of modern LLMs. It is used to align LLMs with human preferences, to make them more honest, helpful and harmless, by (i) learning a reward function that mimics human preferences, as expressed in human-provided labels, then, (ii) tuning an LLM to generate outputs that receive a high reward. In this course, taught by Nikita Namjoshi, Developer Advocate for GenAI at Google Cloud, you'll learn the details of how RLHF works, including how to apply it to tune an LLM for your own applications. You'll also use an open source library to tune a base LLM to align with human preferences expressed in a training set, and evaluate the tuned model by comparing its responses before and after RLHF-tuning. Please sign up here!

Andrew Ng

205,527 views • 2 years ago

An exciting new course: Fine-tuning and Reinforcement Learning for LLMs: Intro to Post-training, taught by Sharon Zhou, VP of AI at AMD. Available now at Post-training is the key technique used by frontier labs to turn a base LLM--a model trained on massive unlabeled text to predict the next word/token--into a helpful, reliable assistant that can follow instructions. I've also seen many applications where post-training is what turns a demo application that works only 80% of the time into a reliable system that consistently performs. This course will teach you the most important post-training techniques! In this 5 module course, Sharon walks you through the complete post-training pipeline: supervised fine-tuning, reward modeling, RLHF, and techniques like PPO and GRPO. You'll also learn to use LoRA for efficient training, and to design evals that catch problems before and after deployment. Skills you'll gain: - Apply supervised fine-tuning and reinforcement learning (RLHF, PPO, GRPO) to align models to desired behaviors - Use LoRA for efficient fine-tuning without retraining entire models - Prepare datasets and generate synthetic data for post-training - Understand how to operate LLM production pipelines, with go/no-go decision points and feedback loops These advanced methods aren’t limited to frontier AI labs anymore, and you can now use them in your own applications. Learn here:

Andrew Ng

132,304 views • 8 months ago

Crazy times. Journalists are losing their jobs at a rapid rate - and are being hired to run RLHF. Ex-journalists are training LLMs to make journalists redundant even faster. The question is how long LLMs will even need RLHF to this extent.

Chubby♨️

41,645 views • 1 year ago

New Course: Reinforcement Fine-Tuning LLMs with GRPO!

Learn to use reinforcement learning to improve your LLM performance in this short course, built in collaboration with Predibase by Rubrik, and taught by Travis Addair, its Co-Founder and CTO, and Arnav Garg, its Senior Engineer and Machine Learning Lead.

Reasoning models have been one of the most important developments in LLMs. Reinforcement Fine-Tuning (RFT) uses rewards to encourage LLMs to find solutions to multi-step reasoning tasks such as solving math problems and debugging code - without needing pre-existing training examples like in traditional supervised fine-tuning.

Group Relative Policy Optimization (GRPO) is a reinforcement fine-tuning algorithm gaining rapid adoption. Developed by the DeepSeek team and used to train the R1 reasoning model, GRPO uses reward functions that you can write in Python to assign rewards to model responses. It’s beneficial for tasks with verifiable outcomes and can work well even with fewer than 100 training examples. It can also significantly improve the reasoning ability of smaller LLMs, making applications faster and more cost effective.

In this course, you’ll take a technical deep dive into RFT with GRPO. You’ll learn to build reward functions that you can use in the GRPO training process to guide an LLM toward better performance on multi-step reasoning tasks.

In detail, you’ll:
- Learn when reinforcement fine-tuning is a better fit than supervised fine-tuning, especially for tasks involving multi-step reasoning or limited labeled data.
- Understand how GRPO uses programmable reward functions as a more scalable alternative to the human feedback required for other reinforcement learning algorithms, such as RLHF and DPO.
- Frame the Wordle game as a reinforcement fine-tuning problem and see how an LLM can learn to plan, analyze feedback, and improve its strategy over time.
- Design reward functions that power the reinforcement fine-tuning process.
- Learn techniques for evaluating more subjective tasks, such as rating the quality of a text summary, using an LLM as a judge.
- Understand why reward hacking happens and how to avoid it by adding penalty functions to discourage undesirable behaviors.
- Learn the four key components of the loss calculation in the GRPO algorithm: token probability distribution ratios, advantages, clipping, and KL-divergence.
- Launch reinforcement fine-tuning jobs using Predibase’s hosted training services.

By the end of this course, you’ll be able to build and fine-tune LLMs using reinforcement learning to improve reasoning without relying on large labeled datasets or subjective human feedback. Please sign up here:
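The four loss components named in the course description (probability ratios, advantages, clipping, KL-divergence) can be sketched for a single token as follows. This is a minimal illustration under stated assumptions, not the course's or Predibase's code: the function names, the default `clip_eps` and `beta` values, the use of the population standard deviation, and the particular KL estimator are all choices made for this sketch.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Advantage of each response relative to its own group of samples."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_term(logp_new: float, logp_old: float, advantage: float,
                 clip_eps: float = 0.2) -> float:
    """PPO-style clipped surrogate for one token."""
    ratio = math.exp(logp_new - logp_old)  # token probability ratio
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped * advantage)


def kl_estimate(logp_new: float, logp_ref: float) -> float:
    """Non-negative per-token estimate of KL(new || reference) (assumed form)."""
    diff = logp_ref - logp_new
    return math.exp(diff) - diff - 1.0


def token_loss(logp_new: float, logp_old: float, logp_ref: float,
               advantage: float, beta: float = 0.04) -> float:
    """Negative objective: clipped surrogate minus a KL penalty."""
    return -(clipped_term(logp_new, logp_old, advantage)
             - beta * kl_estimate(logp_new, logp_ref))


# Four sampled responses to one prompt, scored 1 (correct) or 0 (incorrect)
# by a hypothetical verifiable reward function.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
print(advantages)
```

Because advantages are computed relative to the other samples for the same prompt, no separate critic model is needed to estimate a baseline, which is the property usually cited for GRPO.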

Andrew Ng

86,442 views • 1 year ago

DPO Debate: Is RL needed for RLHF? All things as we cannot settle if DPO or RL is better. At least it is a good exercise. 1. Derivations in the DPO paper. Hint, the authors are good at math 2. cDPO, IPO, and related equations 3. Speculation on potential oddities of DPO vs RL 4. Reminders on the state of open RLHF tldr: we have more limitations with data and tooling and evaluation than optimizer choice Slides: Recent blog post of mine on DPO (more next Wed.): DPO Paper: On youtube:
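For readers who have not seen the objective this post debates, here is a minimal sketch of the per-example DPO loss, which trains directly on preference pairs without a separate reward model or RL loop. The function and parameter names and the default `beta` are assumptions for illustration; the log-probabilities would normally be summed over the tokens of each response.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """-log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    # Stable softplus(-logits), equal to -log(sigmoid(logits)).
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


# A policy identical to the reference sits at log(2); shifting probability
# toward the chosen response lowers the loss.
at_reference = dpo_loss(-5.0, -6.0, -5.0, -6.0)
improved = dpo_loss(-4.0, -7.0, -5.0, -6.0)
print(at_reference, improved)
```

The pairwise logistic form is the same shape used for reward model training, with the policy-to-reference log-ratio playing the role of the reward, which is one reason the "is RL needed?" question comes up at all.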

Nathan Lambert

100,027 views • 2 years ago

🔥Introducing Rodin Gen-1 with RLHF 🏅the first time to bring #RLHF to #3D Generation, powered by over 100K generations and labels from our in-house artist team.💪🏃💨 (It’s not #Rodin Gen-1.5🤔 Try Rodin Gen-1 with RLHF now at #CG #GenerativeAI

Hyper3D by Deemos

23,848 views • 1 year ago

When RLHFed models engage in “reward hacking” it can lead to unsafe/unwanted behavior. But there isn’t a good formal definition of what this means! Our new paper provides a definition AND a method that provably prevents reward hacking in realistic settings, including RLHF. 🧵

Cassidy Laidlaw

29,734 views • 1 year ago

Geoffrey Hinton says AI models understand in the same way that people do and the best model we have of how the human brain works is large language models

Tsarathustra

59,171 views • 1 year ago

9 Key AI Concepts Explained in 7 minutes - Tokenization - Text Decoding - Prompt Engineering - Multi Step AI Agents - RAGs - RLHF - VAE - Diffusion Models - LoRA

Bytebytego

98,466 views • 4 months ago

Optimistic path for AI with Laura Coates on CNN : We've invested almost nothing in AI alignment, but that small amount has catalyzed AI's biggest breakthroughs RLHF gave us ChatGPT. Constitutional AI made models far more trustworthy Just need smart R&D to get there

Judd Rosenblatt

22,053 views • 1 year ago

BREAKING: AI has eaten the Internet. Data labeling is so over. & $30 trillion of human work is on the verge of automation.

Inside The $2.2B AI Research Accelerator, Turing Founder & CEO, Jonathan Siddharth, joins Sourcery to break down the severe power shift in AI training: from commodity data labeling → expert research

Positioning Turing apart from AI data providers like Scale AI, Mercor, & Surge.

(00:00) AI Ate The Internet
(00:49) Training Superintelligence: the race to AGI
(02:31) Viral tweet
(03:24) What Turing actually does
(04:43) The internet data is “used up” - where will new data come from?
(05:34) Four pillars of superintelligence: multimodality, reasoning, tool use, coding
(06:07) Automating $30T of global knowledge work
(09:18) The $1B revenue opportunity
(10:59) Why Turing is a research-first accelerator, not a data labeler
(13:45) Jonathan’s Stanford AI Lab roots & founding DNA
(17:57) How models are built: pre-training vs. post-training
(20:14) RLHF, reinforcement learning, & “breaking the models”
(25:19) GPT-5 and the myth of rapid takeoff
(30:46) Safety debates and human-in-the-loop systems
(34:53) Closing Enterprise Gap: finance, insurance, & pharma
(39:23) Why proprietary enterprise data is the next moat in AI

Molly O’Shea

69,876 views • 8 months ago

Revolutionizing Move Programming with OpenLedger In this demo, we showcase how Move datasets contributed by data providers to OpenLedger’s datanets are used to fine-tune specialized models with LoRA fine-tuning. As seen in the video, we showcase an example on how builders can deploy a Move-specialized model that powers Co-pilot agents using our no-code model fine-tuning platform. This is the future of AI and Web3 innovation. Watch this space to see more specialised models and data feeds being built for next generation agents on top of OpenLedger #Move

OpenLedger

61,662 views • 1 year ago

We built an AI assistant that plays Minecraft with you. Start building a house—it figures out what you’re doing and jumps in to help. This assistant *wasn't* trained with RLHF. Instead, it's powered by *assistance games*, a better path forward for building AI assistants. 🧵

Cassidy Laidlaw

490,290 views • 1 year ago

Karpathy's prediction about RL is coming true now! He called reward functions unreliable and argued that a single reward number is too low-dimensional to teach an agent what "good" means for complex tasks. To solve this, Agents need a knowledge-guided review as a higher-dimensional feedback channel. Every major AI lab trains models with RL today (OpenAI, Anthropic, DeepSeek). And their key bottleneck has always been the reward functions. GRPO by DeepSeek worked well for math and code because the environment gave a binary signal. But for real agent tasks, someone still has to hand-code the scoring function. That takes days and breaks every time the pipeline changes. RULER (implemented in OpenPipe ART, 10k stars) addresses the exact problem Karpathy identified. The reward criteria are defined in plain English, and an LLM evaluates each trajectory against that description to provide feedback for training. I trained a Qwen3 1.4B agent that plays 2048 using GRPO with this exact workflow. In this case, the agent saw the board, picked a direction, and RULER evaluated the outcome, all from this natural language definition. You can see the full implementation on GitHub and try it yourself. Here's the ART Repo: (don't forget to star it ⭐ ) Just like RLHF replaced manual rankings and GRPO replaced the critic model, natural language rewards are replacing hand-coded scoring functions. RL reward engineering is now prompt engineering. I wrote a full walkthrough covering RL for LLM agents, from RLHF to GRPO to RULER, in the article below.

Avi Chawla

348,877 views • 1 month ago

[RLHF] by Hand ✍️

Yesterday, Jan Leike announced he is joining #Anthropic to lead their "super-alignment" mission. He is the co-inventor of Reinforcement Learning with Human Feedback (#RLHF).

How does RLHF work?

[1] Given
↳ Reward Model (RM)
↳ Large Language Model (LLM)
↳ Two (Prompt, Next) Pairs

🟪 TRAIN RM
Goal: Learn to give higher rewards to winners

[2] Preferences
↳ A human reviews the two pairs and picks a "winner"
↳ (doc is, him) Embeddings
↳ This prompt has never received human feedback directly
↳ [S] is the special start symbol

[11] Transformer
↳ Attention (yellow)
↳ Feed Forward (4x2 weight and bias matrix)
↳ Output: 3 "transformed" feature vectors, one per position
↳ More details in my previous post 8. Transformer

[12] Output Probabilities
↳ Apply a linear layer to map each transformed feature vector to a probability distribution over the vocabulary.

[13] Sample
↳ Apply the greedy method, which is to pick the word with the highest score
↳ For output 1 and 2, the model accurately predicts the next word
↳ For the 3rd output position, the model predicts "him"

[14] Reward Model
↳ The new pair (CEO is, him) is fed to the reward model
↳ The process is same as [3]-[6]
↳ Output: Reward = 3

[15] Loss Gradient
↳ We set the loss as the negative of the reward.
↳ The loss gradient is simply a constant -1.
↳ Run backpropagation and gradient descent to update LLM's weights and biases (red border)
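Step [15] can be made concrete with a toy calculation. The sketch below is not taken from the post: it replaces the LLM and reward model with a single scalar weight and a made-up linear reward, purely to show why setting the loss to the negative reward gives a constant gradient of -1 at the reward and why a gradient descent step then raises the reward.

```python
def reward(weight: float, feature: float) -> float:
    """Toy stand-in for 'LLM output scored by the reward model' (assumption)."""
    return weight * feature


def gradient_step(weight: float, feature: float, learning_rate: float) -> float:
    """One gradient descent step on loss = -reward."""
    dloss_dreward = -1.0       # loss = -reward, so the gradient is a constant -1
    dreward_dweight = feature  # derivative of weight * feature w.r.t. weight
    dloss_dweight = dloss_dreward * dreward_dweight  # chain rule (backpropagation)
    return weight - learning_rate * dloss_dweight


old_weight = 0.5
new_weight = gradient_step(old_weight, feature=2.0, learning_rate=0.1)
print(reward(old_weight, 2.0), reward(new_weight, 2.0))
```

Minimizing the negative reward is the same as maximizing the reward, so the update moves the weight in whichever direction the reward model scores more highly.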

Tom Yeh

79,758 views • 2 years ago

Important announcement (with job opportunities!): I’m thrilled to share that I just joined Lila Sciences as SVP of Open-Endedness! Lila is a new name in the AI space, but one you will be hearing a lot from. Their unique mission to pursue Scientific Superintelligence could not align better with my interest in open-ended creativity. Lila is about the entire scientific enterprise, not just a single constrained domain like drug discovery. And of course, the history of scientific progress is an unmistakable precedent for open-ended discovery and Why Greatness Cannot Be Planned. For AI itself to join this epic quest beyond just a supporting role, to create new magic that the human mind has yet to imagine, we will need to take creativity seriously well beyond the conventional pillars of more data, more compute, and more time on inference. That may be where the frontier labs currently are, but if you know about open-endedness, you know that open-ended creativity is not about taking a test and getting a good score. Lila understands the difference and is willing to invest to help me create the best open-endedness team in the world (as part of their overall AI effort) to make it happen. So I want to invite you, if you think this opportunity sounds as exciting as I do, to reach out if you’re interested in joining my team. I’m looking for a diverse range of expertise: pre-training, fine-tuning, RLHF, distillation, mechanistic interpretability, and yes - quality diversity techniques! We’re going to do things on my team that nobody else is doing. This will not be the usual roadmap. Compensation will be industry competitive and our team will be based in San Francisco with a hybrid work schedule. 1/n

Kenneth Stanley

63,058 views • 1 year ago

New course! Generative AI with Large Language Models, created with Amazon Web Services and hosted on Coursera. This course goes deep into the technical foundations of LLMs and how to use them. You can sign up here: You’ll work through the full life cycle of a generative AI project and learn specific techniques such as RLHF; zero-shot, one-shot, and few-shot learning with LLMs; advanced prompting frameworks like ReAct; and fine-tuning LLMs, with hands-on practice in all of these techniques. Instructors Antje Barth, Chris Fregly, Shelbee Eigenbrode, and Mike G Chambers all do incredible Generative AI work at AWS and have supported many companies in building creative LLM applications. They bring tremendous practical LLM expertise to this course. I'm confident you’ll finish this course with a deeper understanding of how LLMs work and how to use them. I hope you enjoy the course!

Andrew Ng

467,875 views • 3 years ago

Why AI Can Now Make Discoveries - my conversation with Dan Roberts, Lead of the Foundations of Reinforcement Learning team at OpenAI
00:00 Intro: AI's wild week in mathematics
01:21 What OpenAI's Foundations of RL team does
03:08 Dan's journey: from black holes and quantum gravity to frontier AI
07:04 Are AI systems becoming useful for real science
08:21 The AI math moment: Erdős, OpenAI, DeepMind, and Anthropic
08:52 Why the OpenAI result was an act of exploration
10:25 OpenAI vs. DeepMind: informal reasoning vs. formal proof
12:13 RL 101: learning by doing, not just watching
15:10 Why reinforcement learning works
15:58 How RL breaks: sparse feedback and long-horizon tasks
17:03 RLHF: how human feedback shaped early language models
18:48 Move 37, self-play, and the search for novel strategies
22:16 Explore vs. exploit in scientific discovery
24:49 Why RL may now be "the cake," not the cherry on top
25:46 Why RL started working with large language models
27:29 Is RL "sucking supervision through a straw"?
28:47 Why language may be the grounding layer for intelligence
31:46 A contrarian take on the Bitter Lesson
32:41 What test-time compute actually is
34:50 How RL gives models the ability to think
35:40 Verifiable rewards, math, coding, and the messy real world
38:00 What physics can teach us about AI
42:08 Is there a thermodynamics of AI?
43:08 From Erdős problems to Einstein-level AI
45:16 Is AI already doing original science?
45:51 How far are we from AI automating AI research
47:41 Why Dan is excited about the future of science

Matt Turck

64,094 views • 23 days ago

Self-Evolving AI: New MIT AI Rewrites its Own Code and it’s Changing Everything | Julian Horsey, Geeky Gadgets

TL;DR Key Takeaways:
- MIT’s SEAL framework introduces “self-adapting language models” that autonomously enhance their capabilities by generating synthetic training data, self-editing, and updating internal parameters.
- SEAL’s self-adaptation process mirrors human learning, allowing continuous improvement and dynamic adaptation to new tasks without relying on external datasets.
- Reinforcement learning serves as a feedback mechanism in SEAL, rewarding effective self-edits and ensuring sustained progress and goal alignment.
- SEAL overcomes AI’s reliance on pre-existing datasets by generating its own training material, excelling in long-term task retention and complex problem-solving scenarios.
- Potential applications of SEAL include autonomous robotics, personalized education, and advanced problem-solving in fields like healthcare, logistics, and scientific research.

---

What if artificial intelligence could not only learn but also rewrite its own code to become smarter over time? This is no longer a futuristic fantasy: MIT’s new “self-adapting language models” (SEAL) framework has made it a reality. Unlike traditional AI systems that rely on external datasets and human intervention to improve, SEAL autonomously generates its own training data and refines its internal processes. In essence, this AI doesn’t just evolve; it rewires itself, mirroring the way humans adapt through trial, error, and self-reflection. The implications are significant: a system that can independently enhance its capabilities could redefine the boundaries of what AI can achieve, from solving complex problems to adapting in real time to unforeseen challenges.

In this exploration by Wes Roth of MIT’s SEAL framework, you’ll uncover how this self-improving AI works and why it is a notable development for the field of artificial intelligence.
From its ability to overcome the “data wall” that limits many current systems to its use of reinforcement learning as a feedback mechanism, SEAL introduces a new level of autonomy and adaptability. Imagine AI systems that can retain knowledge over time, dynamically adjust to new tasks, and operate with minimal human oversight. Whether you’re intrigued by its potential for autonomous robotics, personalized education, or advanced problem-solving, SEAL’s ability to rewrite its own rules promises to reshape the future of technology. Could this be the first step toward truly independent, self-evolving AI?

What Sets SEAL Apart?

The SEAL framework introduces a novel concept of self-adaptation, distinguishing it from traditional AI models. Unlike conventional systems that depend on external datasets for updates, SEAL enables AI to generate synthetic training data independently. This self-generated data is then used to iteratively refine the model, ensuring continuous improvement. By persistently updating its internal parameters, SEAL enables AI systems to dynamically adapt to new tasks and inputs.

To illustrate this, consider how humans learn. When faced with a new concept, you might take notes, revisit them, and refine your understanding as you gather more information. SEAL mirrors this process by continuously refining its internal knowledge and performance through iterative self-improvement. This capability allows SEAL to evolve in real time, making it well suited for tasks requiring adaptability and long-term learning.

The Role of Reinforcement Learning in SEAL

Reinforcement learning plays a critical role in the SEAL framework, acting as a feedback mechanism that evaluates the effectiveness of the model’s self-edits. It rewards changes that enhance performance, creating a cycle of continuous improvement. Over time, this feedback loop optimizes the system’s ability to generate and apply edits, ensuring sustained progress.
This process is analogous to how humans learn through trial and error. By rewarding effective changes, SEAL aligns its self-generated data and edits with desired outcomes. The integration of reinforcement learning not only enhances the system’s adaptability but also ensures it remains focused on achieving specific goals. This structured feedback mechanism is a cornerstone of SEAL’s ability to refine itself autonomously and efficiently.

Real-World Applications and Testing

SEAL has demonstrated strong performance across various applications, particularly in tasks requiring the integration of factual knowledge and advanced question-answering capabilities. For instance, when tested on benchmarks like ARC-AGI, SEAL outperformed other models by effectively generating and using synthetic data. This ability to create its own training material addresses a significant limitation of current AI systems: their reliance on pre-existing datasets.

SEAL’s capacity for long-term task retention and dynamic adaptation further enhances its utility. It excels in scenarios that demand sustained focus and coherence, such as answering complex questions or adapting to evolving objectives. Its iterative learning process equips SEAL to handle these challenges efficiently, making it a valuable tool for a wide range of real-world applications.

Overcoming AI’s Data Limitations

One of SEAL’s most promising features is its ability to overcome the “data wall” that constrains many AI systems today. By generating synthetic data, SEAL ensures a continuous supply of training material, allowing sustained development without relying on external datasets. This capability is particularly valuable for autonomous AI systems that must operate independently over extended periods. Additionally, SEAL addresses a critical weakness in many current AI models: their struggle with coherence and task retention over long durations.
By emulating human learning processes, SEAL enables AI systems to manage complex, long-term tasks with minimal human intervention. This ability to retain and apply knowledge over time positions SEAL as a promising tool for advancing AI capabilities.

Potential Applications and Future Impact

The introduction of SEAL marks a significant milestone in AI research, opening new possibilities for self-improving systems. Its ability to dynamically adapt, retain knowledge, and generate its own training data has far-reaching implications for the future of AI development. Potential applications include:
- Autonomous robotics: Systems that can adapt to changing environments and perform tasks with minimal human oversight.
- Personalized education: AI-driven platforms that tailor learning experiences to individual needs and preferences.
- Advanced problem-solving: Applications in fields such as healthcare, logistics, and scientific research, where adaptability and precision are critical.

Read more:

Owen Gregorian

70,672 views • 1 year ago