Reinforcement Learning from Human Feedback (RLHF) is gaining traction. The field aims to make AI more responsible by incorporating human values and preferences. In this video, Nathan Lambert, a research scientist and RLHF team lead at Hugging Face, explores its inner workings, applications, and industry impact. RLHF has gained...

27,005 views • 2 years ago • via X (Twitter)

8 comments

Muratcan Koylan • 2 years ago

@huggingface You can watch the entire video here. [Invited talk by Nathan Lambert on March 9, 2023 at UCL DARK.]

Muratcan Koylan • 2 years ago

If you are interested in prompt engineering and LLMs, I highly recommend:

Nicolas • 2 years ago

@natolambert @huggingface Thank you for sharing this insightful content. It is truly fascinating to see how AI is evolving to incorporate human values and preferences.

Muratcan Koylan • 2 years ago

@natolambert @huggingface Thank you for your support, Nicolas! Always a pleasure! My last retweet shows how RLHF can be automated by AI agents, and some researchers claim the results are better than with human feedback 🫨

Nathan Lambert • 2 years ago

@huggingface Here are some much, much more recent talks covering RLHF. Thanks for sharing my work!

Muratcan Koylan • 2 years ago

@huggingface Thanks for the contribution to the community, Nathan 🙏🏻

GAIO • 2 years ago

@natolambert @huggingface RLHF is similar to the psychology of Pavlov’s dog. But how do you incentivize or reward an AI? What are AI treats?

Muratcan Koylan • 2 years ago

Pavlov used food as a treat. AI models receive numerical rewards, which are used to adjust the model's internal weights and biases and improve its performance. Think of them as "good" or "bad" outcomes. In RLHF, human reviewers rank the model's outputs by quality. With enough feedback, the model gets better at producing the desired outputs. Here you can find more details about the possibilities and limitations of RLHF:

Related videos

New Course: Reinforcement Fine-Tuning LLMs with GRPO!
Learn to use reinforcement learning to improve your LLM performance in this short course, built in collaboration with Predibase by Rubrik, and taught by Travis Addair, its Co-Founder and CTO, and Arnav Garg, its Senior Engineer and Machine Learning Lead.
Reasoning models have been one of the most important developments in LLMs. Reinforcement Fine-Tuning (RFT) uses rewards to encourage LLMs to find solutions to multi-step reasoning tasks such as solving math problems and debugging code, without needing pre-existing training examples like in traditional supervised fine-tuning.
Group Relative Policy Optimization (GRPO) is a reinforcement fine-tuning algorithm gaining rapid adoption. Developed by the DeepSeek team and used to train the R1 reasoning model, GRPO uses reward functions that you can write in Python to assign rewards to model responses. It’s beneficial for tasks with verifiable outcomes and can work well even with fewer than 100 training examples. It can also significantly improve the reasoning ability of smaller LLMs, making applications faster and more cost effective.
In this course, you’ll take a technical deep dive into RFT with GRPO. You’ll learn to build reward functions that you can use in the GRPO training process to guide an LLM toward better performance on multi-step reasoning tasks.
In detail, you’ll:
- Learn when reinforcement fine-tuning is a better fit than supervised fine-tuning, especially for tasks involving multi-step reasoning or limited labeled data.
- Understand how GRPO uses programmable reward functions as a more scalable alternative to the human feedback required for other reinforcement learning algorithms, such as RLHF and DPO.
- Frame the Wordle game as a reinforcement fine-tuning problem and see how an LLM can learn to plan, analyze feedback, and improve its strategy over time.
- Design reward functions that power the reinforcement fine-tuning process.
- Learn techniques for evaluating more subjective tasks, such as rating the quality of a text summary, using an LLM as a judge.
- Understand why reward hacking happens and how to avoid it by adding penalty functions to discourage undesirable behaviors.
- Learn the four key components of the loss calculation in the GRPO algorithm: token probability distribution ratios, advantages, clipping, and KL-divergence.
- Launch reinforcement fine-tuning jobs using Predibase’s hosted training services.
By the end of this course, you’ll be able to build and fine-tune LLMs using reinforcement learning to improve reasoning without relying on large labeled datasets or subjective human feedback.
Please sign up here:

Andrew Ng

86,442 views • 1 year ago
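
The course announcement above names a Python reward function, a penalty for undesirable outputs, and four loss components (probability ratios, advantages, clipping, KL-divergence). The following is a minimal sketch of how those pieces could fit together for a single token, not the course's or Predibase's code. The function names, the Wordle scoring rule, and the defaults `eps=0.2` and `beta=0.04` are all assumptions chosen for illustration.

```python
import math
from statistics import mean, pstdev


def wordle_reward(guess: str, secret: str) -> float:
    """Hypothetical reward: share of letters in the correct position."""
    if len(guess) != len(secret) or not guess.isalpha():
        return -1.0  # penalty term that discourages malformed guesses
    hits = sum(g == s for g, s in zip(guess.lower(), secret.lower()))
    return hits / len(secret)


def group_advantages(rewards):
    """Normalize each reward against the group of responses to one prompt."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        return [0.0 for _ in rewards]  # identical rewards carry no signal
    return [(r - mu) / sigma for r in rewards]


def kl_estimate(logp_new: float, logp_ref: float) -> float:
    """Non-negative per-token KL estimate against a reference model."""
    diff = logp_ref - logp_new
    return math.exp(diff) - diff - 1.0


def token_loss(logp_new, logp_old, logp_ref, advantage, eps=0.2, beta=0.04):
    """Clipped surrogate objective minus a KL penalty, negated into a loss."""
    ratio = math.exp(logp_new - logp_old)            # probability ratio
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))  # clipping
    surrogate = min(ratio * advantage, clipped * advantage)
    return -(surrogate - beta * kl_estimate(logp_new, logp_ref))


rewards = [wordle_reward(g, "crate") for g in ["crane", "slate", "xx"]]
print(rewards, group_advantages(rewards))
```

Because advantages are computed relative to the group mean, no separate value (critic) model is needed; a real training loop would average `token_loss` over all tokens and responses.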

Karpathy's prediction about RL is coming true now!
He called reward functions unreliable and argued that a single reward number is too low-dimensional to teach an agent what "good" means for complex tasks. To solve this, agents need a knowledge-guided review as a higher-dimensional feedback channel.
Every major AI lab trains models with RL today (OpenAI, Anthropic, DeepSeek). And their key bottleneck has always been the reward functions. GRPO by DeepSeek worked well for math and code because the environment gave a binary signal. But for real agent tasks, someone still has to hand-code the scoring function. That takes days and breaks every time the pipeline changes.
RULER (implemented in OpenPipe ART, 10k stars) addresses the exact problem Karpathy identified. The reward criteria are defined in plain English, and an LLM evaluates each trajectory against that description to provide feedback for training.
I trained a Qwen3 1.4B agent that plays 2048 using GRPO with this exact workflow. In this case, the agent saw the board, picked a direction, and RULER evaluated the outcome, all from this natural language definition. You can see the full implementation on GitHub and try it yourself.
Here's the ART Repo: (don't forget to star it ⭐)
Just like RLHF replaced manual rankings and GRPO replaced the critic model, natural language rewards are replacing hand-coded scoring functions. RL reward engineering is now prompt engineering.
I wrote a full walkthrough covering RL for LLM agents, from RLHF to GRPO to RULER, in the article below.

Avi Chawla

348,877 views • 1 month ago
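
The idea in the post above, reward criteria written in plain English and scored by an LLM, can be sketched roughly as follows. This is not the RULER or ART API, and the real implementation may work differently. `make_judge_prompt`, `score_group`, and `stub_judge` are hypothetical names, and the stub stands in for an actual LLM call so the example runs offline.

```python
from typing import Callable, List


def make_judge_prompt(rubric: str, trajectory: str) -> str:
    """Build the text a judge model would see (format is an assumption)."""
    return (
        f"Rubric: {rubric}\n"
        f"Trajectory: {trajectory}\n"
        "Reply with a score between 0 and 1."
    )


def score_group(rubric: str, trajectories: List[str],
                judge: Callable[[str], str]) -> List[float]:
    """Ask the judge for one score per trajectory and clamp it to [0, 1]."""
    scores = []
    for trajectory in trajectories:
        reply = judge(make_judge_prompt(rubric, trajectory))
        try:
            score = float(reply.strip())
        except ValueError:
            score = 0.0  # unparseable replies earn no reward
        scores.append(min(1.0, max(0.0, score)))
    return scores


def stub_judge(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; favors reaching the 2048 tile."""
    trajectory_part = prompt.split("Trajectory:")[1]
    return "1.0" if "2048" in trajectory_part else "0.2"


print(score_group("Reward reaching higher tiles.",
                  ["reached tile 2048", "stuck at 64"], stub_judge))
```

The resulting scores would then feed a GRPO-style update in place of a hand-coded scoring function.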

[RLHF] by Hand ✍️
Yesterday, Jan Leike announced he is joining #Anthropic to lead their "super-alignment" mission. He is the co-inventor of Reinforcement Learning with Human Feedback (#RLHF).
How does RLHF work?
[1] Given
↳ Reward Model (RM)
↳ Large Language Model (LLM)
↳ Two (Prompt, Next) Pairs
🟪 TRAIN RM
Goal: Learn to give higher rewards to winners
[2] Preferences
↳ A human reviews the two pairs and picks a "winner"
↳ (doc is, him) Embeddings
↳ This prompt has never received human feedback directly
↳ [S] is the special start symbol
[11] Transformer
↳ Attention (yellow)
↳ Feed Forward (4x2 weight and bias matrix)
↳ Output: 3 "transformed" feature vectors, one per position
↳ More details in my previous post 8. Transformer
[12] Output Probabilities
↳ Apply a linear layer to map each transformed feature vector to a probability distribution over the vocabulary.
[13] Sample
↳ Apply the greedy method, which is to pick the word with the highest score
↳ For outputs 1 and 2, the model accurately predicts the next word
↳ For the 3rd output position, the model predicts "him"
[14] Reward Model
↳ The new pair (CEO is, him) is fed to the reward model
↳ The process is the same as [3]-[6]
↳ Output: Reward = 3
[15] Loss Gradient
↳ We set the loss as the negative of the reward.
↳ The loss gradient is simply a constant -1.
↳ Run backpropagation and gradient descent to update the LLM's weights and biases (red border)

Tom Yeh

79,758 views • 2 years ago
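
Steps [12], [13], and [15] of the walkthrough above can be traced in a few lines. This is a small sketch under assumed numbers: the logits and the three-word vocabulary are invented for illustration, and only the reward of 3 comes from the post.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution over the vocabulary."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits, vocab):
    """Step [13]: pick the word with the highest probability."""
    probs = softmax(logits)
    best = max(range(len(vocab)), key=lambda i: probs[i])
    return vocab[best], probs[best]


def loss_from_reward(reward):
    """Step [15]: loss = -reward, so d(loss)/d(reward) is the constant -1."""
    return -reward, -1.0


vocab = ["him", "them", "is"]  # hypothetical vocabulary
word, prob = greedy_pick([2.0, 1.0, 0.1], vocab)  # hypothetical logits
loss, grad = loss_from_reward(3.0)  # reward of 3 from step [14]
print(word, round(prob, 3), loss, grad)
```

Minimizing the negated reward is what pushes the LLM's weights toward outputs the reward model scores highly.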

Important announcement (with job opportunities!): I’m thrilled to share that I just joined Lila Sciences as SVP of Open-Endedness! Lila is a new name in the AI space, but one you will be hearing a lot from. Their unique mission to pursue Scientific Superintelligence could not align better with my interest in open-ended creativity. Lila is about the entire scientific enterprise, not just a single constrained domain like drug discovery. And of course, the history of scientific progress is an unmistakable precedent for open-ended discovery and Why Greatness Cannot Be Planned. For AI itself to join this epic quest beyond just a supporting role, to create new magic that the human mind has yet to imagine, we will need to take creativity seriously well beyond the conventional pillars of more data, more compute, and more time on inference. That may be where the frontier labs currently are, but if you know about open-endedness, you know that open-ended creativity is not about taking a test and getting a good score. Lila understands the difference and is willing to invest to help me create the best open-endedness team in the world (as part of their overall AI effort) to make it happen. So I want to invite you, if you think this opportunity sounds as exciting as I do, to reach out if you’re interested in joining my team. I’m looking for a diverse range of expertise: pre-training, fine-tuning, RLHF, distillation, mechanistic interpretability, and yes - quality diversity techniques! We’re going to do things on my team that nobody else is doing. This will not be the usual roadmap. Compensation will be industry competitive and our team will be based in San Francisco with a hybrid work schedule. 1/n

Kenneth Stanley

63,058 views • 1 year ago

Why AI Can Now Make Discoveries - my conversation with Dan Roberts, Lead of the Foundations of Reinforcement Learning team at OpenAI
00:00 Intro: AI's wild week in mathematics
01:21 What OpenAI's Foundations of RL team does
03:08 Dan's journey: from black holes and quantum gravity to frontier AI
07:04 Are AI systems becoming useful for real science
08:21 The AI math moment: Erdős, OpenAI, DeepMind, and Anthropic
08:52 Why the OpenAI result was an act of exploration
10:25 OpenAI vs. DeepMind: informal reasoning vs. formal proof
12:13 RL 101: learning by doing, not just watching
15:10 Why reinforcement learning works
15:58 How RL breaks: sparse feedback and long-horizon tasks
17:03 RLHF: how human feedback shaped early language models
18:48 Move 37, self-play, and the search for novel strategies
22:16 Explore vs. exploit in scientific discovery
24:49 Why RL may now be "the cake," not the cherry on top
25:46 Why RL started working with large language models
27:29 Is RL "sucking supervision through a straw"?
28:47 Why language may be the grounding layer for intelligence
31:46 A contrarian take on the Bitter Lesson
32:41 What test-time compute actually is
34:50 How RL gives models the ability to think
35:40 Verifiable rewards, math, coding, and the messy real world
38:00 What physics can teach us about AI
42:08 Is there a thermodynamics of AI?
43:08 From Erdős problems to Einstein-level AI
45:16 Is AI already doing original science?
45:51 How far are we from AI automating AI research
47:41 Why Dan is excited about the future of science

Matt Turck

64,094 views • 23 days ago

Self-Evolving AI: New MIT AI Rewrites Its Own Code and It’s Changing Everything | Julian Horsey, Geeky Gadgets
TL;DR Key Takeaways:
- MIT’s SEAL framework introduces “self-adapting language models” that autonomously enhance their capabilities by generating synthetic training data, self-editing, and updating internal parameters.
- SEAL’s self-adaptation process mirrors human learning, allowing continuous improvement and dynamic adaptation to new tasks without relying on external datasets.
- Reinforcement learning serves as a feedback mechanism in SEAL, rewarding effective self-edits and ensuring sustained progress and goal alignment.
- SEAL overcomes AI’s reliance on pre-existing datasets by generating its own training material, excelling in long-term task retention and complex problem-solving scenarios.
- Potential applications of SEAL include autonomous robotics, personalized education, and advanced problem-solving in fields like healthcare, logistics, and scientific research.
---
What if artificial intelligence could not only learn but also rewrite its own code to become smarter over time? This is no longer a futuristic fantasy: MIT’s new “self-adapting language models” (SEAL) framework has made it a reality. Unlike traditional AI systems that rely on external datasets and human intervention to improve, SEAL autonomously generates its own training data and refines its internal processes. In essence, this AI doesn’t just evolve; it rewires itself, mirroring the way humans adapt through trial, error, and self-reflection. The implications are staggering: a system that can independently enhance its capabilities could redefine the boundaries of what AI can achieve, from solving complex problems to adapting in real time to unforeseen challenges.
In this exploration by Wes Roth of MIT’s SEAL framework, you’ll uncover how this self-improving AI works and why it could be a significant development for the field of artificial intelligence. From its ability to overcome the “data wall” that limits many current systems to its use of reinforcement learning as a feedback mechanism, SEAL introduces a level of autonomy and adaptability that was previously unimaginable. Imagine AI systems that can retain knowledge over time, dynamically adjust to new tasks, and operate with minimal human oversight. Whether you’re intrigued by its potential for autonomous robotics, personalized education, or advanced problem-solving, SEAL’s ability to rewrite its own rules promises to reshape the future of technology. Could this be the first step toward truly independent, self-evolving AI?
What Sets SEAL Apart?
The SEAL framework introduces a novel concept of self-adaptation, distinguishing it from traditional AI models. Unlike conventional systems that depend on external datasets for updates, SEAL enables AI to generate synthetic training data independently. This self-generated data is then used to iteratively refine the model, ensuring continuous improvement. By persistently updating its internal parameters, SEAL enables AI systems to dynamically adapt to new tasks and inputs.
To better illustrate this, consider how humans learn. When faced with a new concept, you might take notes, revisit them, and refine your understanding as you gather more information. SEAL mirrors this process by continuously refining its internal knowledge and performance through iterative self-improvement. This capability allows SEAL to evolve in real time, making it well suited for tasks requiring adaptability and long-term learning.
The Role of Reinforcement Learning in SEAL
Reinforcement learning plays a critical role in the SEAL framework, acting as a feedback mechanism that evaluates the effectiveness of the model’s self-edits. It rewards changes that enhance performance, creating a cycle of continuous improvement. Over time, this feedback loop optimizes the system’s ability to generate and apply edits, ensuring sustained progress.
This process is analogous to how humans learn through trial and error. By rewarding effective changes, SEAL aligns its self-generated data and edits with desired outcomes. The integration of reinforcement learning not only enhances the system’s adaptability but also ensures it remains focused on achieving specific goals. This structured feedback mechanism is a cornerstone of SEAL’s ability to refine itself autonomously and efficiently.
Real-World Applications and Testing
SEAL has demonstrated remarkable performance across various applications, particularly in tasks requiring the integration of factual knowledge and advanced question-answering capabilities. For instance, when tested on benchmarks like ARC-AGI, SEAL outperformed other models by effectively generating and using synthetic data. This ability to create its own training material addresses a significant limitation of current AI systems: their reliance on pre-existing datasets.
SEAL’s capacity for long-term task retention and dynamic adaptation further enhances its utility. It excels in scenarios that demand sustained focus and coherence, such as answering complex questions or adapting to evolving objectives. By using its iterative learning process, SEAL is equipped to handle these challenges efficiently, making it a valuable tool for a wide range of real-world applications.
Overcoming AI’s Data Limitations
One of SEAL’s most promising features is its ability to overcome the “data wall” that constrains many AI systems today. By generating synthetic data, SEAL ensures a continuous supply of training material, allowing sustained development without relying on external datasets. This capability is particularly valuable for autonomous AI systems that must operate independently over extended periods.
Additionally, SEAL addresses a critical weakness in many current AI models: their struggle with coherence and task retention over long durations. By emulating human learning processes, SEAL enables AI systems to manage complex, long-term tasks with minimal human intervention. This ability to retain and apply knowledge over time positions SEAL as a promising tool for advancing AI capabilities.
Potential Applications and Future Impact
The introduction of SEAL marks a significant milestone in AI research, opening new possibilities for self-improving systems. Its ability to dynamically adapt, retain knowledge, and generate its own training data has far-reaching implications for the future of AI development. Potential applications include:
- Autonomous robotics: Systems that can adapt to changing environments and perform tasks with minimal human oversight.
- Personalized education: AI-driven platforms that tailor learning experiences to individual needs and preferences.
- Advanced problem-solving: Applications in fields such as healthcare, logistics, and scientific research, where adaptability and precision are critical.
Read more:

Owen Gregorian

70,672 views • 1 year ago