Загрузка видео...

Не удалось загрузить видео

На главную

Introducing Reinforcement-Learned Teachers (RLTs): Transforming how we teach LLMs to reason with reinforcement learning (RL). Blog: Paper: Traditional RL focuses on “learning to solve” challenging problems with expensive LLMs and constitutes a key step in making student AI systems ultimately acquire reasoning capabilities via distillation and cold-starting. Enter our...

179,276 просмотров • 1 год назад •via X (Twitter)

Комментарии: 9

Фото профиля iMuffin
iMuffin1 год назад

Another banger from Sakana. You guys are dropping fantastic papers every week! I can't keep up.

Фото профиля DALNK ʕ •ᴥ•ʔつ━☆
DALNK ʕ •ᴥ•ʔつ━☆1 год назад

Surprised it took this long for people to find this trick

Фото профиля Zinan Lin
Zinan Lin1 год назад

Great work! Just to share — in our NeurIPS 2024 paper (arXiv’ed a year ago), we also proposed the Learning by Teaching idea and showed how it can enhance LLMs during both training and prompting.

Фото профиля Pehdrew
Pehdrew1 год назад

🧐

Фото профиля Kabir Shaurya
Kabir Shaurya1 год назад

Great drop

Фото профиля R✨
R✨1 год назад

👍

Фото профиля Xhris57
Xhris571 год назад

Thank you for sharing. Here’s a distilled summary of the Reinforcement-Learned Teachers (RLTs) concept and its significance: ⸻ 🧠 Reinforcement-Learned Teachers (RLTs) Overview 🔹 What Are RLTs? RLTs are a new class of LLM-based teacher models trained via reinforcement learning (RL) to produce high-quality, step-by-step explanations for reasoning tasks. Unlike standard RL-trained solvers, RLTs focus on teaching—not just solving. ⸻ 🔹 How RLTs Work •They’re prompted with both a problem and its solution. •They’re trained to generate interpretable reasoning paths rather than just answers. •The goal is to make them effective at distilling this reasoning into student models. ⸻ 🚀 Key Innovations & Results 1.Teaching Over Solving: RLTs don’t just learn to solve tasks—they learn how to explain them clearly for downstream distillation. 2.Efficiency Gains: A 7B parameter RLT can outperform much larger LLMs (like 70B) in teaching effectiveness, particularly in distilling reasoning ability into smaller student models. 3.Cold Start Generalization: RLTs can initialize student models with no prior task exposure—significantly improving cold-start training scenarios. 4.Teacher Smaller Than Student: A 7B RLT can successfully distill into a 32B student, challenging assumptions about size-based hierarchy in knowledge transfer. ⸻ 📊 Performance Impact •Superior results in competitive reasoning benchmarks •Enhanced training of student LLMs on step-by-step reasoning tasks •Opens up new paths for RL-driven education paradigms in AI development ⸻ 🧩 Implications •More interpretable AI: Teaching-based RL improves clarity over black-box solutioning. •Efficient scaling: Smaller models can teach larger ones—reducing compute needs. •Better alignment and control: RLTs allow for more structured reasoning supervision via RL. ⸻ 🔗 Resources •📜 Paper •🧑‍💻 Code •📝 Blog ⸻ Let me know if you’d like a diagrammatic summary, use-case extrapolation, or integration suggestion into your own symbolic framework.

Фото профиля Emily
Emily1 год назад

Good update 👌

Фото профиля Shashank Jain
Shashank Jain1 год назад

excellent work..Just had one doubt..Do we use the teacher feedback along with reward to fine tune the student or its just reward?

Похожие видео

New Course: Reinforcement Fine-Tuning LLMs with GRPO! Learn to use reinforcement learning to improve your LLM performance in this short course, built in collaboration with Predibase by Rubrik, and taught by Travis Addair, its Co-Founder and CTO, and Arnav Garg, its Senior Engineer and Machine Learning Lead. Reasoning models have been one of the most important developments in LLMs. Reinforcement Fine-Tuning (RFT) uses rewards to encourage LLMs to find solutions to multi-step reasoning tasks such as solving math problems and debugging code - without needing pre-existing training examples like in traditional supervised fine-tuning. Group Relative Policy Optimization (GRPO) is a reinforcement fine-tuning algorithm gaining rapid adoption. Developed by the DeepSeek team and used to train the R1 reasoning model, GRPO uses reward functions that you can write in Python to assign rewards to model responses. It’s beneficial for tasks with verifiable outcomes and can work well even with fewer than 100 training examples. It can also significantly improve the reasoning ability of smaller LLMs, making applications faster and more cost effective. In this course, you’ll take a technical deep dive into RFT with GRPO. You’ll learn to build reward functions that you can use in the GRPO training process to guide an LLM toward better performance on multi-step reasoning tasks. In detail, you’ll: - Learn when reinforcement fine-tuning is a better fit than supervised fine-tuning, especially for tasks involving multi-step reasoning or limited labeled data. - Understand how GRPO uses programmable reward functions as a more scalable alternative to the human feedback required for other reinforcement learning algorithms, such as RLHF and DPO. - Frame the Wordle game as a reinforcement fine-tuning problem and see how an LLM can learn to plan, analyze feedback, and improve its strategy over time. - Design reward functions that power the reinforcement fine-tuning process. - Learn techniques for evaluating more subjective tasks, such as rating the quality of a text summary, using an LLM as a judge. - Understand why reward hacking happens and how to avoid it by adding penalty functions to discourage undesirable behaviors. - Learn the four key components of the loss calculation in the GRPO algorithm: token probability distribution ratios, advantages, clipping, and KL-divergence. - Launch reinforcement fine-tuning jobs using Predibase’s hosted training services. By the end of this course, you’ll be able to build and fine-tune LLMs using reinforcement learning to improve reasoning without relying on large labeled datasets or subjective human feedback. Please sign up here:

Andrew Ng

86,442 просмотров • 1 год назад