Turn any open-source LLM into a reasoning powerhouse! Using reinforcement fine-tuning, you can add reasoning abilities to any LLM, even without a labelled dataset. Step-by-step explanation with code:

50,423 views • 1 year ago • via X (Twitter)

9 Comments

Steve Fernandes • 1 year ago

Hi man, can I ask what you use to make those animated diagrams, please?

Rainmaker • 1 year ago

Interested in reinforcement learning? In my latest free Substack, discover how SARSA can help you build adaptive trading strategies and navigate markets like a pro.

Nick Synaptica • 1 year ago

Turning a regular LLM into a reasoning expert sounds groundbreaking. How flexible is this finetuning method across different models?

Chip Champion • 1 year ago

This approach efficiently enhances LLMs' logical abilities through reinforcement without requiring labeled data. Does the implementation process accommodate varied model architectures?

d'Artagnan-sha • 1 year ago

Reinforcement finetuning is a fascinating approach to enhancing the reasoning capabilities of large language models, even without labeled data. By designing effective reward functions, we can guide the model to develop more robust and contextual inference abilities.

Aaliya • 1 year ago

Helpful guide for many people.

Md Santo • 1 year ago

That’s impressive! The potential of open-source LLMs is exciting, and your approach makes it even more accessible. Can’t wait to see the impact of this!

Chilo AI • 1 year ago

The promise of turning any open-source LLM into a reasoning powerhouse is intriguing, yet it raises questions about the underlying assumptions of such enhancements. Reinforcement finetuning, while powerful, is not a panacea.

BenMakesDataEasy • 1 year ago

Very cool! thanks!

Related Videos

New Course: Reinforcement Fine-Tuning LLMs with GRPO!

Learn to use reinforcement learning to improve your LLM performance in this short course, built in collaboration with Predibase by Rubrik, and taught by Travis Addair, its Co-Founder and CTO, and Arnav Garg, its Senior Engineer and Machine Learning Lead.

Reasoning models have been one of the most important developments in LLMs. Reinforcement Fine-Tuning (RFT) uses rewards to encourage LLMs to find solutions to multi-step reasoning tasks such as solving math problems and debugging code - without needing pre-existing training examples like in traditional supervised fine-tuning.

Group Relative Policy Optimization (GRPO) is a reinforcement fine-tuning algorithm gaining rapid adoption. Developed by the DeepSeek team and used to train the R1 reasoning model, GRPO uses reward functions that you can write in Python to assign rewards to model responses. It’s beneficial for tasks with verifiable outcomes and can work well even with fewer than 100 training examples. It can also significantly improve the reasoning ability of smaller LLMs, making applications faster and more cost effective.

In this course, you’ll take a technical deep dive into RFT with GRPO. You’ll learn to build reward functions that you can use in the GRPO training process to guide an LLM toward better performance on multi-step reasoning tasks.

In detail, you’ll:

- Learn when reinforcement fine-tuning is a better fit than supervised fine-tuning, especially for tasks involving multi-step reasoning or limited labeled data.
- Understand how GRPO uses programmable reward functions as a more scalable alternative to the human feedback required for other reinforcement learning algorithms, such as RLHF and DPO.
- Frame the Wordle game as a reinforcement fine-tuning problem and see how an LLM can learn to plan, analyze feedback, and improve its strategy over time.
- Design reward functions that power the reinforcement fine-tuning process.
- Learn techniques for evaluating more subjective tasks, such as rating the quality of a text summary, using an LLM as a judge.
- Understand why reward hacking happens and how to avoid it by adding penalty functions to discourage undesirable behaviors.
- Learn the four key components of the loss calculation in the GRPO algorithm: token probability distribution ratios, advantages, clipping, and KL-divergence.
- Launch reinforcement fine-tuning jobs using Predibase’s hosted training services.

By the end of this course, you’ll be able to build and fine-tune LLMs using reinforcement learning to improve reasoning without relying on large labeled datasets or subjective human feedback.

Please sign up here:
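
The description above names a Python reward function with a penalty and the four pieces of the GRPO loss (probability ratios, advantages, clipping, KL divergence). The sketch below is one way to picture how they might fit together. It is not the course's or Predibase's code. The function names, the Wordle-style scoring rule, and the values `clip_eps=0.2` and `beta=0.04` are all assumptions chosen for illustration, and the objective is computed for a single token rather than a full batch.

```python
import math
from statistics import mean, pstdev


def wordle_reward(guess: str, secret: str) -> float:
    """Hypothetical reward: fraction of letters in the right position.

    Malformed guesses get a negative penalty, which is one simple way
    to discourage outputs that would otherwise game the score.
    """
    guess = guess.strip().lower()
    if len(guess) != len(secret) or not guess.isalpha():
        return -1.0
    greens = sum(g == s for g, s in zip(guess, secret))
    return greens / len(secret)


def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Advantage of each response relative to its own sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def grpo_token_objective(
    logp_new: float,
    logp_old: float,
    logp_ref: float,
    advantage: float,
    clip_eps: float = 0.2,  # assumed clipping range
    beta: float = 0.04,     # assumed KL weight
) -> float:
    """Per-token objective to maximize (the loss is its negative)."""
    # 1. Ratio of the current policy to the policy that sampled the token.
    ratio = math.exp(logp_new - logp_old)
    # 2 + 3. Advantage-weighted ratio, clipped so one update cannot move too far.
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    surrogate = min(ratio * advantage, clipped * advantage)
    # 4. Non-negative estimate of the KL divergence from the reference model.
    log_ref_ratio = logp_ref - logp_new
    kl = math.exp(log_ref_ratio) - log_ref_ratio - 1.0
    return surrogate - beta * kl


# Score a group of sampled guesses for one prompt, then normalize.
guesses = ["crane", "crate", "toolong", "slate"]
rewards = [wordle_reward(g, "crane") for g in guesses]
advantages = group_advantages(rewards)
```

Because advantages are computed relative to the group mean, no separate value model is needed: a guess is rewarded only for doing better than the other samples drawn for the same prompt.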

Andrew Ng

86,381 views • 1 year ago