Lil update on fixing DeepSeek's GRPO issues when training a small medical model. Shoutout to Zichen Liu & leloy's weekend work! The 1.5B LLM's MedMCQA score went up from 37% to 52%.

nisten🇨🇦e/acc

15,684 subscribers

26,449 views • 1 year ago • via X (Twitter)

Education Science & Technology

11 Comments

nisten🇨🇦e/acc • 1 year ago

@leloykun 31% not 37%*. The Qwen 3B model went from 37 to 49; the smaller 1.5B deepseek-distill model surprisingly went from 31 to 52. Will be posting model, training code etc. here as part of @JohnsonThomasMD's project.

nisten🇨🇦e/acc • 1 year ago

it’s pretty crazy that you can get an AI that runs on a smartwatch to talk to itself long enough that it figures out how to get a D on the US Medical Licensing exam. WITHOUT ADDING EXTRA MATERIAL. Just reinforcement learning what it already had. Human doctor score is 71%

nisten🇨🇦e/acc • 1 year ago

tfw when you end up beating an excellent model like reka flash 21B with something that can run on 5 year old phones

nisten🇨🇦e/acc • 1 year ago

There's more to fix, but yeah, Leloy was right the whole time. Funny enough, the first fix to the trainer, from the Singapore university team, was called Dr. GRPO.

M4rc0𝕏 • 1 year ago

@zzlccc @leloykun Fucking genius

Rakshith Sajjan • 1 year ago

@zzlccc @leloykun Hey nisten, amazing project. Will soon be doing an RL run on financial regulations data. Would help a ton if you could drop some resources.

elie • 1 year ago

@zzlccc @leloykun I think it’s fixed now in trl, see

nisten🇨🇦e/acc • 1 year ago

@zzlccc @leloykun damn that was fast, there may be an issue with that one too

nisten🇨🇦e/acc • 1 year ago

@leloykun thread here for future reference,

nisten🇨🇦e/acc • 1 year ago

It looks like a fix for this just got merged an hour ago. You'll need to rebuild the TRL trainer from source to apply it, however, because it's not in the pip package yet. Not entirely sure it fully fixes the length bias, will test. @cognitivecompai
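
For readers landing on this thread later: a rough sketch of what "length bias" refers to here, not the actual TRL patch. It assumes the usual description of GRPO, where each response's token losses are averaged over that response's own length and advantages are divided by the group's reward standard deviation, and the Dr. GRPO variant, which drops both normalisers and divides by a constant instead. The function name `per_token_coefficients` and the `max_len` constant are hypothetical, and clipping, the KL term, and the importance ratio are ignored on purpose.

```python
from statistics import mean, pstdev


def per_token_coefficients(rewards, lengths, variant="grpo", max_len=128, eps=1e-8):
    """Return the weight each token of each sampled response gets in the loss.

    rewards: one scalar reward per response in the group.
    lengths: token count of each response.
    max_len: hypothetical constant normaliser used by the "dr_grpo" variant.
    """
    group_size = len(rewards)
    baseline = mean(rewards)  # group-mean baseline, shared by both variants

    if variant == "grpo":
        # Advantage is scaled by the group's reward std, then each response's
        # token losses are averaged over that response's own length.
        scale = pstdev(rewards) + eps
        advantages = [(r - baseline) / scale for r in rewards]
        return [a / (group_size * n) for a, n in zip(advantages, lengths)]

    if variant == "dr_grpo":
        # No std scaling and no per-response length division: every token
        # is divided by the same constant regardless of response length.
        advantages = [r - baseline for r in rewards]
        return [a / (group_size * max_len) for a in advantages]

    raise ValueError(f"unknown variant: {variant}")


# One correct answer and two wrong ones of very different lengths.
rewards = [1.0, 0.0, 0.0]
lengths = [50, 10, 100]

grpo = per_token_coefficients(rewards, lengths, variant="grpo")
dr_grpo = per_token_coefficients(rewards, lengths, variant="dr_grpo")

print("GRPO    per-token weights:", grpo)
print("DrGRPO  per-token weights:", dr_grpo)
```

With the per-response length division, each token of the 100-token wrong answer is penalised ten times less than each token of the 10-token wrong answer, so long wrong answers get off lightly. With a constant normaliser the two wrong answers get the same per-token penalty.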
