Lil update on fixing DeepSeek's GRPO issues when training a small medical model. Shoutout to Zichen Liu & leloy! for the weekend work: the 1.5B LLM's MedMCQA score went up from 37% to 52%.
26,449 views • 1 year ago • via X (Twitter)
11 Comments

@leloykun Correction: 31%, not 37%. The Qwen 3B model went from 37% to 49%; the smaller 1.5B DeepSeek-distill model surprisingly went from 31% to 52%. Will be posting the model, training code, etc. here as part of @JohnsonThomasMD's project.


It's pretty crazy that you can get an AI that runs on a smartwatch to talk to itself long enough that it figures out how to get a D on the US Medical Licensing Exam. WITHOUT ADDING EXTRA MATERIAL. Just reinforcement learning on what it already had. Human doctor score is 71%.

tfw you end up beating an excellent model like Reka Flash 21B with something that can run on 5-year-old phones

There's more to fix, but yeah, Leloy was right the whole time. Funnily enough, the first fix to the trainer, from a Singapore university team, was called Dr. GRPO.
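One of the changes Dr. GRPO makes, as we read it, is to stop dividing each group's advantages by the group's reward standard deviation, keeping only the mean-centring. A minimal sketch of the two advantage formulas, assuming binary correctness rewards for a group of sampled answers to one question and using the population std for simplicity; the function names are hypothetical and this is not TRL's code:

```python
import statistics
from typing import List


def grpo_advantages(rewards: List[float], eps: float = 1e-4) -> List[float]:
    """GRPO-style: centre rewards on the group mean, then scale by the group std.

    eps avoids division by zero when every answer in the group scores the same.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std, chosen here for simplicity
    return [(r - mean) / (std + eps) for r in rewards]


def dr_grpo_advantages(rewards: List[float]) -> List[float]:
    """Dr. GRPO-style: centre rewards on the group mean only (no std scaling)."""
    mean = statistics.fmean(rewards)
    return [r - mean for r in rewards]


if __name__ == "__main__":
    # One correct answer out of four: a low-variance, "hard" question.
    print(grpo_advantages([1, 0, 0, 0]))     # correct answer gets roughly 1.73
    print(dr_grpo_advantages([1, 0, 0, 0]))  # correct answer gets 0.75
```

With the std division, groups that are almost all wrong (or almost all right) have a small std, so their advantages get inflated relative to a 50/50 group; dropping it makes the update size depend only on how far each reward sits from the group mean.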

@zzlccc @leloykun Fucking genius

@zzlccc @leloykun Hey nisten, amazing project. I'll soon be doing an RL run on financial regulations data. It would help a ton if you could drop some resources.

@zzlccc @leloykun I think it's fixed now in TRL, see

@zzlccc @leloykun Damn, that was fast. There may be an issue with that one too.

@leloykun Thread here for future reference:

It looks like a fix for this just got merged an hour ago. However, you'll need to install the TRL trainer from source to apply it, because it's not in the pip package yet. Not entirely sure it fully fixes the length bias; will test. @cognitivecompai
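The length bias comes from how the GRPO loss is aggregated: each response's per-token loss is averaged over that response's own length before averaging over the group, so a long wrong answer spreads its penalty thinly across many tokens. A minimal sketch comparing per-token weights under that scheme and under a constant normaliser like the one Dr. GRPO uses, assuming a per-token loss of simply -advantage (policy ratio and clipping omitted) and a hypothetical `max_tokens` budget; names are illustrative, not TRL's implementation:

```python
from typing import List


def grpo_token_weights(lengths: List[int], advantages: List[float]) -> List[float]:
    """Per-token weight for response i under 'mean over its own tokens,
    then mean over the group' (GRPO-style aggregation)."""
    g = len(lengths)
    return [a / (g * n) for n, a in zip(lengths, advantages)]


def constant_norm_token_weights(advantages: List[float], max_tokens: int) -> List[float]:
    """Per-token weight when every response is summed and divided by the same
    constant (a hypothetical max_tokens budget), so length no longer matters."""
    g = len(advantages)
    return [a / (g * max_tokens) for a in advantages]


if __name__ == "__main__":
    # Two wrong answers with the same negative advantage, 100 vs 1000 tokens.
    lengths, advs = [100, 1000], [-0.5, -0.5]
    print(grpo_token_weights(lengths, advs))          # long answer's tokens penalised 10x less
    print(constant_norm_token_weights(advs, 1000))    # identical per-token penalty
```

Under the per-length average, each response's total contribution is fixed at advantage / G no matter how long it is, so for wrong answers the model is nudged toward longer outputs; a shared constant removes that incentive.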

