I hand-wrote a 500-LoC RL stack to make hacking on RL research much easier. Most RL stacks are either massive and unhackable, or duct-taped research scripts. I am open-sourcing Mithrl, a modular RLVR stack. Next items on my checklist: adding more complex environment examples, supporting multi-gpu + async RL,...

omkaar

4,368 subscribers

17,218 views • 3 months ago • via X (Twitter)

Related Videos

Hiring RL Engineer! Started off as a curious project at Lossfunk to push the boundaries of LLMs in social reasoning - we are now building RL environments, data, and benchmarks to simulate more real-world scenarios. If you want to train SoTA RL models over multi-GPUs (H200s/B200s) to unlock next AI frontier, this is for you.

Satpal Singh Rathore

45,915 views • 10 months ago

To solve AGI, we must first solve Geoguessr. For that I built vlm-gym, a simple RL gym written from scratch in JAX for Qwen3VL-4B (released yesterday), and added Geospot, an RL environment for geolocation, and learned that VLMs can learn how to geoguess. More:

Surya

140,348 views • 8 months ago

Introducing INTELLECT-3: Scaling RL to a 100B+ MoE model on our end-to-end stack. Achieving state-of-the-art performance for its size across math, code and reasoning. Built using the same tools we put in your hands, from environments & evals, RL frameworks, sandboxes & more.

Prime Intellect

1,137,660 views • 7 months ago

Going to ICRA next week in Atlanta!! We are on a mission to build the most cracked team on humanoid robotics. Hiring the best talents on the research frontier of VLA, world models, RL, and simulation! DM or email me for meetup! linxif@nvidia.com

Jim Fan

130,549 views • 1 year ago

Excited to present FastTD3: a simple, fast, and capable off-policy RL algorithm for humanoid control -- with an open-source code to run your own humanoid RL experiments in no time! Thread below 🧵

Younggyo Seo

130,968 views • 1 year ago

Okay now i want a classroom like this too where i can enjoy RL with my all boys and girls, idgaf even if they are straight 😭

Tonniᡣ𐭩

146,317 views • 6 months ago

The 13th and final RL GRIME Halloween mix (hurts to type that) is out 👑 RL got Jack Black for the intro and it starts with a new RL GRIME x WINK collab SABLE VALLEY

Dancing Astronaut

19,816 views • 1 year ago

🤔Want a principled way to RL your diffusion model? Check Data-regularized Reinforcement Learning (DDRL)! Post-train NVIDIA #Cosmos World Foundation models with a million GPU hours! 🤯 Novel formulation ➡️ Theoretically integrates SFT into RL ➡️ Robust to Reward Hacking 🛑 Details: #DDRL #Diffusion #RL #NVIDIA #Cosmos

Haotian Ye

77,612 views • 6 months ago

"Our approach is to build products and conduct research that are in service of accelerated AI deployments. Our platform team builds tools and context primitives that enable faster deployment. Our research team builds frontier systems, including a state-of-the-art RL stack. We then take that research and product and forward-deploy with our customers to help deliver real value." Thanks Founders You Should Know for having us. Open roles at:

Applied Compute

45,197 views • 3 months ago

RL is a powerful mechanism for training company-specific models on their unique work and data. This is what we do at Applied Compute. A key challenge is how to make RL efficient, because we need runs to be fast (delivered in days), cheap (scalable unit economics), and predictable (not just fast, but reliably fast). Here are some takeaways: • Synchronous RL is wasteful with time and compute. • Asynchronous RL is more efficient but introduces staleness, which causes learning instabilities. • Modeling and simulations can help analytically solve for what configuration leads to optimal efficiency. This allows us to rapidly prototype training configurations, without burning expensive compute cycles on trial runs. Two of our co-founders, Rhythm Garg 🚂 and Linden Li, discussed some of this research at AI Engineer recently, with a focus on the following subproblem: what is the highest throughput way to do RL given a maximum staleness and compute budget?

Applied Compute

45,480 views • 6 months ago

Open Duck Mini is standing up using a policy learned with RL in simulation! We are still pushing locomotion, so if you have some knowledge of Isaac Gym and want to help feel free to join us! Thomas Wolf

Antoine Pirrone

66,051 views • 1 year ago

I had 14 tabs open just to keep up with AI. arXiv, Papers With Code, every leaderboard, HuggingFace, half a dozen RL-env hubs... So I built one screen for all of it. The Bloomberg terminal for AI research. It's called Sophon 🧵

serafim

41,931 views • 1 month ago

We are moving from model controller to RL for locomotion. The policy is starting to look quite robust. Even more convinced now that legs are easier than wheels, given the software complexity has shrunk tremendously with RL.

Sankaet

133,130 views • 9 months ago

if i ever get nostalgic for being a uni student, I can just watch this video and it goes away rl quick

jam - LUX ERA🕊️

3,237,307 views • 1 year ago

🚨 RL for LLMs is finally accessible. Introducing OpenTinker: The first community-driven, open-source framework designed to democratize Reinforcement Learning for LLMs. Inspired by Thinking Machines's amazing Tinker, we realize the biggest bottleneck in agentic LLM research isn’t the math—it’s the setup. Current RL pipelines are messy. Configuring VeRL for every single experiment is a productivity killer. OpenTinker fixed it. 🛠 How OpenTinker Works: Decoupled Design of Server and Client - Setup Once, Run Forever: Configure the OpenTinker backend on your GPU cluster once. - Develop Locally: Define your RL environments directly on your laptop. - Train on the Cloud: Simply point your local client to the backend. The cluster handles the compute; you handle the science. 📉 The 10x Development Efficiency Thanks to our elegant architectural decomposition, OpenTinker reduces the time to develop a new RL training pipeline by at least an order of magnitude. ⚡ Turn Idle GPU Compute into Gold Small labs often have underutilized hardware. OpenTinker turns your idle GPUs into an internal/external API service for - RL Training - SFT - Inference 🎯 Who needs OpenTinker? - Researchers tired of infrastructure hell. - Labs needing to standardize workflows. - Teams wanting to maximize hardware ROI. Thanks my amazing PhD student Siqi Zhu for leading the project. We are building the future of open RL infra. Be the first to build with us. 👇 Start Building with OpenTinker Now 🚀 Repo: 🌐 Blog: If you believe RL should be accessible to everyone, give us a star, repost this 🔄 post, and let us know what agents you plan to build!

Jiaxuan You

58,120 views • 6 months ago

rl hate the fact i had to fight sb i had real love for and called my sister but shit is what it is

K.

189,232 views • 1 year ago

DPO Debate: Is RL needed for RLHF? All things as we cannot settle if DPO or RL is better. At least it is a good exercise. 1. Derivations in the DPO paper. Hint, the authors are good at math 2. cDPO, IPO, and related equations 3. Speculation on potential oddities of DPO vs RL 4. Reminders on the state of open RLHF tldr: we have more limitations with data and tooling and evaluation than optimizer choice Slides: Recent blog post of mine on DPO (more next Wed.): DPO Paper: On youtube:

Nathan Lambert

100,027 views • 2 years ago

further reward crafting for hand-centric (tm) PPO model. Closing in on what I want. need to slow down max velocities and I want more stepping/stride + less tip toe on left foot ideally. almost cant believe this is actually just RL!

Harrison Kinsley

15,973 views • 6 months ago

"One of the very confusing things about the models right now: how to reconcile the fact that they are doing so well on evals. And you look at the evals and you go, 'Those are pretty hard evals.' But the economic impact seems to be dramatically behind. There is [a possible] explanation. Back when people were doing pre-training, the question of what data to train on was answered, because that answer was everything. So you don't have to think if it's going to be this data or that data. When people do RL training, they say, 'Okay, we want to have this kind of RL training for this thing and that kind of RL training for that thing.' You say, 'Hey, I would love our model to do really well when we release it. I want the evals to look great. What would be RL training that could help on this task?' If you combine this with generalization of the models actually being inadequate, that has the potential to explain a lot of what we are seeing, this disconnect between eval performance and actual real-world performance"

Dwarkesh Patel

502,037 views • 7 months ago