🚨 Current scalable RL algos train a policy w/o a value function, which is limiting for learning in open-ended, non-stationary, dynamic environments. But how to scale value-based RL with more data/compute is unclear... Not anymore: presenting scaling laws for value-based RL 🧵⬇️

Aviral Kumar

6,020 subscribers

37,301 views • 1 year ago •via X (Twitter)

Science & Technology Education

Related Videos

Does off-policy value-based RL scale? In LLMs, larger scale predictably improves performance. Value-based RL learns from arbitrary data and is sample-efficient, but folk wisdom says it doesn't scale 🧵⬇️ We show predictability for scaling value-based RL!

Oleg Rybkin

23,979 views • 1 year ago

🤔 How to fine-tune an Imitation Learning policy (e.g., Diffusion Policy, ACT) with RL? As an RL practitioner, I’ve been struggling with this problem for a while. Here’s why it’s tough: 1️⃣ Special designs (usually for multimodal action distributions) in modern IL models make them non-trivial to fine-tune by RL. 2️⃣ Large policy models + RL's poor sample efficiency = a nightmare. But finally, we figured out a simple solution that works for any model architecture! 🌟 Check out our #ICLR2025 paper: “Policy Decorator: Model-Agnostic Online Refinement for Large Policy Models”, led by my amazing mentee Xiu Yuan. 🔗 🧵 Read more below!

Tongzhou Mu 🤖🦾🦿

16,959 views • 1 year ago

D4RL is a great benchmark, but is saturated. Introducing OGBench, a new benchmark for offline goal-conditioned RL and offline RL! Tasks include HumanoidMaze, Puzzle, Drawing, and more 🙂 Project page: GitHub: 🧵↓

Seohong Park

36,410 views • 1 year ago

Repair based on the current value.

Gisa w'I Rwanda🇷🇼❤️🇷🇼

107,536 views • 15 days ago

New work: The Value Axis 🎯 How do LLMs choose which path to take mid-task? We find they internally track the chance of reaching their goal along a linear axis, akin to a value function in RL. We show it modulates confidence in math & coding and can be reshaped with DPO and SFT.

Nick Jiang

25,071 views • 12 days ago

This figure from HIL-SERL is one of the clearest visualisations of how RL learns differently from imitation learning. The difference comes down to this: imitation learning treats each (state, action) pair as independent. A correction at timestep 20 teaches nothing about timestep 19 or 21. RL propagates reward backward through time. One successful insertion updates the value estimate of every state along the trajectory. So RL builds a full map of "which states lead to success"; imitation learning just memorizes individual snapshots. Setup: a robot inserting a RAM stick into a motherboard slot. Each dot is an end-effector position (Y = lateral, Z = height). Starting position is randomized. Left to right = training progressing. Top row (RL): the policy builds a funnel. Broad at the top, narrowing into the target. It systematically fills in the state space, learning which paths lead to success from many different starting positions. Bottom row (imitation learning / HG-DAgger, same human data): sparse, diffuse, no funnel. The policy only learns near states the human demonstrated. Both have access to the same data, including human corrections, but a completely different structure emerges.

Dominique Paul

24,433 views • 4 months ago
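
The HIL-SERL post above says that RL propagates reward backward through time, so one successful insertion updates the value estimate of every state along the trajectory. A minimal sketch of that idea follows. It is not the HIL-SERL implementation: it uses tabular TD(0) on a hypothetical 5-step trajectory with a single success reward at the end, and the function name, discount factor, and learning rate are illustrative assumptions.

```python
def td_backup(num_states=5, gamma=0.9, alpha=1.0, sweeps=1):
    """Tabular TD(0) over one successful trajectory s0 -> s1 -> ... -> goal.

    Only the final transition carries reward 1.0 (an assumption for this toy
    example). Replaying the trajectory in reverse order lets a single sweep
    push value back to every earlier state.
    """
    # One extra slot for the terminal state, whose value stays at 0.
    values = [0.0] * (num_states + 1)
    for _ in range(sweeps):
        for s in reversed(range(num_states)):
            reward = 1.0 if s == num_states - 1 else 0.0
            # Bootstrapped target: immediate reward plus discounted next value.
            target = reward + gamma * values[s + 1]
            values[s] += alpha * (target - values[s])
    return values[:num_states]


if __name__ == "__main__":
    # Values for s0..s4 after one reverse sweep over the trajectory.
    print([round(v, 4) for v in td_backup()])
```

Under these assumed settings, every state on the trajectory ends up with a nonzero value that increases toward the goal. A per-pair imitation loss has no comparable term linking one timestep to the next.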

RL is back! But is it always the best choice? In a new paper, we investigate under what circumstances neuroevolution outperforms RL in transfer learning tasks. See more details in the thread below 🧵 While NE performs best in simpler domains, it will be interesting to see if the lessons learned here can also be applied to more complex systems/tasks (LLMs?).

Sebastian Risi

11,130 views • 11 months ago

Frontier research just crossed a new threshold. Mind Lab is now public. And it begins with trillion-scale RL trained at only 10% of the usual compute.

Chidanand Tripathi

89,813 views • 6 months ago

🚨 New: Integrating Harbor (Harbor Framework) for end-to-end Computer-Use evaluation (for Windows and Linux) at scale with Thinking Machines' Tinker, OSWorld, Daytona, and bare-metal servers. We just added support for Computer Use, Tinker, and OSWorld to Harbor - a framework for evaluating agents and generating RL training data by running large-scale rollouts across parallel sandboxed environments and collecting trajectories for SFT and RL. Repo and blogpost below 👇

Marco Mascorro

19,448 views • 3 months ago

RL X-mas came early. 🎄 For too long, building powerful AI agents with Reinforcement Learning has been blocked by GPU scarcity and complex infrastructure. That ends today. Introducing Serverless RL from wandb, powered by CoreWeave! We're making RL accessible to all.

Weights & Biases

112,643 views • 8 months ago

Over the past months, Cohort I of our RL Residency has been shipping. Highlights: continual learning, automating AI research (from GPU programming to RL itself), embodied environments, multi-agent systems, and materials science discovery.

Prime Intellect

59,409 views • 2 months ago

Zombie robot RL policy

Simon Kalouche

164,401 views • 1 year ago

🔥 Nebius AI R&D is hiring AI Research Interns for short, high-impact RL projects. Exclusive to X right now — no LinkedIn mass postings yet. In 2019, I was a fresh dental grad with 3 months of runway left, begging for an AI shot. I know the grind. We’re looking for sharp early-career folks (students, grads, career-switchers) to join us and work on: > Agent trajectories analysis at scale > Long-horizon tasks for coding agents > Pushing open RL environments > Any other data / RL env / eval project that will benefit open-source community What you get: 💰 Fully paid internship (3-6 month) 📦 100% open-source shipping 📄 Co-author research papers ⚡️ Access to Nebius compute infra 🌍 Remote-friendly (EU/US) or Amsterdam/London/other office. If you’ve done any cool AI/ML/RL stuff, dm me with your most impressive project + 1-sentence summary + cv Sharing appreciated!🤝

Ibragim

33,427 views • 2 months ago

Introducing RL Environment Creator Skill. Now anyone can create RL environments: $ npx skills add adithya-s-k/RL_Envs_101 > You can create environments across multiple frameworks like OpenEnv, OpenReward, Verifiers, NemoGym ... > The repo has live working examples of environments that your coding agent can reference > The skill is designed to first understand what type of model you are training and create an environment while keeping that in mind. P.S. There’s a lot more to building RL environments that can be used for training. One major aspect is the data, which this skill can’t directly solve. However, the skill will help with implementing tools, rewards, and other components of an RL environment, making it easier to go from idea to implementation quickly across different frameworks. Let me know if you’d be interested in a detailed, end-to-end blog/tutorial on building an environment and actually training a model for a useful use case.

Adithya S K

46,556 views • 1 month ago

Twitter AU in which Boss and Noeul attempt to navigate their fame, friendship, and dating rumors while being under public scrutiny. Note: first time doing this and still learning. Also this story will not be in RL order, but will incorporate RL moments. #BoNoh #BossNoeul

MayflowerPrincess

20,606 views • 2 years ago

Some personal news: I recently joined Cursor. Cursor is a small, ambitious team, and they’ve created my favorite AI systems. We’re now building frontier RL models at scale in real-world coding environments. Excited for how good coding is going to be.

Sasha Rush

336,077 views • 1 year ago

New research from Databricks: LLMs Can Learn to Reason via Off-Policy RL. Optimal Advantage-based Policy Optimization with Lagged Inference policy (OAPL) shows you don’t need strict on-policy training to improve reasoning. It matches or beats Group Relative Policy Optimization (GRPO), stays stable with large policy lag, and uses ~3× fewer training generations. For Databricks customers, it’s a simpler, practical, and equally powerful approach to RL that Databricks is pioneering internally and bringing directly to Databricks customers, so enterprises can improve agents using the same methods we use for our in-house agents, without complex infrastructure changes.

Databricks AI Research

12,539 views • 4 months ago

he is ready to throw hands for rl 😭

ara ྀི

38,189 views • 9 months ago

US-based K-Scale Labs launched pre-orders for its open-source humanoid, priced at $9K. The K-Bot stands 4′7″, weighs 77 lbs, and integrates with K-Scale's open-source stack, covering RL/VLA models, Rust firmware, sim2real pipeline, and hardware design.

Brett Adcock

10,994 views • 11 months ago

How to return a value based on a criterion using the IF function. 🤓 #excel

Excel Dictionary

153,781 views • 3 years ago