Name: Does off-policy value-based RL scale? In LLMs, larger scale predictably improves performance. Value-based RL learns from arbitrary data and is sample-efficient, but folk wisdom says it doesn't scale 🧵⬇️ We show predictability for scaling value-based RL!
Uploaded: 2025-02-10T18:21:38.000Z
Duration: PT16.383S
Channel: Oleg Rybkin
Description: Short video by Oleg Rybkin. Does off-policy value-based RL scale? In LLMs, larger scale predictably improves performance. Value-based RL learns from arbitrary data and is sample-efficient, but folk wisdom says it doesn't scale 🧵⬇️ We show predictability for scaling value-based RL!

Does off-policy value-based RL scale? In LLMs, larger scale...

Oleg Rybkin

23,968 views • 1 year ago

🚨Current scalable RL algos train a policy w/o value...

Aviral Kumar

37,286 views • 1 year ago

Introducing CQN: Coarse-to-fine Q-Network, a value-based RL algorithm for...

Younggyo Seo

16,413 views • 1 year ago

New research from Databricks: LLMs Can Learn to Reason via Off-Policy RL. Optimal Advantage-based Policy Optimization with Lagged Inference policy (OAPL) shows you don’t need strict on-policy training to improve reasoning. It matches or beats Group Relative Policy Optimization (GRPO), stays stable with large policy lag, and uses ~3× fewer training generations. For Databricks customers, it’s a simpler, practical, and equally powerful approach to RL that Databricks is pioneering internally and bringing directly to Databricks customers, so enterprises can improve agents using the same methods we use for our in-house agents, without complex infrastructure changes.

Databricks AI Research

12,439 views • 3 months ago

US-based K-Scale Labs launched pre-orders for its open-source humanoid,...

Brett Adcock

10,994 views • 11 months ago

1/ While most RL methods use shallow MLPs (~2–5...

Kevin Wang

154,580 views • 1 year ago

Your value doesn't decrease based on someone's inability to...

Persephanii Aka Thick Yonce

380,674 views • 2 years ago

Why am I working on RL for LLMs, when...

Shane Gu

12,440 views • 6 months ago

Frontier research just crossed a new threshold. Mind Lab...

Chidanand Tripathi

89,813 views • 6 months ago

Thanks AK! Finally, robot can do continuous, agile, autonomous,...

Guanya Shi

32,142 views • 1 year ago

RL is back! But is it always the best...

Sebastian Risi

11,130 views • 10 months ago

Does LLM RL post-training need to be on-policy?

Kianté Brantley

113,263 views • 3 months ago

X_Acc_Flags - for ios The value of ‘Location (accurate)’...

CrazyMind

19,392 views • 6 months ago

"I think just based on vehicle autonomy, we can...

DogeDesigner

37,893 views • 11 months ago

Crypto can’t scale without solving value transfer. Not just...

Kima Network

25,026 views • 11 months ago

Polygon 2.0 is a concrete vision to build the...

Polygon | POL

72,469 views • 3 years ago

Pay for what you create, not for seats. FLORA's...

FLORA ©

22,998 views • 4 months ago

This figure from HIL-SERL is one of the clearest visualisations of how RL learns differently from imitation learning. The difference comes down to this: imitation learning treats each (state, action) pair as independent. A correction at timestep 20 teaches nothing about timestep 19 or 21. RL propagates reward backward through time. One successful insertion updates the value estimate of every state along the trajectory. So RL builds a full map of "which states lead to success"; imitation learning just memorizes individual snapshots. Setup: a robot inserting a RAM stick into a motherboard slot. Each dot is an end-effector position (Y = lateral, Z = height). Starting position is randomized. Left to right = training progressing. Top row (RL): the policy builds a funnel. Broad at the top, narrowing into the target. It systematically fills in the state space, learning which paths lead to success from many different starting positions. Bottom row (imitation learning / HG-DAgger, same human data): sparse, diffuse, no funnel. The policy only learns near states the human demonstrated. Both have access to the same data, including human corrections, but a completely different structure emerges.
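
The contrast in the post above can be illustrated with a toy sketch. This is not the HIL-SERL implementation: it assumes a tabular setting, uses timesteps as state ids, propagates a single sparse terminal reward as a discounted Monte Carlo return, and models imitation learning as a per-state lookup. The names `backup_values`, `imitation_update`, the action label `"move_left"`, and the discount `GAMMA = 0.9` are all hypothetical.

```python
GAMMA = 0.9  # hypothetical discount factor, chosen only for illustration


def backup_values(trajectory, final_reward, gamma=GAMMA):
    """Propagate one terminal reward backward along a trajectory.

    Every state visited on the way to success receives a value,
    discounted by its distance from the rewarded final state.
    """
    values = {}
    ret = final_reward
    for state in reversed(trajectory):
        values[state] = ret
        ret *= gamma
    return values


def imitation_update(policy, state, action):
    """Tabular behavior cloning: only the corrected state changes."""
    updated = dict(policy)
    updated[state] = action
    return updated


# Timesteps 18..22 stand in for states; the insertion succeeds at 22.
trajectory = list(range(18, 23))
values = backup_values(trajectory, final_reward=1.0)

# A human correction at timestep 20 touches only that entry.
policy = imitation_update({}, state=20, action="move_left")

for state in trajectory:
    print(state, round(values[state], 4), policy.get(state))
```

In this sketch, one success assigns a value to all five states, while the correction at timestep 20 leaves timesteps 19 and 21 without any policy entry. A neural policy would generalize somewhat to nearby states, so the tabular picture exaggerates the gap; it is meant only to show where the two update rules differ.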