Name: 🚨Current scalable RL algos train a policy w/o value func, which is limiting with learning in open-ended, non-stationary, dynamic environments. But, how to scale value-based RL with more data/compute is unclear... Not anymore: presenting scaling laws for value-based RL 🧵⬇️
Uploaded: 2025-02-07T07:26:05.000Z
Duration: PT10.428S
Channel: Aviral Kumar
Description: Aviral Kumar shorts video about 🚨Current scalable RL algos train a policy w/o value func, which is limiting with learning in open-ended, non-stationary, dynamic environments. But, how to scale value-based RL with more data/compute is unclear... Not anymore: presenting scaling laws for value-based RL 🧵⬇️

Aviral Kumar

37,286 views • 1 year ago

Does off-policy value-based RL scale? In LLMs, larger scale...

Oleg Rybkin

23,968 views • 1 year ago

🤔 How to fine-tune an Imitation Learning policy (e.g., Diffusion Policy, ACT) with RL? As an RL practitioner, I’ve been struggling with this problem for a while. Here’s why it’s tough: 1️⃣ Special designs (usually for multimodal action distributions) in modern IL models make them non-trivial to fine-tune by RL. 2️⃣ Large policy models + RL's poor sample efficiency = a nightmare. But finally, we figured out a simple solution that works for any model architecture! 🌟 Check out our #ICLR2025 paper: “Policy Decorator: Model-Agnostic Online Refinement for Large Policy Models”, led by my amazing mentee Xiu Yuan. 🔗 🧵 Read more below!

Tongzhou Mu 🤖🦾🦿

16,923 views • 1 year ago

D4RL is a great benchmark, but is saturated. Introducing...

Seohong Park

36,410 views • 1 year ago

This figure from HIL-SERL is one of the clearest visualisations of how RL learns differently from imitation learning. The difference comes down to this: imitation learning treats each (state, action) pair as independent. A correction at timestep 20 teaches nothing about timestep 19 or 21. RL propagates reward backward through time. One successful insertion updates the value estimate of every state along the trajectory. So RL builds a full map of "which states lead to success"; imitation learning just memorizes individual snapshots. Setup: a robot inserting a RAM stick into a motherboard slot. Each dot is an end-effector position (Y = lateral, Z = height). Starting position is randomized. Left to right = training progressing. Top row (RL): the policy builds a funnel. Broad at the top, narrowing into the target. It systematically fills in the state space, learning which paths lead to success from many different starting positions. Bottom row (imitation learning / HG-DAgger, same human data): sparse, diffuse, no funnel. The policy only learns near states the human demonstrated. Both have access to the same data, including human corrections, but a completely different structure emerges.

Dominique Paul

24,399 views • 3 months ago
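
The contrast drawn in the HIL-SERL post above (reward propagating backward along a trajectory versus independent state-action pairs) can be illustrated with a toy sketch. This is not the HIL-SERL method: the chain of integer states, the function names `td_backup` and `bc_policy`, and the discount of 0.9 are all assumptions made for illustration.

```python
# Toy illustration only (assumed setup, not the HIL-SERL algorithm):
# states are integers along one successful trajectory, and the only
# reward arrives on the final transition.

def td_backup(trajectory, final_reward=1.0, gamma=0.9):
    """Replay one trajectory backward with a one-step TD backup (step size 1)."""
    values = {s: 0.0 for s in trajectory}
    transitions = list(zip(trajectory[:-1], trajectory[1:]))
    for i in reversed(range(len(transitions))):
        s, s_next = transitions[i]
        is_last = i == len(transitions) - 1
        reward = final_reward if is_last else 0.0
        # The final state is terminal, so nothing is bootstrapped from it.
        bootstrap = 0.0 if is_last else values[s_next]
        values[s] = reward + gamma * bootstrap
    return values


def bc_policy(demonstrations):
    """Behaviour-cloning stand-in: a lookup of demonstrated (state, action) pairs."""
    return dict(demonstrations)


values = td_backup([0, 1, 2, 3, 4, 5])
policy = bc_policy([(2, "right")])
```

In this sketch a single success assigns a nonzero value to every non-terminal state on the path, discounted by distance from the goal, while the lookup policy holds information only for the one demonstrated state.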

RL is back! But is it always the best...

Sebastian Risi

11,130 views • 10 months ago

Frontier research just crossed a new threshold. Mind Lab...

Chidanand Tripathi

89,813 views • 6 months ago

🚨 New: Integrating Harbor (Harbor Framework) for end-to-end Computer-Use...

Marco Mascorro

19,448 views • 3 months ago

RL X-mas came early. 🎄 For too long, building...

Weights & Biases

112,643 views • 8 months ago

Over the past months, Cohort I of our RL...

Prime Intellect

59,206 views • 1 month ago

Zombie robot RL policy

Simon Kalouche

164,319 views • 11 months ago

🔥 Nebius AI R&D is hiring AI Research Interns for short, high-impact RL projects. Exclusive to X right now, no LinkedIn mass postings yet. In 2019, I was a fresh dental grad with 3 months of runway left, begging for an AI shot. I know the grind. We’re looking for sharp early-career folks (students, grads, career-switchers) to join us and work on: > Agent trajectories analysis at scale > Long-horizon tasks for coding agents > Pushing open RL environments > Any other data / RL env / eval project that will benefit open-source community What you get: 💰 Fully paid internship (3-6 month) 📦 100% open-source shipping 📄 Co-author research papers ⚡️ Access to Nebius compute infra 🌍 Remote-friendly (EU/US) or Amsterdam/London/other office. If you’ve done any cool AI/ML/RL stuff, dm me with your most impressive project + 1-sentence summary + cv. Sharing appreciated! 🤝

Ibragim

33,337 views • 1 month ago

Twitter AU in which Boss and Noeul attempt to...

MayflowerPrincess

20,561 views • 2 years ago

Introducing RL Environment Creator Skill. Now anyone can create RL environments $ npx skills add adithya-s-k/RL_Envs_101 > You can create environments across multiple frameworks like OpenEnv, OpenReward, Verifiers, NemoGym ... > the repo has live working examples of environments that your coding agent can reference > The skill is designed to first understand what type of model you are training and create an environment while keeping that in mind. P.S. There’s a lot more to building RL environments that can be used for training. One major aspect is the data, which this skill can’t directly solve. However, the skill will help with implementing tools, rewards, and other components of an RL environment, making it easier to go from idea to implementation quickly across different frameworks. Let me know if you’d be interested in a detailed, end-to-end blog/tutorial on building an environment and actually training a model for a useful use case.

Adithya S K

46,445 views • 1 month ago

Some personal news: I recently joined Cursor. Cursor is...

Sasha Rush

335,873 views • 1 year ago

New research from Databricks: LLMs Can Learn to Reason via Off-Policy RL. Optimal Advantage-based Policy Optimization with Lagged Inference policy (OAPL) shows you don’t need strict on-policy training to improve reasoning. It matches or beats Group Relative Policy Optimization (GRPO), stays stable with large policy lag, and uses ~3× fewer training generations. For Databricks customers, it’s a simpler, practical, and equally powerful approach to RL that Databricks is pioneering internally and bringing directly to Databricks customers, so enterprises can improve agents using the same methods we use for our in-house agents, without complex infrastructure changes.