🚨Current scalable RL algos train a policy w/o value func, which is limiting with learning in open-ended, non-stationary, dynamic environments. But, how to scale value-based RL with more data/compute is unclear... Not anymore: presenting scaling laws for value-based RL 🧵⬇️

Aviral Kumar

6,020 subscribers

37,301 views • 1 year ago • via X (Twitter)

Science, Technology & Education

Related videos

Does off-policy value-based RL scale? In LLMs, larger scale predictably improves performance. Value-based RL learns from arbitrary data and is sample-efficient, but folk wisdom says it doesn't scale 🧵⬇️We show predictability for scaling value-based RL!

Oleg Rybkin

23,979 views • 1 year ago

🤔 How to fine-tune an Imitation Learning policy (e.g., Diffusion Policy, ACT) with RL? As an RL practitioner, I’ve been struggling with this problem for a while. Here’s why it’s tough: 1️⃣ Special designs (usually for multimodal action distributions) in modern IL models make them non-trivial to fine-tune by RL. 2️⃣ Large policy models + RL's poor sample efficiency = a nightmare But finally, we figured out a simple solution that works for any model architecture! 🌟 Check out our #ICLR2025 paper: “Policy Decorator: Model-Agnostic Online Refinement for Large Policy Models”, led by my amazing mentee Xiu Yuan. 🔗 🧵 Read more below!

Tongzhou Mu 🤖🦾🦿

16,959 views • 1 year ago

D4RL is a great benchmark, but is saturated. Introducing OGBench, a new benchmark for offline goal-conditioned RL and offline RL! Tasks include HumanoidMaze, Puzzle, Drawing, and more 🙂 Project page: GitHub: 🧵↓

Seohong Park

36,410 views • 1 year ago

Repair based on the current value.

Gisa w'I Rwanda🇷🇼❤️🇷🇼

107,536 views • 17 days ago

New work: The Value Axis 🎯 How do LLMs choose which path to take mid-task? We find they internally track the chance of reaching their goal along a linear axis, akin to a value function in RL. We show it modulates confidence in math & coding and can be reshaped with DPO and SFT.

Nick Jiang

25,071 views • 14 days ago

This figure from HIL-SERL is one of the clearest visualisations of how RL learns differently from imitation learning. The difference comes down to this: imitation learning treats each (state, action) pair as independent. A correction at timestep 20 teaches nothing about timestep 19 or 21. RL propagates reward backward through time. One successful insertion updates the value estimate of every state along the trajectory. So RL builds a full map of "which states lead to success"; imitation learning just memorizes individual snapshots. Setup: a robot inserting a RAM stick into a motherboard slot. Each dot is an end-effector position (Y = lateral, Z = height). Starting position is randomized. Left to right = training progressing. Top row (RL): the policy builds a funnel. Broad at the top, narrowing into the target. It systematically fills in the state space, learning which paths lead to success from many different starting positions. Bottom row (imitation learning / HG-DAgger, same human data): sparse, diffuse, no funnel. The policy only learns near states the human demonstrated. Both have access to the same data, including human corrections, but a completely different structure emerges.

Dominique Paul

24,433 views • 4 months ago

RL is back! But is it always the best choice? In a new paper, we investigate under what circumstances neuroevolution outperforms RL in transfer learning tasks. See more details in the thread below 🧵 While NE performs best in simpler domains, it will be interesting to see if the lessons learned here can also be applied to more complex systems/tasks (LLMs?).

Sebastian Risi

11,130 views • 11 months ago

Frontier research just crossed a new threshold. Mind Lab is now public. And it begins with trillion-scale RL trained at only 10% of the usual compute.

Chidanand Tripathi

89,813 views • 6 months ago

🚨 New: Integrating Harbor (Harbor Framework) for end-to-end Computer-Use evaluation (for Windows and Linux) at scale with Thinking Machines' Tinker, OSWorld, Daytona, and bare-metal servers. We just added support for Computer Use, Tinker, and OSWorld to Harbor, a framework for evaluating agents and generating RL training data by running large-scale rollouts across parallel sandboxed environments and collecting trajectories for SFT and RL. Repo and blogpost below 👇

Marco Mascorro

19,448 views • 3 months ago

RL X-mas came early. 🎄 For too long, building powerful AI agents with Reinforcement Learning has been blocked by GPU scarcity and complex infrastructure. That ends today. Introducing Serverless RL from wandb, powered by CoreWeave! We're making RL accessible to all.

Weights & Biases

112,643 views • 8 months ago

Over the past months, Cohort I of our RL Residency has been shipping. Highlights - continual learning - automating AI research (from GPU programming to RL itself) - embodied environments - multi-agent systems - materials science discovery

Prime Intellect

59,409 views • 2 months ago

Zombie robot RL policy

Simon Kalouche

164,401 views • 1 year ago

🔥 Nebius AI R&D is hiring AI Research Interns for short, high-impact RL projects. Exclusive to X right now — no LinkedIn mass postings yet. In 2019, I was a fresh dental grad with 3 months of runway left, begging for an AI shot. I know the grind. We’re looking for sharp early-career folks (students, grads, career-switchers) to join us and work on: > Agent trajectories analysis at scale > Long-horizon tasks for coding agents > Pushing open RL environments > Any other data / RL env / eval project that will benefit open-source community What you get: 💰 Fully paid internship (3-6 month) 📦 100% open-source shipping 📄 Co-author research papers ⚡️ Access to Nebius compute infra 🌍 Remote-friendly (EU/US) or Amsterdam/London/other office. If you’ve done any cool AI/ML/RL stuff, dm me with your most impressive project + 1-sentence summary + cv Sharing appreciated!🤝

Ibragim

33,427 views • 2 months ago

Introducing RL Environment Creator Skill. Now anyone can create RL environments $ npx skills add adithya-s-k/RL_Envs_101 > You can create environments across multiple frameworks like OpenEnv, OpenReward, Verifiers, NemoGym ... > the repo has live working examples of environments that your coding agent can reference > The skill is designed to first understand what type of model you are training and create an environment with that in mind. P.S. There’s a lot more to building RL environments that can be used for training. One major aspect is the data, which this skill can’t directly solve. However, the skill will help with implementing tools, rewards, and other components of an RL environment, making it easier to go from idea to implementation quickly across different frameworks. Let me know if you’d be interested in a detailed, end-to-end blog/tutorial on building an environment and actually training a model for a useful use case.

Adithya S K

46,556 views • 1 month ago

Twitter AU in which Boss and Noeul attempt to navigate their fame, friendship, and dating rumors while being under public scrutiny. Note: first time doing this and still learning. Also this story will not be in RL order, but will incorporate RL moments. #BoNoh #BossNoeul

MayflowerPrincess

20,606 views • 2 years ago

Some personal news: I recently joined Cursor. Cursor is a small, ambitious team, and they’ve created my favorite AI systems. We’re now building frontier RL models at scale in real-world coding environments. Excited for how good coding is going to be.

Sasha Rush

336,077 views • 1 year ago

New research from Databricks: LLMs Can Learn to Reason via Off-Policy RL Optimal Advantage-based Policy Optimization with Lagged Inference policy (OAPL) shows you don’t need strict on-policy training to improve reasoning. It matches or beats Group Relative Policy Optimization (GRPO), stays stable with large policy lag, and uses ~3× fewer training generations. For Databricks customers, it’s a simpler, practical, and equally powerful approach to RL that Databricks is pioneering internally — and bringing directly to Databricks customers, so enterprises can improve agents using the same methods we use for our in-house agents, without complex infrastructure changes.

Databricks AI Research

12,539 views • 4 months ago

he is ready to throw hands for rl 😭

ara ྀི

38,189 views • 9 months ago

US-based K-Scale Labs launched pre-orders for its open-source humanoid, priced at $9K The K-Bot stands 4′7″, weighs 77lbs, and integrates with K-Scale's open-source stack - covering RL/VLA models, Rust firmware, sim2real pipeline, and hardware design

Brett Adcock

10,994 views • 1 year ago

How to return a value based on a criterion using the IF function. 🤓 #excel

Excel Dictionary

153,781 views • 3 years ago