Name: Does LLM RL post-training need to be on-policy?
Uploaded: 2026-02-27T19:47:18.000Z
Duration: PT13.850S
Channel: Kianté Brantley
Description: Short video by Kianté Brantley: Does LLM RL post-training need to be on-policy?

Does LLM RL post-training need to be on-policy?

Kianté Brantley

113,263 views • 3 months ago

LLM training on RTX 5090

ℏεsam

144,051 views • 1 year ago

Zombie robot RL policy

Simon Kalouche

164,319 views • 11 months ago

Does off-policy value-based RL scale? In LLMs, larger scale...

Oleg Rybkin

23,968 views • 1 year ago

Tutorial Time: Run any open-source LLM locally. Now we...

Linus ✦ Ekenstam

915,786 views • 2 years ago

Reinforcement learning should be able to improve upon behaviors...

Vivek Myers

79,514 views • 1 year ago

New project! Flow Policy Gradients for Robot Control tldr;...

Brent Yi

73,790 views • 4 months ago

How thick does ice need to be before you...

AlphaFox

43,053 views • 4 months ago

Does your bed board need to be replaced 😏

sytoys-us1

49,674 views • 8 months ago

Young man does not say "Thank you Sir" will...

newbeginning

12,898 views • 1 year ago

How it feels to be an LLM

Beff (e/acc)

36,669 views • 1 year ago

How does high-fidelity tactile simulation help robots nail the...

Binghao Huang

46,967 views • 7 months ago

i need to post on here more oops

haley ⋆✧.*

14,443 views • 1 year ago

i need to post more on here 😩

MS. F!NEE $HITT

124,925 views • 1 year ago

🚨Current scalable RL algos train a policy w/o value...

Aviral Kumar

37,286 views • 1 year ago

RL is painfully slow 😭, bottlenecked by super-long CoT rollout. 🔭 Sparse attention should help, but naive sparse rollout hits a brutal efficiency–stability tradeoff: a tedious trial-and-error sparsity sweep for each dense policy is required before an actual RL run. 🐤Sparrow chirps no more pain! Introducing Sparrow: Sparse Rollout for stable and efficient long-context RL. Sparrow finds that: 💡As long as we keep the tail distribution mismatch throughout the sparse rollout above a critical threshold, the RL training will be stable. 💡Even cooler! Through comprehensive controlled studies of RL on Qwen3-1.7B, 4B, and 8B thinking models with a 40K max rollout length, the critical threshold stays constant across model sizes. 💡Sparrow then finds the optimal dynamic sparse schedule to reach the threshold with minimal cost. 💡Sparrow's findings are empirically validated to generalize to Qwen3-14B, and hold on both Math and Coding RL. 🐤Sparrow empirically helps achieve 2.2× / 2.4× / 2.0× rollout speedup on Qwen3 1.7B / 4B / 8B thinking models, while keeping training stable over extended RL steps. We release the 🐤bird in the following formats. [1/n] Paper: Code: Blog:

Infini-AI-Lab

75,287 views • 3 days ago

🤔 How to fine-tune an Imitation Learning policy (e.g., Diffusion Policy, ACT) with RL? As an RL practitioner, I’ve been struggling with this problem for a while. Here’s why it’s tough: 1️⃣ Special designs (usually for multimodal action distributions) in modern IL models make them non-trivial to fine-tune with RL. 2️⃣ Large policy models + RL's poor sample efficiency = a nightmare. But finally, we figured out a simple solution that works for any model architecture! 🌟 Check out our #ICLR2025 paper: “Policy Decorator: Model-Agnostic Online Refinement for Large Policy Models”, led by my amazing mentee Xiu Yuan. 🔗 🧵 Read more below!

Tongzhou Mu 🤖🦾🦿

16,923 views • 1 year ago

I love this guy. He does not need to...

Dex

35,821 views • 1 year ago

Introducing RL Environment Creator Skill. Now anyone can create RL environments: $ npx skills add adithya-s-k/RL_Envs_101 > You can create environments across multiple frameworks like OpenEnv, OpenReward, Verifiers, NemoGym ... > The repo has live working examples of environments that your coding agent can reference. > The skill is designed to first understand what type of model you are training and create an environment with that in mind. P.S. There’s a lot more to building RL environments that can be used for training. One major aspect is the data, which this skill can’t directly solve. However, the skill will help with implementing tools, rewards, and other components of an RL environment, making it easier to go from idea to implementation quickly across different frameworks. Let me know if you’d be interested in a detailed, end-to-end blog/tutorial on building an environment and actually training a model for a useful use case.

Adithya S K

46,445 views • 1 month ago

POST THIS ON WV…I NEED TO SEE TAEHYUNG’S REACTION 😭

taehyꪜng

23,053 views • 1 year ago
