Uploaded: 2026-06-10T21:07:43.000Z
Duration: PT16.833S
Channel: Infini-AI-Lab

RL is painfully slow 😭, bottlenecked by super-long CoT rollout.
🔭 Sparse attention should help, but naive sparse rollout hits a brutal efficiency–stability tradeoff: a tedious trial-and-error sparsity sweep for each dense policy is required before an actual RL run.
🐤 Sparrow chirps: no more pain! Introducing Sparrow: Sparse Rollout for stable and efficient long-context RL. Sparrow finds that:
💡 As long as we keep the tail distribution mismatch throughout the sparse rollout above a critical threshold, the RL training will be stable.
💡 Even cooler! In comprehensive controlled studies of RL on Qwen3-1.7B, 4B, and 8B thinking models with a 40K max rollout length, the critical threshold stays constant across model sizes.
💡 Sparrow then finds the optimal dynamic sparse schedule to reach the threshold with minimal cost.
💡 Sparrow's findings are empirically validated to generalize to Qwen3-14B, and hold for both Math and Coding RL.
🐤 Sparrow empirically achieves 2.2× / 2.4× / 2.0× rollout speedup on Qwen3 1.7B / 4B / 8B thinking models while keeping training stable over extended RL steps.
We release the 🐤bird in the following formats. [1/n] Paper: Code: Blog:

Infini-AI-Lab

77,156 views • 22 days ago

Reinforcement learning should be able to improve upon behaviors...

Vivek Myers

79,514 views • 1 year ago

Introducing RL Environment Creator Skill. Now anyone can create RL environments:
$ npx skills add adithya-s-k/RL_Envs_101
> You can create environments across multiple frameworks like OpenEnv, OpenReward, Verifiers, NemoGym ...
> The repo has live working examples of environments that your coding agent can reference
> The skill is designed to first understand what type of model you are training and create an environment with that in mind
P.S. There's a lot more to building RL environments that can be used for training. One major aspect is the data, which this skill can't directly solve. However, the skill will help with implementing tools, rewards, and other components of an RL environment, making it easier to go from idea to implementation quickly across different frameworks. Let me know if you'd be interested in a detailed, end-to-end blog/tutorial on building an environment and actually training a model for a useful use case.

Adithya S K

46,556 views • 1 month ago

Frontier research just crossed a new threshold. Mind Lab...

Chidanand Tripathi

89,813 views • 6 months ago

🚨Current scalable RL algos train a policy w/o value...

Aviral Kumar

37,301 views • 1 year ago

New research from Databricks: LLMs Can Learn to Reason via Off-Policy RL. Optimal Advantage-based Policy Optimization with Lagged Inference policy (OAPL) shows you don't need strict on-policy training to improve reasoning. It matches or beats Group Relative Policy Optimization (GRPO), stays stable with large policy lag, and uses ~3× fewer training generations. It is a simpler, practical, and equally powerful approach to RL that Databricks is pioneering internally and bringing directly to its customers, so enterprises can improve agents using the same methods we use for our in-house agents, without complex infrastructure changes.

Databricks AI Research

12,539 views • 4 months ago

RL X-mas came early. 🎄 For too long, building...

Weights & Biases

112,643 views • 8 months ago

Does LLM RL post-training need to be on-policy?

Kianté Brantley

113,605 views • 4 months ago

RL is back! But is it always the best...

Sebastian Risi

11,130 views • 11 months ago

What if you kept asking an LLM to "make it better"? In some recent work at FAIR, we investigate how to efficiently use RL to fine-tune LLMs to iteratively self-improve on their previous solutions at inference time.
Training for iterated self-improvement can be costly. The naive approach to training for K self-improvement steps leads to K times the number of rollout steps per episode.
We introduce Exploratory Iteration (ExIt), an RL-based automatic curriculum method that bootstraps diverse training distributions of self-improvement tasks by upcycling the LLM's own responses at previous turns as the starting points for both self-improvement and *self-divergence.*
To decide what task to train on next, the curriculum prioritizes sampling of partial turn histories that led to higher return variance in their GRPO group (a learnability score that comes for free). This automatic curriculum over the bootstrapped task space teaches the model how to perform iterated self-improvement while only ever training the model on single-step self-improvement tasks.
We look at ExIt's impact in both single-turn (contest math problems) and multi-turn (BFCLv3 multi-turn tasks) settings, as well as on MLE-bench, where the LLM is run in a search scaffold to produce solutions to real Kaggle competitions. Across these eval settings, we find ExIt produces models with greater capacity for inference-time self-improvement compared to GRPO. Notably, ExIt models can self-improve on test tasks for many more steps than the typical solution depth encountered during training, including a 22% improvement in MLE-bench performance compared to GRPO.

Minqi Jiang

41,066 views • 9 months ago
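The variance-based prioritization described in the ExIt post above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`learnability`, `sample_start`), the use of population variance, the proportional sampling rule, and the toy returns are all assumptions made for illustration.

```python
import random
from statistics import pvariance


def learnability(group_returns):
    """Hypothetical score: variance of the returns inside one GRPO group."""
    return pvariance(group_returns)


def sample_start(histories, group_returns, rng):
    """Sample a partial turn history with probability proportional to its score.

    Falls back to uniform sampling when every group has zero variance.
    """
    scores = [learnability(r) for r in group_returns]
    if sum(scores) == 0:
        return rng.choice(histories)
    return rng.choices(histories, weights=scores, k=1)[0]


# Toy data (made up): three candidate starting points and the returns
# of the rollouts that were sampled from each one.
histories = ["task A, turn 0", "task A, turn 1", "task B, turn 0"]
returns = [
    [1.0, 1.0, 1.0, 1.0],  # always solved: nothing left to learn
    [0.0, 1.0, 0.0, 1.0],  # solved half the time: highest variance
    [0.0, 0.0, 0.0, 1.0],  # rarely solved: some signal
]

scores = [learnability(r) for r in returns]
rng = random.Random(0)
draws = [sample_start(histories, returns, rng) for _ in range(200)]
```

In this sketch a starting point whose group always succeeds gets zero weight and is never drawn, while the mixed-outcome groups are sampled in proportion to their return variance.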

fell for a g*rl in the holy month….

lalo

161,678 views • 4 months ago

This figure from HIL-SERL is one of the clearest visualisations of how RL learns differently from imitation learning.
The difference comes down to this: imitation learning treats each (state, action) pair as independent. A correction at timestep 20 teaches nothing about timestep 19 or 21. RL propagates reward backward through time. One successful insertion updates the value estimate of every state along the trajectory. So RL builds a full map of "which states lead to success"; imitation learning just memorizes individual snapshots.
Setup: a robot inserting a RAM stick into a motherboard slot. Each dot is an end-effector position (Y = lateral, Z = height). Starting position is randomized. Left to right = training progressing.
Top row (RL): the policy builds a funnel. Broad at the top, narrowing into the target. It systematically fills in the state space, learning which paths lead to success from many different starting positions.
Bottom row (imitation learning / HG-DAgger, same human data): sparse, diffuse, no funnel. The policy only learns near states the human demonstrated.
Both have access to the same data, including human corrections, but a completely different structure emerges.

Dominique Paul

24,433 views • 4 months ago
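The claim in the post above, that one successful outcome updates the value of every state along the trajectory, can be illustrated with a small Python sketch of discounted returns computed backward through time. This is a generic textbook computation, not the HIL-SERL algorithm; the function name `backward_returns`, the discount factor of 0.9, and the four-step toy trajectory are assumptions for illustration.

```python
def backward_returns(rewards, gamma=0.9):
    """Discounted return for every timestep, computed from the end backward."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # The return at step t is its own reward plus the discounted future.
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Toy trajectory (made up): reward arrives only at the final, successful step.
rewards = [0.0, 0.0, 0.0, 1.0]
returns = backward_returns(rewards)

# Every earlier state receives a nonzero learning target from the single
# terminal reward, and the targets grow as the state gets closer to success.
for t, g in enumerate(returns):
    print(f"t={t} return={g:.3f}")
```

By contrast, a per-sample imitation loss on a correction at one timestep would only change the prediction for that timestep's state, which is the asymmetry the post describes.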

Excited to share CrystalReasoner, a reasoning model for crystal...

Sherry Yang

10,740 views • 1 month ago

I've gotten a mujoco sim RL training loop for...

kache

29,728 views • 22 days ago

Thanks AK! Finally, robot can do continuous, agile, autonomous,...

Guanya Shi

32,155 views • 1 year ago

PPO has long dominated robot locomotion training in simulation....

Robotic Systems Lab

41,757 views • 22 days ago

GR-RL Going Dexterous and Precise for Long-Horizon Robotic Manipulation

AK

17,630 views • 7 months ago

Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models
Contributions:
• We introduce Diffuman4D, a novel diffusion model that generates spatio-temporally consistent and high-resolution (1024p) human videos from sparse-view video inputs.
• We propose a sliding iterative denoising mechanism that enhances both the spatial and temporal consistency of generated long-term videos while maintaining efficient inference.
• We design a human pose conditioning scheme to enhance the appearance quality and motion accuracy of generated human videos.
• We plan to release our processed version of the DNA-Rendering dataset, which we believe will benefit future research in this area.