Infini-AI-Lab

@InfiniAILab · 1,973 subscribers

Shorts

RL is painfully slow 😭, bottlenecked by super-long CoT rollouts.
🔭 Sparse attention should help, but naive sparse rollout hits a brutal efficiency–stability tradeoff: each dense policy requires a tedious trial-and-error sparsity sweep before the actual RL run.
🐤 Sparrow chirps: no more pain! Introducing Sparrow: Sparse Rollout for stable and efficient long-context RL.

Sparrow finds that:
💡 As long as the tail distribution mismatch throughout the sparse rollout is kept above a critical threshold, RL training stays stable.
💡 Even cooler! In comprehensive controlled studies of RL on Qwen3-1.7B, 4B, and 8B thinking models with a 40K max rollout length, the critical threshold stays constant across model sizes.
💡 Sparrow then finds the optimal dynamic sparse schedule that reaches the threshold at minimal cost.
💡 These findings are empirically validated to generalize to Qwen3-14B and hold for both Math and Coding RL.

🐤 Sparrow empirically achieves 2.2× / 2.4× / 2.0× rollout speedups on the Qwen3 1.7B / 4B / 8B thinking models, respectively, while keeping training stable over extended RL steps.

We release the 🐤 bird in the following formats. [1/n]
Paper: Code: Blog:

75,287 views