Reinforcement learning should be able to improve upon behaviors seen during training. In practice, RL agents often struggle to generalize to new long-horizon behaviors. Our new paper studies horizon generalization, the degree to which RL algorithms generalize to reaching distant goals. 1/

Vivek Myers

1,237 subscribers

79,514 views • 1 year ago • via X (Twitter)

Related videos

RL X-mas came early. 🎄 For too long, building powerful AI agents with Reinforcement Learning has been blocked by GPU scarcity and complex infrastructure. That ends today. Introducing Serverless RL from wandb, powered by CoreWeave! We're making RL accessible to all.

Weights & Biases

112,643 views • 8 months ago

Haven't been to a conference in a while, really excited to be at #NeurIPS2024! I'll be helping present 4 of our group's recent papers:
1. Overcoming the Sim-to-Real Gap: Leveraging Simulation to Learn to Explore for Real-World RL
2. Distributional Successor Features Enable Zero-Shot Policy Optimization
3. Learning to Cooperate with Humans using Generative Agents
4. Personalizing Reinforcement Learning from Human Feedback with Variational Preference Learning
Find more details on each paper and where to find us in this thread (1/6)

Abhishek Gupta

10,777 views • 1 year ago

RL is painfully slow 😭, bottlenecked by super-long CoT rollout.
🔭 Sparse attention should help, but naive sparse rollout hits a brutal efficiency–stability tradeoff: a tedious trial-and-error sparsity sweep for each dense policy is required before an actual RL run.
🐤 Sparrow chirps no more pain! Introducing Sparrow: Sparse Rollout for stable and efficient long-context RL. Sparrow finds that:
💡 As long as we keep the tail distribution mismatch throughout the sparse rollout above a critical threshold, the RL training will be stable.
💡 Even cooler! Through comprehensive control studies of RL on Qwen3-1.7B, 4B, and 8B thinking models with a 40K max rollout length, the critical threshold stays constant across model sizes.
💡 Sparrow then finds the optimal dynamic sparse schedule to reach the threshold with minimal cost.
💡 Sparrow's findings are empirically validated to generalize to Qwen3-14B, and hold on both Math and Coding RL.
🐤 Sparrow empirically helps achieve 2.2× / 2.4× / 2.0× rollout speedup on Qwen3 1.7B / 4B / 8B thinking models, while keeping training stability over extended RL steps.
We release the 🐤 bird in the following formats. [1/n] Paper: Code: Blog:

Infini-AI-Lab

76,895 views • 16 days ago

RL is back! But is it always the best choice? In a new paper, we investigate under what circumstances neuroevolution outperforms RL in transfer learning tasks. See more details in the thread below 🧵 While NE performs best in simpler domains, it will be interesting to see if the lessons learned here can also be applied to more complex systems/tasks (LLMs?).

Sebastian Risi

11,130 views • 11 months ago

New research from Databricks: "LLMs Can Learn to Reason via Off-Policy RL". Optimal Advantage-based Policy Optimization with Lagged Inference policy (OAPL) shows you don’t need strict on-policy training to improve reasoning. It matches or beats Group Relative Policy Optimization (GRPO), stays stable with large policy lag, and uses ~3× fewer training generations. For Databricks customers, it’s a simpler, practical, and equally powerful approach to RL that Databricks is pioneering internally and bringing directly to Databricks customers, so enterprises can improve agents using the same methods we use for our in-house agents, without complex infrastructure changes.

Databricks AI Research

12,439 views • 3 months ago

Does LLM RL post-training need to be on-policy?

Kianté Brantley

113,537 views • 3 months ago

Language models understand natural proteins. But can they generalize beyond, to design completely new proteins from scratch? New preprint: A 🧵

Tom Sercu

275,206 views • 3 years ago

Robora Sim: A PyBullet-Powered Environment for Learning Robotic Physical Intelligence
We are currently building our Robora simulation environment setup for our sim-based learning, leveraging PyBullet, an industry-standard physics engine widely used in AI-driven robotics research and development. The environment is optimized with GPU-accelerated learning algorithms, enabling high-speed imitation learning and reinforcement learning within a safe and controlled virtual setup before shipping out to the real world. This simulation platform allows our models to learn, adapt, and generalize across different robot morphologies, terrain types, and task objectives, all before deployment to the real world. At its core, the system combines a VLA-powered high-level planner with low-level motion control algorithms, working cohesively to produce emergent, physically intelligent behaviors. This synergy between simulation, learning, and real-world transfer marks a major step forward in our pursuit of adaptive and intelligent robotic systems. Through advanced domain randomization and synthetic data generation, the Robora Simulation Environment ensures that policies trained in simulation transfer effectively to real-world robots, minimizing the sim-to-real gap. Moreover, users will be able to test and integrate their own hardware kits within selected simulation environments in the Robora Dapp, ensuring seamless compatibility and safer real-world implementation.

Robora

23,489 views • 8 months ago

🔬🤖Excited to share our new preprint on robust virtual staining of key organelles! ✨ Virtual staining and segmentation of nuclei and cell membranes in phase images can significantly accelerate image-based screens and developmental studies. 🔍 Current virtual staining models often struggle with nuisance variations in label-free input and don't generalize well to new cell types. We’re releasing our training protocols that lead to robust models and enable few-shot generalization. 🔧 Stay tuned for our revamped codebase with weights and example data, coming soon! 👏🧫🐟 Huge shoutout to Ziwen Liu, Eduardo Hirata-Miyasaki, my team Chan Zuckerberg Biohub Network (Soorya, Johanna, Christian, Talon Chandler, Ivan Ivanov), and the teams of Manuel Leonetti, Carolina Arias, and Adrian Jacobo for this fun collaboration! #VirtualStaining #DL #CellBiology #Zebrafish

Shalin Mehta

20,185 views • 2 years ago

lando in the new RL ad!

ray

18,331 views • 5 months ago

Over the past months, Cohort I of our RL Residency has been shipping. Highlights:
- continual learning
- automating AI research (from GPU programming to RL itself)
- embodied environments
- multi-agent systems
- materials science discovery

Prime Intellect

59,409 views • 1 month ago

We are excited to announce our strategic partnership with Fraction AI for their IIO launch. Fraction AI is the first decentralized training platform for AI agents with millions of sessions, 150,000+ agents deployed, and over $1M in rewards distributed. On our platform, thousands of AI agents compete in specialized domains and continuously improve through Reinforcement Learning. Stay tuned for upcoming surprise announcements.

peep icosystem

19,912 views • 9 months ago

Big Pharma after learning of a new deadly pandemic on the horizon

NFL Memes

92,169 views • 1 month ago

These are not CGI. Reinforcement learning is so back. When operating on strings, it gives us o3. When operating on physical motors, it gives us a perfect humanoid backflip and a robot creature that out-maneuvers almost every animal on earth. RL is one of the only learning algorithms that can master both the world of bits and the world of atoms. Give me a reward function, and I shall move the world. 2025, Year of RL.

Jim Fan

356,697 views • 1 year ago

It's a Bird... It's a Plane... It's ??? Someone is coming here in our new Mandaluyong branch to discover new horizon! Are you ready? 🤩 Stay tuned!

Horizon Spa Mandaluyong

88,437 views • 2 years ago

Congratulations also to Patrick 👍 for his #ICLR paper on Temporal Difference (TD) learning. In it, we solve the decades-old puzzle of why TD can solve complex RL tasks that gradient descent cannot.

Thuerey Group at TUM

22,074 views • 1 year ago

Yes, such a great BioRob meeting. Enjoyed catching up with everyone. Great to see all the RL+biomechanics here. Also honored that our work on imitation learning for gait analysis was another best paper finalist. Nice updates in MuJoCo coming soon!

R. James Cotton

11,944 views • 1 year ago

A new level might be on the horizon

spark

12,548 views • 3 months ago

1/ While most RL methods use shallow MLPs (~2–5 layers), we show that scaling up to 1000 layers for contrastive RL (CRL) can significantly boost performance, with gains ranging from 2x to 50x on a diverse suite of robotic tasks. Webpage+Paper+Code:

Kevin Wang

154,580 views • 1 year ago