Our work, "A Primer on SO(3) Action Representations in Deep Reinforcement Learning," was accepted to #ICLR2026! We provide a systematic study of action representation choices in RL, showing that they fundamentally impact training stability and performance. #Robotics #AI #RL

Learning Systems and Robotics Lab (is hiring!)

3,003 subscribers

49,655 views • 4 months ago • via X (Twitter)

Science & Technology #ICLR2026 #Robotics #AI #RL

Related videos

Self-supervised representation learning looks a bit like RL. What if we literally use RL as a SSL method for visual representations? Turns out that it works quite well. In new work by Dibya Ghosh, we show how this can be done:

Sergey Levine

48,751 views • 1 year ago

We asked Sholto Douglas from Anthropic about the costs of RL (Reinforcement Learning) runs. "In Dario Amodei's essay, he said that RL runs cost only $1M back in December." "RL is more naively parallelizable and scalable than pre-training." "With pre-training, you need everything in one big data center ideally. For RL, in theory, you could scale all over the world."

TBPN

76,696 views • 1 year ago

Learning Agile Soccer Skills for a Bipedal Robot with Deep Reinforcement Learning investigated the application of Deep Reinforcement Learning (Deep RL) for low-cost, miniature humanoid hardware in a dynamic environment, showing the method can synthesize sophisticated and safe movement skills making up complex behavioral strategies in a simplified one-versus-one (1v1) soccer game abs: project page:

AK

293,352 views • 3 years ago

So we did a bunch of projects with real world reinforcement learning - but it was often too inefficient to be practical to train tabula rasa. This suggests we need better priors, but acquiring these from on-robot data can often be expensive as well. In our recent work, we show that despite being fundamentally inaccurate, simulation can provide a cheap way to guide real-world RL finetuning to be super efficient! We propose Simulation-Guided Fine-Tuning (SGFT) - a simple paradigm for sim2real finetuning that uses simulation to provide reward shaping that accelerates real world RL finetuning *beyond* just providing an initialization. TLDR: Use value functions from sim to shape rewards for real-world RL, see large sample efficiency improvements 🧵(1/6)

Abhishek Gupta

13,637 views • 1 year ago
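The TLDR in the post above, shaping real-world rewards with a value function learned in simulation, can be illustrated with classic potential-based reward shaping. This is a minimal sketch of that general idea, not the SGFT authors' implementation: `shaped_reward`, `toy_sim_value`, and the one-dimensional toy state are hypothetical names and assumptions introduced only for illustration.

```python
from typing import Callable, Sequence

State = Sequence[float]


def shaped_reward(
    reward: float,
    state: State,
    next_state: State,
    sim_value: Callable[[State], float],
    gamma: float = 0.99,
    done: bool = False,
) -> float:
    """Potential-based shaping: r' = r + gamma * V(s') - V(s).

    `sim_value` stands in for a value function trained in simulation
    (an assumption for this sketch). The potential of a terminal next
    state is taken to be zero.
    """
    next_potential = 0.0 if done else sim_value(next_state)
    return reward + gamma * next_potential - sim_value(state)


def toy_sim_value(state: State) -> float:
    """Hypothetical sim value: higher when the state is closer to 0."""
    return -abs(state[0])


if __name__ == "__main__":
    # Moving from 2.0 toward the goal at 0.0 earns a positive bonus
    # even though the sparse environment reward is still zero.
    print(shaped_reward(0.0, [2.0], [1.0], toy_sim_value, gamma=0.5))
```

With a sparse real-world reward, the shaping term gives the learner a dense signal whenever it moves toward states the simulator considers valuable, which is one plausible reading of why sample efficiency might improve.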

We open-sourced QeRL (Quantization-enhanced Reinforcement Learning)!
🧠 4-bit quantized RL training
💪 Train a 32B LLM on a single H100 GPU
⚙️ 1.7× faster overall training
🎯 Accuracy on par with bfloat16
🔥 Supports NVFP4 quantization format
Moreover, we show that quantization helps exploration in RL training. Paper: Code: #NVIDIA #AIResearch #ReinforcementLearning #Quantization #LLM #EfficientAI

Yukang Chen

69,747 views • 9 months ago

🚀 🔥 AgiBot deploys Real-World Reinforcement Learning (RW-RL) in industrial robotics with Longcheer Technology. Robots now learn new skills in tens of MINUTES (not weeks), adapt to variations autonomously, and reconfigure flexibly, solving rigid automation pain points in precision manufacturing. A giant leap of intelligent automation for precision manufacturing! #AgiBot #ReinforcementLearning #RealWorldRL #Robotics #AI #IndustrialRobotics

AGIBOT

131,749 views • 8 months ago

Check out our latest work, "Actor-Critic Model Predictive Control: Differentiable Optimization meets Reinforcement Learning for Agile Flight," published in the IEEE Transactions on Robotics, where we reconcile #OptimalControl and #ReinforcementLearning, achieving the same super-human performance, but with superior generalizability, as our previous model-free deep RL! Code released! PDF: Code: Full Video: Model-free #ReinforcementLearning (RL) is known for its strong task performance and flexibility in optimizing general reward formulations. On the other hand, #ModelPredictiveControl (MPC) provides robustness, constraint handling, and powerful online replanning capabilities. In this work, we extend our previous AC-MPC paper (Romero, ICRA'24) by taking a deeper look at how both approaches can be unified. We introduce and extend Actor-Critic Model Predictive Control (AC-MPC), a framework that embeds a differentiable MPC inside an Actor-Critic RL architecture. This integration allows the MPC-based actor to perform short-term predictive optimization, while the critic facilitates long-horizon learning and exploration. We conduct a comprehensive study that highlights AC-MPC’s key advantages: - Better out-of-distribution generalization, both against unknown disturbances and changes in the quadrotor dynamics - Improved sample efficiency - A novel empirical analysis uncovering a relationship between the critic’s value function and the MPC cost function, providing deeper insight into their interplay. We validate our method in simulation and the real world on a quadcopter flying at superhuman speeds of up to 21 m/s, matching state-of-the-art model-free RL performance, and retaining the predictive structure of MPC for more reliable out-of-distribution behavior. 
Reference: Actor-Critic Model Predictive Control: Differentiable Optimization meets Reinforcement Learning for Agile Flight IEEE Transactions on Robotics (T-RO), 2025 PDF: Full Video: Code: Kudos to Ángel Romero, Elie Aljalbout, Yunlong Song! University of Zurich UZH Science UZH Space Hub AUTOASSESS European Research Council (ERC) UZHai

Davide Scaramuzza

27,090 views • 6 months ago

The network for machine intelligence

Two years ago, we laid out our vision for a machine learning compute protocol. One that connects every device in the world into an open network for machine intelligence, with no gatekeepers or artificial boundaries. This week, we’ll be sharing some of our early progress, beginning with RL Swarm, a peer-to-peer system for collaborative reinforcement learning over the internet. Next month, we’ll open our Testnet, allowing anyone to contribute to the frontier of open machine intelligence.

Introducing RL Swarm

RL Swarm is a fully open source system for collaborative reinforcement learning over the internet. It is a live demo of our research findings, which show that models training with RL learn faster when they train as a collective swarm than they do on their own. Join our swarm now to see this in practice. You can participate with consumer hardware at home or a powerful GPU in the cloud. You can follow along with the swarm’s progress by following the links below.

gensyn

228,892 views • 1 year ago

"One of the very confusing things about the models right now: how to reconcile the fact that they are doing so well on evals. And you look at the evals and you go, 'Those are pretty hard evals.' But the economic impact seems to be dramatically behind. There is [a possible] explanation. Back when people were doing pre-training, the question of what data to train on was answered, because that answer was everything. So you don't have to think if it's going to be this data or that data. When people do RL training, they say, 'Okay, we want to have this kind of RL training for this thing and that kind of RL training for that thing.' You say, 'Hey, I would love our model to do really well when we release it. I want the evals to look great. What would be RL training that could help on this task?' If you combine this with generalization of the models actually being inadequate, that has the potential to explain a lot of what we are seeing, this disconnect between eval performance and actual real-world performance"

Dwarkesh Patel

502,162 views • 8 months ago

We open-source Action Images — a new representation that translates 7-DoF robot actions into interpretable images. Video models are emerging as powerful robotic foundation models, but a key challenge remains: how can we seamlessly integrate robot policies into video models? Instead of representing actions as low-dimensional control tokens, Action Images provide a pixel-grounded action representation, reframing policy learning as a visual tracking problem! By unifying observations and actions in the same video space, Action Images enable a unified robotics world model that supports video-action joint generation, action-conditioned video generation, and action labeling! Code: Paper:

Chuang Gan

19,535 views • 1 month ago

We developed an RL method for fine-tuning our models for precise tasks in just a few hours or even minutes. Instead of training the whole model, we add an “RL token” output to π-0.6, our latest model, which is used by a tiny actor and critic to learn quickly with RL.

Physical Intelligence

435,870 views • 4 months ago

Deployment-Ready RL: Pitfalls, Lessons, and Best Practices

We’ve published a full transcript of a webinar with Kyle🤖🚀🦭 (UT Austin) and the Humanoid team on our blog. He explores Sim2Real RL:
- Action Space
- Observation Space
- Dealing with Model Mismatch
- Reward Tuning Intuition
- RL with Motion References

We’re sharing this to spread knowledge & help push humanoid robotics forward. Read or listen here:

Humanoid

60,003 views • 10 months ago

How GPT-5 thinks, with OpenAI VP of Research Jerry Tworek
00:00 - Intro
01:01 - What Reasoning Actually Means in AI
02:32 - Chain of Thought: Models Thinking in Words
05:25 - How Models Decide How Long to Think
07:24 - Evolution from o1 to o3 to GPT-5
11:00 - The Road to OpenAI: Growing up in Poland, Dropping out of School, Trading
20:32 - Working on Robotics and Rubik's Cube Solving
23:02 - A Day in the Life: Talking to Researchers
24:06 - How Research Priorities Are Determined
26:53 - OpenAI's Culture of Transparency
29:32 - Balancing Research with Shipping Fast
31:52 - Using OpenAI's Own Tools Daily
32:43 - Pre-Training Plus RL: The Modern AI Stack
35:10 - Reinforcement Learning 101: Training Dogs
40:17 - The Evolution of Deep Reinforcement Learning
42:09 - When GPT-4 Seemed Underwhelming at First
45:39 - How RLHF Made GPT-4 Actually Useful
48:02 - Unsupervised vs Supervised Learning
49:59 - GRPO and How DeepSeek Accelerated US Research
53:05 - What It Takes to Scale Reinforcement Learning
55:36 - Agentic AI and Long-Horizon Thinking
59:19 - Alignment as an RL Problem
1:01:11 - Winning ICPC World Finals Without Specific Training
1:05:53 - Applying RL Beyond Math and Coding
1:09:15 - The Path from Here to AGI
1:12:23 - Pure RL vs Language Models

Matt Turck

451,229 views • 9 months ago

Introducing INTELLECT-3: Scaling RL to a 100B+ MoE model on our end-to-end stack
Achieving state-of-the-art performance for its size across math, code and reasoning
Built using the same tools we put in your hands, from environments & evals, RL frameworks, sandboxes & more

Prime Intellect

1,142,730 views • 8 months ago

New Episode: Carolina Parada leads the robotics team at Google DeepMind. Carolina believes in a future with a broad, rich ecosystem of diverse robot types, where AI is smart enough to embody any robot. We discuss her journey, advancements in Gemini Robotics 1.5, cross-embodiment transfer, RL versus imitation, humanoids, world models, scaling laws, societal concerns, and more.
1:17 Introduction
1:38 Career journey and inspirations
3:50 Reaction to the 2022 ChatGPT launch
4:53 Google DeepMind's robotics mission
9:13 Key upgrades in Gemini Robotics 1.5
14:06 Agentic system's web usage
15:33 Robotics data-gathering methods
16:57 VLA vs. reasoning model training differences
18:22 Convergence of action and reasoning models
19:35 Cross-embodiment transfer
22:34 Learning directly from humans
24:20 Generalization challenges
27:01 Imitation versus reinforcement learning
28:51 Use of world models in robotics
30:31 Applications and testing of Gemini Robotics 1.5
33:04 Do humanoids deserve special focus?
35:06 Home humanoids timeline & limiting factors
37:19 Scaling laws in robotics versus self-driving
38:46 Value of learning classical robotics approaches
40:38 Ethical issues and societal impact

The Humanoid Hub

71,481 views • 9 months ago

Frontier models that use reasoning to "think" during inference are generating 5X more AI tokens per year. ✨ "Inference is now a thinking process. And in order to teach AI how to think, reinforcement learning and very significant computation was introduced into post-training," said NVIDIA CEO Jensen Huang in a recent keynote at #CES2026. Reinforcement learning is increasing computation demands across all AI scaling laws: pre-training, post-training, and test-time scaling. Learn more about reinforcement learning and AI scaling ➡️

NVIDIA AI Infrastructure

13,329 views • 6 months ago

RL is a powerful mechanism for training company-specific models on their unique work and data. This is what we do at Applied Compute. A key challenge is how to make RL efficient, because we need runs to be fast (delivered in days), cheap (scalable unit economics), and predictable (not just fast, but reliably fast). Here are some takeaways:
• Synchronous RL is wasteful with time and compute.
• Asynchronous RL is more efficient but introduces staleness, which causes learning instabilities.
• Modeling and simulations can help analytically solve for what configuration leads to optimal efficiency. This allows us to rapidly prototype training configurations, without burning expensive compute cycles on trial runs.

Two of our co-founders, Rhythm Garg and Linden Li, discussed some of this research at AI Engineer recently, with a focus on the following subproblem: what is the highest throughput way to do RL given a maximum staleness and compute budget?

Applied Compute

45,555 views • 7 months ago

Last semester, I taught Reinforcement Learning class again at UCLA. Together with my amazing TAs Matthew and Caiyuan, we built a mini-project: MetaDrive Arena 🚗🤖 Students applied what they learned in class, trained RL agents, and competed on a live leaderboard. The results were incredible, with 94 agents, 2K submissions, and 130K matches. We saw tons of creativity, clever ideas, and real progress in learning. We’re now releasing it publicly to support RL education and experimentation. Try it out and train your own agent at 🔗

Bolei Zhou

25,253 views • 3 months ago

Introducing Reinforcement-Learned Teachers (RLTs): Transforming how we teach LLMs to reason with reinforcement learning (RL). Blog: Paper: Traditional RL focuses on “learning to solve” challenging problems with expensive LLMs and constitutes a key step in making student AI systems ultimately acquire reasoning capabilities via distillation and cold-starting. Enter our RLTs—a new class of models prompted with not only a problem’s question but also its solution, and directly trained to generate clear, step-by-step “explanations” to teach their students. Remarkably, an RLT with only 7B parameters produces superior results when distilling and cold-starting students in competitive and graduate-level reasoning tasks than orders-of-magnitude larger LLMs. RLTs are as effective even when distilling 32B students, much larger than the teacher itself—unlocking a new standard for efficiency in developing reasoning language models with RL. Code:

Sakana AI

179,276 views • 1 year ago