Video wird geladen...

Video konnte nicht geladen werden

Beim Laden dieses Videos ist ein Problem aufgetreten. Dies könnte an einem vorübergehenden Netzwerkproblem liegen oder das Video ist möglicherweise nicht verfügbar.

🚨 RL for LLMs is finally accessible. Introducing OpenTinker: The first community-driven, open-source framework designed to democratize Reinforcement Learning for LLMs. Inspired by Thinking Machines's amazing Tinker, we realize the biggest bottleneck in agentic LLM research isn’t the math—it’s the setup. Current RL pipelines are messy. Configuring VeRL for... every single experiment is a productivity killer. OpenTinker fixed it. 🛠 How OpenTinker Works: Decoupled Design of Server and Client - Setup Once, Run Forever: Configure the OpenTinker backend on your GPU cluster once. - Develop Locally: Define your RL environments directly on your laptop. - Train on the Cloud: Simply point your local client to the backend. The cluster handles the compute; you handle the science. 📉 The 10x Development Efficiency Thanks to our elegant architectural decomposition, OpenTinker reduces the time to develop a new RL training pipeline by at least an order of magnitude. ⚡ Turn Idle GPU Compute into Gold Small labs often have underutilized hardware. OpenTinker turns your idle GPUs into an internal/external API service for - RL Training - SFT - Inference 🎯 Who needs OpenTinker? - Researchers tired of infrastructure hell. - Labs needing to standardize workflows. - Teams wanting to maximize hardware ROI. Thanks my amazing PhD student Siqi Zhu for leading the project. We are building the future of open RL infra. Be the first to build with us. 👇 Start Building with OpenTinker Now 🚀 Repo: 🌐 Blog: If you believe RL should be accessible to everyone, give us a star, repost this 🔄 post, and let us know what agents you plan to build!show more

Jiaxuan You

4,824 subscribers

58,120 Aufrufe • vor 6 Monaten •via X (Twitter)

Bildung Wissenschaft & Technologie

Anya Rossi• Live Now

Private livecam show

0 Kommentare

Keine Kommentare verfügbar

Kommentare vom Original-Post werden hier angezeigt

Ähnliche Videos

Hiring RL Engineer! Started off as a curious project at Lossfunk to push the boundaries of LLMs in social reasoning - we are now building RL environments, data, and benchmarks to simulate more real-world scenarios. If you want to train SoTA RL models over multi-GPUs (H200s/B200s) to unlock next AI frontier, this is for you.

Hiring RL Engineer! Started off as a curious project at Lossfunk to push the boundaries of LLMs in social reasoning - we are now building RL environments, data, and benchmarks to simulate more real-world scenarios. If you want to train SoTA RL models over multi-GPUs (H200s/B200s) to unlock next AI frontier, this is for you.

Satpal Singh Rathore

45,915 Aufrufe • vor 10 Monaten

The network for machine intelligence Two years ago, we laid out our vision for a machine learning compute protocol. One that connects every device in the world into an open network for machine intelligence, with no gatekeepers or artificial boundaries. This week, we’ll be sharing some of our early progress, beginning with RL Swarm, a peer-to-peer system for collaborative reinforcement learning over the internet. Next month, we’ll open our Testnet, allowing anyone to contribute to the frontier of open machine intelligence. Introducing RL Swarm RL Swarm is a fully open source system for collaborative reinforcement learning over the internet. It is a live demo of our research findings, which show that models training with RL learn faster when they train as a collective swarm than they do on their own. Join our swarm now to see this in practice. You can participate with consumer hardware at home or a powerful GPU in the cloud. You can follow along with the swarm’s progress by following the links below.

The network for machine intelligence Two years ago, we laid out our vision for a machine learning compute protocol. One that connects every device in the world into an open network for machine intelligence, with no gatekeepers or artificial boundaries. This week, we’ll be sharing some of our early progress, beginning with RL Swarm, a peer-to-peer system for collaborative reinforcement learning over the internet. Next month, we’ll open our Testnet, allowing anyone to contribute to the frontier of open machine intelligence. Introducing RL Swarm RL Swarm is a fully open source system for collaborative reinforcement learning over the internet. It is a live demo of our research findings, which show that models training with RL learn faster when they train as a collective swarm than they do on their own. Join our swarm now to see this in practice. You can participate with consumer hardware at home or a powerful GPU in the cloud. You can follow along with the swarm’s progress by following the links below.

gensyn

228,703 Aufrufe • vor 1 Jahr

Introducing the Environments Hub RL environments are the key bottleneck to the next wave of AI progress, but big labs are locking them down We built a community platform for crowdsourcing open environments, so anyone can contribute to open-source AGI

Introducing the Environments Hub RL environments are the key bottleneck to the next wave of AI progress, but big labs are locking them down We built a community platform for crowdsourcing open environments, so anyone can contribute to open-source AGI

Prime Intellect

1,814,857 Aufrufe • vor 10 Monaten

Testing image-to-CAD for SR Platform on a new set of GPUs 🔥 We are working with lium.io (SN51) to showcase the efficiency of decentralized compute for SR Platform — from CAD generation to RL training and sim-to-real validation What it means: > Flexible, on-demand rental for us. Pay with crypto. > Diverse GPU options, powered by Openτensor Foundaτion's incentive alignment. > Lower equipment cost → higher margin for the network. Onchain robots, onchain compute.

Testing image-to-CAD for SR Platform on a new set of GPUs 🔥 We are working with lium.io (SN51) to showcase the efficiency of decentralized compute for SR Platform — from CAD generation to RL training and sim-to-real validation What it means: > Flexible, on-demand rental for us. Pay with crypto. > Diverse GPU options, powered by Openτensor Foundaτion's incentive alignment. > Lower equipment cost → higher margin for the network. Onchain robots, onchain compute.

Strike Robot

17,922 Aufrufe • vor 1 Monat

RL is a powerful mechanism for training company-specific models on their unique work and data. This is what we do at Applied Compute. A key challenge is how to make RL efficient, because we need runs to be fast (delivered in days), cheap (scalable unit economics), and predictable (not just fast, but reliably fast). Here are some takeaways: • Synchronous RL is wasteful with time and compute. • Asynchronous RL is more efficient but introduces staleness, which causes learning instabilities. • Modeling and simulations can help analytically solve for what configuration leads to optimal efficiency. This allows us to rapidly prototype training configurations, without burning expensive compute cycles on trial runs. Two of our co-founders, Rhythm Garg 🚂 and Linden Li, discussed some of this research at AI Engineer recently, with a focus on the following subproblem: what is the highest throughput way to do RL given a maximum staleness and compute budget?

RL is a powerful mechanism for training company-specific models on their unique work and data. This is what we do at Applied Compute. A key challenge is how to make RL efficient, because we need runs to be fast (delivered in days), cheap (scalable unit economics), and predictable (not just fast, but reliably fast). Here are some takeaways: • Synchronous RL is wasteful with time and compute. • Asynchronous RL is more efficient but introduces staleness, which causes learning instabilities. • Modeling and simulations can help analytically solve for what configuration leads to optimal efficiency. This allows us to rapidly prototype training configurations, without burning expensive compute cycles on trial runs. Two of our co-founders, Rhythm Garg 🚂 and Linden Li, discussed some of this research at AI Engineer recently, with a focus on the following subproblem: what is the highest throughput way to do RL given a maximum staleness and compute budget?

Applied Compute

45,480 Aufrufe • vor 6 Monaten

1/ Introducing RL Swarm’s new backend: GenRL. A modular reinforcement learning library built for distributed, fault-tolerant training - now powering RL Swarm from the ground up. 🧵

1/ Introducing RL Swarm’s new backend: GenRL. A modular reinforcement learning library built for distributed, fault-tolerant training - now powering RL Swarm from the ground up. 🧵

gensyn

82,166 Aufrufe • vor 1 Jahr

Take Part in a Groundbreaking Experiment in AI-Based Gaming! Introducing: AI Arena RL AI ARENA | ʌı RL is a groundbreaking experiment where your gameplay shapes AI agents, pioneering the future of AI-powered esports. Powered by NRN Agents Reinforcement Learning (RL), this campaign uses player-contributed gameplay data to create superhuman RL agents that compete in AI vs. AI esports exhibitions. The Ronin community has exclusive access to contribute data to these agents. Phase 1 - How It Works: ✅ Sign in to with your Ronin Wallet. ✅ Choose your AIcon – Select which RL agent learns from your gameplay: jAIhoz.ron AgentYP Agent J3ff @AgentSploots ✅ Play AI Arena Classic – A Smash Bros-style platform fighter. ✅ Build your Contribution Score – Better gameplay means higher scores. ✅ Earn $NRN Rewards – Share in the 500,000 NRN prize pool: 🔹 80% distributed to Players (data contributors) 🔹 20% awarded to AIcons based on leaderboard rank at the end of the campaign Phase 1 of the campaign kicks off now and will last for 48 hrs. For complete information about the campaign please visit: History Is Being Made Right Now—Don't Miss Your Chance to Be Part of It. Are you ready to redefine the future of AI gaming?

Take Part in a Groundbreaking Experiment in AI-Based Gaming! Introducing: AI Arena RL AI ARENA | ʌı RL is a groundbreaking experiment where your gameplay shapes AI agents, pioneering the future of AI-powered esports. Powered by NRN Agents Reinforcement Learning (RL), this campaign uses player-contributed gameplay data to create superhuman RL agents that compete in AI vs. AI esports exhibitions. The Ronin community has exclusive access to contribute data to these agents. Phase 1 - How It Works: ✅ Sign in to with your Ronin Wallet. ✅ Choose your AIcon – Select which RL agent learns from your gameplay: jAIhoz.ron AgentYP Agent J3ff @AgentSploots ✅ Play AI Arena Classic – A Smash Bros-style platform fighter. ✅ Build your Contribution Score – Better gameplay means higher scores. ✅ Earn $NRN Rewards – Share in the 500,000 NRN prize pool: 🔹 80% distributed to Players (data contributors) 🔹 20% awarded to AIcons based on leaderboard rank at the end of the campaign Phase 1 of the campaign kicks off now and will last for 48 hrs. For complete information about the campaign please visit: History Is Being Made Right Now—Don't Miss Your Chance to Be Part of It. Are you ready to redefine the future of AI gaming?

NRN Agents

44,353 Aufrufe • vor 1 Jahr

We developed an RL method for fine-tuning our models for precise tasks in just a few hours or even minutes. Instead of training the whole model, we add an “RL token” output to π-0.6, our latest model, which is used by a tiny actor and critic to learn quickly with RL.

We developed an RL method for fine-tuning our models for precise tasks in just a few hours or even minutes. Instead of training the whole model, we add an “RL token” output to π-0.6, our latest model, which is used by a tiny actor and critic to learn quickly with RL.

Physical Intelligence

431,287 Aufrufe • vor 3 Monaten

"One of the very confusing things about the models right now: how to reconcile the fact that they are doing so well on evals. And you look at the evals and you go, 'Those are pretty hard evals.' But the economic impact seems to be dramatically behind. There is [a possible] explanation. Back when people were doing pre-training, the question of what data to train on was answered, because that answer was everything. So you don't have to think if it's going to be this data or that data. When people do RL training, they say, 'Okay, we want to have this kind of RL training for this thing and that kind of RL training for that thing.' You say, 'Hey, I would love our model to do really well when we release it. I want the evals to look great. What would be RL training that could help on this task?' If you combine this with generalization of the models actually being inadequate, that has the potential to explain a lot of what we are seeing, this disconnect between eval performance and actual real-world performance"

"One of the very confusing things about the models right now: how to reconcile the fact that they are doing so well on evals. And you look at the evals and you go, 'Those are pretty hard evals.' But the economic impact seems to be dramatically behind. There is [a possible] explanation. Back when people were doing pre-training, the question of what data to train on was answered, because that answer was everything. So you don't have to think if it's going to be this data or that data. When people do RL training, they say, 'Okay, we want to have this kind of RL training for this thing and that kind of RL training for that thing.' You say, 'Hey, I would love our model to do really well when we release it. I want the evals to look great. What would be RL training that could help on this task?' If you combine this with generalization of the models actually being inadequate, that has the potential to explain a lot of what we are seeing, this disconnect between eval performance and actual real-world performance"

Dwarkesh Patel

502,037 Aufrufe • vor 7 Monaten

Introducing INTELLECT-3: Scaling RL to a 100B+ MoE model on our end-to-end stack Achieving state-of-the-art performance for its size across math, code and reasoning Built using the same tools we put in your hands, from environments & evals, RL frameworks, sandboxes & more

Introducing INTELLECT-3: Scaling RL to a 100B+ MoE model on our end-to-end stack Achieving state-of-the-art performance for its size across math, code and reasoning Built using the same tools we put in your hands, from environments & evals, RL frameworks, sandboxes & more

Prime Intellect

1,137,660 Aufrufe • vor 7 Monaten

New Course: Post-training of LLMs Learn to post-train and customize an LLM in this short course, taught by Banghua Zhu, Assistant Professor at the University of Washington University of Washington, and co-founder of @NexusflowX. Training an LLM to follow instructions or answer questions has two key stages: pre-training and post-training. In pre-training, it learns to predict the next word or token from large amounts of unlabeled text. In post-training, it learns useful behaviors such as following instructions, tool use, and reasoning. Post-training transforms a general-purpose token predictor—trained on trillions of unlabeled text tokens—into an assistant that follows instructions and performs specific tasks. Because it is much cheaper than pre-training, it is practical for many more teams to incorporate post-training methods into their workflows than pre-training. In this course, you’ll learn three common post-training methods—Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Online Reinforcement Learning (RL)—and how to use each one effectively. With SFT, you train the model on pairs of input and ideal output responses. With DPO, you provide both a preferred (chosen) and a less preferred (rejected) response and train the model to favor the preferred output. With RL, the model generates an output, receives a reward score based on human or automated feedback, and updates the model to improve performance. You’ll learn the basic concepts, common use cases, and principles for curating high-quality data for effective training. Through hands-on labs, you’ll download a pre-trained model from Hugging Face and post-train it using SFT, DPO, and RL to see how each technique shapes model behavior. In detail, you’ll: - Understand what post-training is, when to use it, and how it differs from pre-training. - Build an SFT pipeline to turn a base model into an instruct model. - Explore how DPO reshapes behavior by minimizing contrastive loss—penalizing poor responses and reinforcing preferred ones. - Implement a DPO pipeline to change the identity of a chat assistant. - Learn online RL methods such as Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), and how to design reward functions. - Train a model with GRPO to improve its math capabilities using a verifiable reward. Post-training is one of the most rapidly developing areas of LLM training. Whether you’re building a high-accuracy context-specific assistant, fine-tuning a model's tone, or improving task-specific accuracy, this course will give you experience with the most important techniques shaping how LLMs are post-trained today. Please sign up here:

New Course: Post-training of LLMs Learn to post-train and customize an LLM in this short course, taught by Banghua Zhu, Assistant Professor at the University of Washington University of Washington, and co-founder of @NexusflowX. Training an LLM to follow instructions or answer questions has two key stages: pre-training and post-training. In pre-training, it learns to predict the next word or token from large amounts of unlabeled text. In post-training, it learns useful behaviors such as following instructions, tool use, and reasoning. Post-training transforms a general-purpose token predictor—trained on trillions of unlabeled text tokens—into an assistant that follows instructions and performs specific tasks. Because it is much cheaper than pre-training, it is practical for many more teams to incorporate post-training methods into their workflows than pre-training. In this course, you’ll learn three common post-training methods—Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Online Reinforcement Learning (RL)—and how to use each one effectively. With SFT, you train the model on pairs of input and ideal output responses. With DPO, you provide both a preferred (chosen) and a less preferred (rejected) response and train the model to favor the preferred output. With RL, the model generates an output, receives a reward score based on human or automated feedback, and updates the model to improve performance. You’ll learn the basic concepts, common use cases, and principles for curating high-quality data for effective training. Through hands-on labs, you’ll download a pre-trained model from Hugging Face and post-train it using SFT, DPO, and RL to see how each technique shapes model behavior. In detail, you’ll: - Understand what post-training is, when to use it, and how it differs from pre-training. - Build an SFT pipeline to turn a base model into an instruct model. - Explore how DPO reshapes behavior by minimizing contrastive loss—penalizing poor responses and reinforcing preferred ones. - Implement a DPO pipeline to change the identity of a chat assistant. - Learn online RL methods such as Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), and how to design reward functions. - Train a model with GRPO to improve its math capabilities using a verifiable reward. Post-training is one of the most rapidly developing areas of LLM training. Whether you’re building a high-accuracy context-specific assistant, fine-tuning a model's tone, or improving task-specific accuracy, this course will give you experience with the most important techniques shaping how LLMs are post-trained today. Please sign up here:

Andrew Ng

125,146 Aufrufe • vor 11 Monaten

🔥Excited to share our new work: "A Practitioner's Guide to Multi-turn Agentic Reinforcement Learning"! We systematically study what actually works (and what doesn't) for agentic multi-turn RL, breaking down the design space into 3 pillars: 🌎Environment, 🤖Policy, and ⭐Reward. We conduct various ablations and empirical analysis on 🧩TextWorld, 🧙ALFWorld, and 🧑‍💻SWE-Gym. We also open source our codebase: 🐈🍵Meow-Tea-Taro 💜, a modular multi-turn RL framework with configurable environments, policies, and rewards! #ReinforcementLearning #AgenticAI #LLMs #Agents

🔥Excited to share our new work: "A Practitioner's Guide to Multi-turn Agentic Reinforcement Learning"! We systematically study what actually works (and what doesn't) for agentic multi-turn RL, breaking down the design space into 3 pillars: 🌎Environment, 🤖Policy, and ⭐Reward. We conduct various ablations and empirical analysis on 🧩TextWorld, 🧙ALFWorld, and 🧑‍💻SWE-Gym. We also open source our codebase: 🐈🍵Meow-Tea-Taro 💜, a modular multi-turn RL framework with configurable environments, policies, and rewards! #ReinforcementLearning #AgenticAI #LLMs #Agents

Ruiyi Wang 王睿仪

52,970 Aufrufe • vor 8 Monaten

The QVAC SDK is the "LEGO block" of the next era of computing. It’s a modular, local-first framework designed to turn anything—from a simple robot to an industrial server—into a sovereign, autonomous mind. Why build with QVAC? Atomic Intelligence: AI as a raw material embedded directly into your hardware. No Cloud Dependency: 0 latency and total privacy. If the internet breaks, your world keeps thinking. Infinite Scale: A single API for local AI that runs on any device, anywhere. From a child’s toy to the fabric of the universe, if you can dream it, you can build it. Start building the future: 🚀

The QVAC SDK is the "LEGO block" of the next era of computing. It’s a modular, local-first framework designed to turn anything—from a simple robot to an industrial server—into a sovereign, autonomous mind. Why build with QVAC? Atomic Intelligence: AI as a raw material embedded directly into your hardware. No Cloud Dependency: 0 latency and total privacy. If the internet breaks, your world keeps thinking. Infinite Scale: A single API for local AI that runs on any device, anywhere. From a child’s toy to the fabric of the universe, if you can dream it, you can build it. Start building the future: 🚀

QVAC

4,706,933 Aufrufe • vor 2 Monaten

I hand-wrote a 500-LoC RL stack to make hacking on RL research much easier. Most RL stacks are either massive and unhackable, or duct-taped research scripts. I am open-sourcing Mithrl, a modular RLVR stack. Next items on my checklist: adding more complex environment examples, supporting multi-gpu + async RL, and QoL fixes. I might scrap external runtime dependencies (Huggingface PEFT + vLLM) and write purpose-built, simpler versions from scratch if I feel the need. If you want to experiment with RL and are looking to own sovereign tools, I’d love to get on call, understand your requirements and help integrate for free.

I hand-wrote a 500-LoC RL stack to make hacking on RL research much easier. Most RL stacks are either massive and unhackable, or duct-taped research scripts. I am open-sourcing Mithrl, a modular RLVR stack. Next items on my checklist: adding more complex environment examples, supporting multi-gpu + async RL, and QoL fixes. I might scrap external runtime dependencies (Huggingface PEFT + vLLM) and write purpose-built, simpler versions from scratch if I feel the need. If you want to experiment with RL and are looking to own sovereign tools, I’d love to get on call, understand your requirements and help integrate for free.

omkaar

17,218 Aufrufe • vor 3 Monaten

I like to think of evals as something active, not passive -- it's a North Star that steers LLMs toward higher intelligence. Evals should drive your RL/SFT/post-training decisions. Internal evals at frontier labs make a huge difference -- and you can see it in how models behave differently (GPT seems better at one-shot tasks, Claude at multi-turn). If you want to learn more about building evals that actually improve your model in post-training, check out our AMD x DeeplearningAI course "Fine-tuning & RL for LLMs: Intro to Post-training" (content is free):

I like to think of evals as something active, not passive -- it's a North Star that steers LLMs toward higher intelligence. Evals should drive your RL/SFT/post-training decisions. Internal evals at frontier labs make a huge difference -- and you can see it in how models behave differently (GPT seems better at one-shot tasks, Claude at multi-turn). If you want to learn more about building evals that actually improve your model in post-training, check out our AMD x DeeplearningAI course "Fine-tuning & RL for LLMs: Intro to Post-training" (content is free):

Sharon Zhou

16,613 Aufrufe • vor 4 Monaten

Today's Training Data episode takes us BTS on the infrastructure challenges required to do large RL runs at scale, featuring Federico Cassano (Composer Lead at Cursor) and Dmytro Dzhulgakov (Co-Founder at Fireworks AI). The Cursor team trained Composer 2 on Fireworks by starting with a strong base model (Kimi 2.5) and performing large-scale mid-training on code tokens and web data to learn common patterns and libraries, followed by a large-scale Reinforcement Learning run to learn how to navigate the Cursor harness, call tools, and write correct code. Today's episode dives into the systems and infrastructure challenges of making that large RL run happening, and there were many (!!), from numerical mismatch to global distribution to synchronizing rollouts across asynchronous pipelines to keeping track of expert activation across runs and more. Extremely nerdy in-the-weeds challenges that Federico and Dima were delighted to nerd out on together :) Beyond RL infra, we also discussed Online vs Simulated rollouts, self-summarization for long-horizon agents, environment design ("the most powerful RL environment is the product itself"), and other technical nuggets. PS: We filmed this episode before the SpaceX news, while the Cursor team was still compute-constrained. While Cursor now has *all* the flops, the takeaways and hurdles crossed ring true for any serious application-level company that is racing to post-train their own models. I believe that more serious application companies will go the way of Cursor and post-train their own models. 00:00 Introduction 00:53 Why Cursor Trained Composer 2 04:55 Specialization vs Bitter Lesson 06:16 Composer 2 Training Recipe 16:32 Scaling RL Infrastructure Globally 23:32 Floating Point Drift 25:11 MoE Sensitivity Explained 26:25 Router Replay Fix 27:19 Real Time RL Loop 31:49 Long Horizon Agents 34:29 Why RL Everywhere 37:34 LLM as Judge Rewards 39:14 RL in Hard Domains 40:13 Build Your Own Environments 44:34 Closing Thoughts

Today's Training Data episode takes us BTS on the infrastructure challenges required to do large RL runs at scale, featuring Federico Cassano (Composer Lead at Cursor) and Dmytro Dzhulgakov (Co-Founder at Fireworks AI). The Cursor team trained Composer 2 on Fireworks by starting with a strong base model (Kimi 2.5) and performing large-scale mid-training on code tokens and web data to learn common patterns and libraries, followed by a large-scale Reinforcement Learning run to learn how to navigate the Cursor harness, call tools, and write correct code. Today's episode dives into the systems and infrastructure challenges of making that large RL run happening, and there were many (!!), from numerical mismatch to global distribution to synchronizing rollouts across asynchronous pipelines to keeping track of expert activation across runs and more. Extremely nerdy in-the-weeds challenges that Federico and Dima were delighted to nerd out on together :) Beyond RL infra, we also discussed Online vs Simulated rollouts, self-summarization for long-horizon agents, environment design ("the most powerful RL environment is the product itself"), and other technical nuggets. PS: We filmed this episode before the SpaceX news, while the Cursor team was still compute-constrained. While Cursor now has all the flops, the takeaways and hurdles crossed ring true for any serious application-level company that is racing to post-train their own models. I believe that more serious application companies will go the way of Cursor and post-train their own models. 00:00 Introduction 00:53 Why Cursor Trained Composer 2 04:55 Specialization vs Bitter Lesson 06:16 Composer 2 Training Recipe 16:32 Scaling RL Infrastructure Globally 23:32 Floating Point Drift 25:11 MoE Sensitivity Explained 26:25 Router Replay Fix 27:19 Real Time RL Loop 31:49 Long Horizon Agents 34:29 Why RL Everywhere 37:34 LLM as Judge Rewards 39:14 RL in Hard Domains 40:13 Build Your Own Environments 44:34 Closing Thoughts

Sonya Huang 🐥

77,958 Aufrufe • vor 1 Monat

Today we’re launching INTELLECT-2: The first decentralized 32B-parameter RL training run open to join for anyone with compute — fully permissionless. Scaling towards frontier reasoning across coding, math and science.

Today we’re launching INTELLECT-2: The first decentralized 32B-parameter RL training run open to join for anyone with compute — fully permissionless. Scaling towards frontier reasoning across coding, math and science.

Prime Intellect

353,663 Aufrufe • vor 1 Jahr

#mixtral #mistral #LLM360 Serving Mixtral and LLM360 on FEDML Nexus AI ( We offer Mixtral model endpoints the cheapest in the market: only $0.0005 / 1K tokens! FEDML embraces open source and open model weights. We believe the future of AI belongs to large-scale open collaboration. Today we are excited to support new advances in open-source foundation models: Mixtral, the latest open-source LLM beating Llama2-70B with Mixture-of-Experts (MoE) architecture, and Amber and CrystalCoder backed by LLM360, the framework for open-source LLMs to foster transparency, trust, and collaborative research. Compared to existing fragmented ML products in the market, FEDML Nexus AI is the next-gen cloud service for LLM and Generative AI. It provides an end-to-end platform backed by serverless/decentralized AI infrastructure. Specifically: 1. Economical Serving Engine, ScaleLLM, is where you run your model in cheaper price by optimizing GPU memory and with fully optimized throughput for supporting more concurrent requests. 2. FEDML® Deploy simplifies CLI and MLOps workflow for model deployment on a serverless GPU cloud or on-premise cluster. 3. Serverless Endpoint runs on serverless GPU clouds. With our pay per use policy, we abstract the responsibility of acquiring or leasing an extensive GPU inventory when your are uncertain about your future AI service traffic. The autoscaling feature seamlessly adjusts the backend GPU resources in response to your service traffic. 4. On-premise Deployment helps you own your LLM model on your local environment with AI safety support. 5. FEDML® Launch for serverless GPU clouds. With one-line CLI, it swiftly pairs AI jobs with the most economical GPU resources, auto-provisions, and effortlessly runs the job, abstracting complex environment setup and management. 6. Zero-code Fine-tuning supported by FEDML® Studio optimizes your model on your domain-specific data without writing any line of source code. 7. Pre-training LLM supports cluster management and experimental tracking. You maintain your training clusters for your urgent needs in your vertical domain. As a closing note, FEDML is gearing up to unveil a cutting-edge service for LLM-based agents and our own cost-effective LLM. Please stay tuned and keep an eye out for upcoming announcements!

#mixtral #mistral #LLM360 Serving Mixtral and LLM360 on FEDML Nexus AI ( We offer Mixtral model endpoints the cheapest in the market: only $0.0005 / 1K tokens! FEDML embraces open source and open model weights. We believe the future of AI belongs to large-scale open collaboration. Today we are excited to support new advances in open-source foundation models: Mixtral, the latest open-source LLM beating Llama2-70B with Mixture-of-Experts (MoE) architecture, and Amber and CrystalCoder backed by LLM360, the framework for open-source LLMs to foster transparency, trust, and collaborative research. Compared to existing fragmented ML products in the market, FEDML Nexus AI is the next-gen cloud service for LLM and Generative AI. It provides an end-to-end platform backed by serverless/decentralized AI infrastructure. Specifically: 1. Economical Serving Engine, ScaleLLM, is where you run your model in cheaper price by optimizing GPU memory and with fully optimized throughput for supporting more concurrent requests. 2. FEDML® Deploy simplifies CLI and MLOps workflow for model deployment on a serverless GPU cloud or on-premise cluster. 3. Serverless Endpoint runs on serverless GPU clouds. With our pay per use policy, we abstract the responsibility of acquiring or leasing an extensive GPU inventory when your are uncertain about your future AI service traffic. The autoscaling feature seamlessly adjusts the backend GPU resources in response to your service traffic. 4. On-premise Deployment helps you own your LLM model on your local environment with AI safety support. 5. FEDML® Launch for serverless GPU clouds. With one-line CLI, it swiftly pairs AI jobs with the most economical GPU resources, auto-provisions, and effortlessly runs the job, abstracting complex environment setup and management. 6. Zero-code Fine-tuning supported by FEDML® Studio optimizes your model on your domain-specific data without writing any line of source code. 7. Pre-training LLM supports cluster management and experimental tracking. You maintain your training clusters for your urgent needs in your vertical domain. As a closing note, FEDML is gearing up to unveil a cutting-edge service for LLM-based agents and our own cost-effective LLM. Please stay tuned and keep an eye out for upcoming announcements!

TensorOpera AI

90,271 Aufrufe • vor 2 Jahren

New blackboard lecture w Eric Jang He walks through how to build AlphaGo from scratch, but with modern AI tools. Sometimes you understand the future better by stepping backward. AlphaGo is still the cleanest worked example of the primitives of intelligence: search, learning from experience, and self-play. You have to go back to 2017 to get insight into how the more general AIs of the future might learn. Once he explained how AlphaGo works, it gave us the context to have a discussion about how RL works in LLMs and how it could work better – naive policy gradient RL has to figure out which of the 100k+ tokens in your trajectory actually got you the right answer, while AlphaGo’s MCTS suggests a strictly better action every single move, giving you a training target that sidesteps the credit assignment problem. The way humans learn is surely closer to the second. Eric also kickstarted an Autoresearch loop on his project. And it was very interesting to discuss which parts of AI research LLMs can already automate pretty well (implementing and running experiments, optimizing hyperparameters) and which they still struggle with (choosing the right question to investigate next, escaping research dead ends). Informative to all the recent discussion about when we should expect an intelligence explosion, and what it would look like from the inside. Timestamps: 0:00:00 – Basics of Go 0:08:06 – Monte Carlo Tree Search 0:31:53 – What the neural network does 1:00:22 – Self-play 1:25:27 – Alternative RL approaches 1:45:36 – Why doesn’t MCTS work for LLMs 2:00:58 – Off-policy training 2:11:51 – RL is even more information inefficient than you thought 2:22:05 – Automated AI researchers

New blackboard lecture w Eric Jang He walks through how to build AlphaGo from scratch, but with modern AI tools. Sometimes you understand the future better by stepping backward. AlphaGo is still the cleanest worked example of the primitives of intelligence: search, learning from experience, and self-play. You have to go back to 2017 to get insight into how the more general AIs of the future might learn. Once he explained how AlphaGo works, it gave us the context to have a discussion about how RL works in LLMs and how it could work better – naive policy gradient RL has to figure out which of the 100k+ tokens in your trajectory actually got you the right answer, while AlphaGo’s MCTS suggests a strictly better action every single move, giving you a training target that sidesteps the credit assignment problem. The way humans learn is surely closer to the second. Eric also kickstarted an Autoresearch loop on his project. And it was very interesting to discuss which parts of AI research LLMs can already automate pretty well (implementing and running experiments, optimizing hyperparameters) and which they still struggle with (choosing the right question to investigate next, escaping research dead ends). Informative to all the recent discussion about when we should expect an intelligence explosion, and what it would look like from the inside. Timestamps: 0:00:00 – Basics of Go 0:08:06 – Monte Carlo Tree Search 0:31:53 – What the neural network does 1:00:22 – Self-play 1:25:27 – Alternative RL approaches 1:45:36 – Why doesn’t MCTS work for LLMs 2:00:58 – Off-policy training 2:11:51 – RL is even more information inefficient than you thought 2:22:05 – Automated AI researchers

Dwarkesh Patel

694,635 Aufrufe • vor 1 Monat

We asked Sholto Douglas from Anthropic about the costs of RL (Reinforcement Learning) runs. "In Dario Amodei's essay, he said that RL runs cost only $1M back in December." "RL is a more naively parallelizable and scalable than pre-training." "With pre-training, you need everything in one big data center ideally. For RL, in theory, you could scale all over the world."

We asked Sholto Douglas from Anthropic about the costs of RL (Reinforcement Learning) runs. "In Dario Amodei's essay, he said that RL runs cost only $1M back in December." "RL is a more naively parallelizable and scalable than pre-training." "With pre-training, you need everything in one big data center ideally. For RL, in theory, you could scale all over the world."

TBPN

76,634 Aufrufe • vor 1 Jahr