
🚨 RL for LLMs is finally accessible. Introducing OpenTinker: The first community-driven, open-source framework designed to democratize Reinforcement Learning for LLMs. Inspired by Thinking Machines's amazing Tinker, we realize the biggest bottleneck in agentic LLM research isn’t the math—it’s the setup. Current RL pipelines are messy. Configuring VeRL for...

58,120 views • 6 months ago • via X (Twitter)


Similar Videos

New Course: Post-training of LLMs

Learn to post-train and customize an LLM in this short course, taught by Banghua Zhu, Assistant Professor at the University of Washington and co-founder of @NexusflowX.

Training an LLM to follow instructions or answer questions has two key stages: pre-training and post-training. In pre-training, it learns to predict the next word or token from large amounts of unlabeled text. In post-training, it learns useful behaviors such as following instructions, tool use, and reasoning. Post-training transforms a general-purpose token predictor, trained on trillions of unlabeled text tokens, into an assistant that follows instructions and performs specific tasks. Because it is much cheaper than pre-training, it is practical for many more teams to incorporate post-training methods into their workflows than pre-training.

In this course, you’ll learn three common post-training methods (Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Online Reinforcement Learning (RL)) and how to use each one effectively. With SFT, you train the model on pairs of input and ideal output responses. With DPO, you provide both a preferred (chosen) and a less preferred (rejected) response and train the model to favor the preferred output. With RL, the model generates an output, receives a reward score based on human or automated feedback, and updates the model to improve performance.

You’ll learn the basic concepts, common use cases, and principles for curating high-quality data for effective training. Through hands-on labs, you’ll download a pre-trained model from Hugging Face and post-train it using SFT, DPO, and RL to see how each technique shapes model behavior.

In detail, you’ll:
- Understand what post-training is, when to use it, and how it differs from pre-training.
- Build an SFT pipeline to turn a base model into an instruct model.
- Explore how DPO reshapes behavior by minimizing contrastive loss, penalizing poor responses and reinforcing preferred ones.
- Implement a DPO pipeline to change the identity of a chat assistant.
- Learn online RL methods such as Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), and how to design reward functions.
- Train a model with GRPO to improve its math capabilities using a verifiable reward.

Post-training is one of the most rapidly developing areas of LLM training. Whether you’re building a high-accuracy context-specific assistant, fine-tuning a model's tone, or improving task-specific accuracy, this course will give you experience with the most important techniques shaping how LLMs are post-trained today.

Please sign up here:

Andrew Ng

125,146 views • 11 months ago
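
The course post above describes DPO as training the model to favor a chosen response over a rejected one, and GRPO as an online RL method driven by reward scores. As a rough illustration only (not the course's lab code), the two core computations can be sketched in plain Python. The function names, the `beta` value, the `eps` constant, and the use of population standard deviation are all assumptions made for this sketch; the inputs to `dpo_loss` are assumed to be summed sequence log-probabilities from the policy and from a frozen reference model.

```python
import math
from statistics import mean, pstdev


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for a single (chosen, rejected) preference pair.

    beta is a hypothetical default; it scales how strongly the policy
    is pushed away from the reference model.
    """
    # How much more the policy prefers each response than the reference does.
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(margin)) is the same as softplus(-margin).
    return softplus(-margin)


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize rewards within one group of samples.

    Uses the population standard deviation; some implementations use the
    sample standard deviation instead. eps (an assumed constant) avoids
    division by zero when all rewards in the group are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Policy identical to the reference: the loss is log(2).
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
    # Policy has shifted toward the chosen response: the loss is lower.
    print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
    # Four sampled answers to one math prompt, scored 1 if verified correct.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

In this sketch, correct answers in a group receive positive advantages and incorrect ones negative advantages, so a verifiable 0/1 reward is enough to produce a training signal without a separate value model.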

Today's Training Data episode takes us BTS on the infrastructure challenges required to do large RL runs at scale, featuring Federico Cassano (Composer Lead at Cursor) and Dmytro Dzhulgakov (Co-Founder at Fireworks AI).

The Cursor team trained Composer 2 on Fireworks by starting with a strong base model (Kimi 2.5) and performing large-scale mid-training on code tokens and web data to learn common patterns and libraries, followed by a large-scale Reinforcement Learning run to learn how to navigate the Cursor harness, call tools, and write correct code.

Today's episode dives into the systems and infrastructure challenges of making that large RL run happen, and there were many (!!), from numerical mismatch to global distribution to synchronizing rollouts across asynchronous pipelines to keeping track of expert activation across runs and more. Extremely nerdy in-the-weeds challenges that Federico and Dima were delighted to nerd out on together :)

Beyond RL infra, we also discussed Online vs Simulated rollouts, self-summarization for long-horizon agents, environment design ("the most powerful RL environment is the product itself"), and other technical nuggets.

PS: We filmed this episode before the SpaceX news, while the Cursor team was still compute-constrained. While Cursor now has *all* the flops, the takeaways and hurdles crossed ring true for any serious application-level company that is racing to post-train its own models. I believe that more serious application companies will go the way of Cursor and post-train their own models.

00:00 Introduction
00:53 Why Cursor Trained Composer 2
04:55 Specialization vs Bitter Lesson
06:16 Composer 2 Training Recipe
16:32 Scaling RL Infrastructure Globally
23:32 Floating Point Drift
25:11 MoE Sensitivity Explained
26:25 Router Replay Fix
27:19 Real Time RL Loop
31:49 Long Horizon Agents
34:29 Why RL Everywhere
37:34 LLM as Judge Rewards
39:14 RL in Hard Domains
40:13 Build Your Own Environments
44:34 Closing Thoughts

Sonya Huang 🐥

77,958 views • 1 month ago

#mixtral #mistral #LLM360

Serving Mixtral and LLM360 on FEDML Nexus AI. We offer the cheapest Mixtral model endpoints on the market: only $0.0005 / 1K tokens!

FEDML embraces open source and open model weights. We believe the future of AI belongs to large-scale open collaboration. Today we are excited to support new advances in open-source foundation models: Mixtral, the latest open-source LLM beating Llama2-70B with Mixture-of-Experts (MoE) architecture, and Amber and CrystalCoder backed by LLM360, the framework for open-source LLMs to foster transparency, trust, and collaborative research.

Compared to existing fragmented ML products in the market, FEDML Nexus AI is the next-gen cloud service for LLM and Generative AI. It provides an end-to-end platform backed by serverless/decentralized AI infrastructure. Specifically:

1. Economical Serving Engine, ScaleLLM, lets you run your model at a lower price by optimizing GPU memory use and throughput to support more concurrent requests.
2. FEDML® Deploy simplifies CLI and MLOps workflow for model deployment on a serverless GPU cloud or on-premise cluster.
3. Serverless Endpoint runs on serverless GPU clouds. With our pay-per-use policy, we take on the responsibility of acquiring or leasing an extensive GPU inventory when you are uncertain about your future AI service traffic. The autoscaling feature seamlessly adjusts the backend GPU resources in response to your service traffic.
4. On-premise Deployment helps you own your LLM in your local environment with AI safety support.
5. FEDML® Launch for serverless GPU clouds. With a one-line CLI command, it swiftly pairs AI jobs with the most economical GPU resources, auto-provisions, and effortlessly runs the job, abstracting away complex environment setup and management.
6. Zero-code Fine-tuning supported by FEDML® Studio optimizes your model on your domain-specific data without writing a single line of source code.
7. Pre-training LLM supports cluster management and experiment tracking. You maintain your training clusters for your urgent needs in your vertical domain.

As a closing note, FEDML is gearing up to unveil a cutting-edge service for LLM-based agents and our own cost-effective LLM. Please stay tuned and keep an eye out for upcoming announcements!

TensorOpera AI

90,271 views • 2 years ago

New blackboard lecture w Eric Jang.

He walks through how to build AlphaGo from scratch, but with modern AI tools. Sometimes you understand the future better by stepping backward. AlphaGo is still the cleanest worked example of the primitives of intelligence: search, learning from experience, and self-play. You have to go back to 2017 to get insight into how the more general AIs of the future might learn.

Once he explained how AlphaGo works, it gave us the context to have a discussion about how RL works in LLMs and how it could work better: naive policy gradient RL has to figure out which of the 100k+ tokens in your trajectory actually got you the right answer, while AlphaGo’s MCTS suggests a strictly better action every single move, giving you a training target that sidesteps the credit assignment problem. The way humans learn is surely closer to the second.

Eric also kickstarted an Autoresearch loop on his project. And it was very interesting to discuss which parts of AI research LLMs can already automate pretty well (implementing and running experiments, optimizing hyperparameters) and which they still struggle with (choosing the right question to investigate next, escaping research dead ends). Informative to all the recent discussion about when we should expect an intelligence explosion, and what it would look like from the inside.

Timestamps:
0:00:00 – Basics of Go
0:08:06 – Monte Carlo Tree Search
0:31:53 – What the neural network does
1:00:22 – Self-play
1:25:27 – Alternative RL approaches
1:45:36 – Why doesn’t MCTS work for LLMs
2:00:58 – Off-policy training
2:11:51 – RL is even more information inefficient than you thought
2:22:05 – Automated AI researchers

Dwarkesh Patel

694,635 views • 1 month ago
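
The contrast drawn in the post above, between naive policy gradient and an MCTS-derived training target, can be made concrete with a small sketch. This is an illustration under stated assumptions, not code from the lecture: the function names are hypothetical, the policy-gradient side assumes a single reward at the end of the trajectory, and the search side assumes the AlphaGo Zero style of training the policy toward normalized visit counts.

```python
import math


def terminal_reward_token_weights(num_tokens, final_reward, baseline=0.0):
    """Per-token weights in a naive policy-gradient update.

    With one reward at the end of the trajectory, every token's
    log-probability gradient is scaled by the same number, so the update
    cannot tell which tokens actually mattered.
    """
    advantage = final_reward - baseline
    return [advantage] * num_tokens


def visit_count_policy_target(visit_counts):
    """Turn search visit counts for one position into a target distribution."""
    total = sum(visit_counts)
    return [count / total for count in visit_counts]


def cross_entropy(target, predicted):
    """Cross-entropy between the search target and the network's policy."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)


if __name__ == "__main__":
    # Five tokens, one success signal: all five get identical credit.
    print(terminal_reward_token_weights(5, 1.0))
    # One position, three legal moves: search gives a per-move target.
    target = visit_count_policy_target([30, 10, 60])
    uniform_policy = [1 / 3, 1 / 3, 1 / 3]
    print(target, cross_entropy(target, uniform_policy))
```

The first function returns one shared scalar for the whole trajectory, while the second yields a separate supervised target at every decision point, which is the sense in which a search-based target sidesteps credit assignment.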