Today's Training Data episode takes us behind the scenes on the infrastructure challenges of large-scale RL runs, featuring Federico Cassano (Composer Lead at Cursor) and Dmytro Dzhulgakov (Co-Founder at Fireworks AI). The Cursor team trained Composer 2 on Fireworks by starting with a strong base model (Kimi...

77,483 views • 1 month ago • via X (Twitter)

Related Videos

Why AI Can Now Make Discoveries - my conversation with Dan Roberts, Lead of the Foundations of Reinforcement Learning team at OpenAI
00:00 Intro: AI's wild week in mathematics
01:21 What OpenAI's Foundations of RL team does
03:08 Dan's journey: from black holes and quantum gravity to frontier AI
07:04 Are AI systems becoming useful for real science
08:21 The AI math moment: Erdős, OpenAI, DeepMind, and Anthropic
08:52 Why the OpenAI result was an act of exploration
10:25 OpenAI vs. DeepMind: informal reasoning vs. formal proof
12:13 RL 101: learning by doing, not just watching
15:10 Why reinforcement learning works
15:58 How RL breaks: sparse feedback and long-horizon tasks
17:03 RLHF: how human feedback shaped early language models
18:48 Move 37, self-play, and the search for novel strategies
22:16 Explore vs. exploit in scientific discovery
24:49 Why RL may now be "the cake," not the cherry on top
25:46 Why RL started working with large language models
27:29 Is RL "sucking supervision through a straw"?
28:47 Why language may be the grounding layer for intelligence
31:46 A contrarian take on the Bitter Lesson
32:41 What test-time compute actually is
34:50 How RL gives models the ability to think
35:40 Verifiable rewards, math, coding, and the messy real world
38:00 What physics can teach us about AI
42:08 Is there a thermodynamics of AI?
43:08 From Erdős problems to Einstein-level AI
45:16 Is AI already doing original science?
45:51 How far are we from AI automating AI research
47:41 Why Dan is excited about the future of science

Matt Turck

64,094 views • 23 days ago

Failing to Understand the Exponential, Again? My conversation with Julian Schrittwieser - Julian Schrittwieser (Anthropic, AlphaGo Zero, MuZero) - on Move 37, Scaling RL, Nobel Prize for AI, and the AI frontier:
00:00 - Cold open: “We’re not seeing any slowdown.”
00:32 - Intro: Meet Julian
01:09 - The “exponential” from inside frontier labs
04:46 - 2026–2027: agents that work a full day; expert-level breadth
08:58 - Benchmarks vs reality: long-horizon work, GDP-Val, user value
10:26 - Move 37: what actually happened and why it mattered
13:55 - Novel science: AlphaCode/AlphaTensor → when does AI earn a Nobel?
16:25 - Discontinuity vs smooth progress (and warning signs)
19:08 - Does pre-training + RL get us there? (AGI debates aside)
20:55 - Sutton’s “RL from scratch”? Julian’s take
23:03 - Julian’s path: Google → DeepMind → Anthropic
26:45 - AlphaGo (learn + search) in plain English
30:16 - AlphaGo Zero (no human data)
31:00 - AlphaZero (one algorithm: Go, chess, shogi)
31:46 - MuZero (planning with a learned world model)
33:23 - Lessons for today’s agents: search + learning at scale
34:57 - Do LLMs already have implicit world models?
39:02 - Why RL on LLMs took time (stability, feedback loops)
41:43 - Compute & scaling for RL: what we see so far
42:35 - Rewards frontier: human prefs, rubrics, RLVR, process rewards
44:36 - RL training data & the “flywheel” (and why quality matters)
48:02 - RL & Agents 101: why RL unlocks robustness
50:51 - Should builders use RL-as-a-service? Or just tools + prompts?
52:18 - What’s missing for dependable agents (capability vs engineering)
53:51 - Evals & Goodhart: internal vs external benchmarks
57:35 - Mechanistic interpretability & “Golden Gate Claude”
1:00:03 - Safety & alignment at Anthropic: how it shows up in practice
1:03:48 - Jobs: human–AI complementarity (comparative advantage)
1:06:33 - Inequality, policy, and the case for 10× productivity → abundance
1:09:24 - Closing thoughts

Matt Turck

235,526 views • 8 months ago

Cursor Complete Guide for AI Coding...
1. The Basics, Composer, Cursor 2.0, Why use Cursor?
2. Multiple Agent Testing, Adding Database, Deploying to Vercel
3. Comparing the big 4: v0, Replit, Lovable, Cursor
And more... with Senior Software Engineer Kehan Zhang
TIME STAMPS
---------------
1. BASICS:
00:00 Introduction
01:01 Overview of Cursor and Its Features
01:47 Getting Started with Cursor
02:39 Understanding IDE and Vibe Coding
06:00 Cursor For Mobile Apps
10:26 Downloading and Installing Cursor
11:17 Creating and Managing Projects in Cursor
15:14 Building a Simple Game with Cursor
19:10 Advanced Features and Customization
40:28 Fixing Styling Rules
40:53 Redesigning the App
42:17 Exploring Cursor 2.0 Features
43:22 Setting Up the Project Structure
44:17 Adding and Testing Meme Templates
46:08 Debugging Text Issues
2. ADVANCED
49:46 Using Multiple Agents
01:10:40 Creating Custom Commands
01:14:15 Creating Commands in Settings Tab
01:15:11 Introduction to Instant DB
01:16:04 Setting Up Instant DB in Your Project
01:18:24 Building a Full Stack Application
01:19:04 Using the Agent to Plan and Build
01:26:06 Testing and Debugging the Application
01:53:02 Deploying the Application with Vercel
01:55:35 Setting Up the CLI
01:56:15 Understanding Command Line Interfaces (CLI)
01:57:32 Deploying Code to Vercel
01:58:07 Handling Environment Variables
01:58:44 Interacting with the Vercel Deployment
02:00:34 Exploring Cursor's Capabilities
3. COMPARING VIBE CODING TOOLS
02:09:48 Comparing Vibe Coding Tools
02:31:04 Final Thoughts and Recommendations

Riley Brown

65,179 views • 7 months ago

New Course: Post-training of LLMs

Learn to post-train and customize an LLM in this short course, taught by Banghua Zhu, Assistant Professor at the University of Washington, and co-founder of @NexusflowX.

Training an LLM to follow instructions or answer questions has two key stages: pre-training and post-training. In pre-training, it learns to predict the next word or token from large amounts of unlabeled text. In post-training, it learns useful behaviors such as following instructions, tool use, and reasoning.

Post-training transforms a general-purpose token predictor, trained on trillions of unlabeled text tokens, into an assistant that follows instructions and performs specific tasks. Because it is much cheaper than pre-training, it is practical for many more teams to incorporate post-training methods into their workflows than pre-training.

In this course, you’ll learn three common post-training methods, Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Online Reinforcement Learning (RL), and how to use each one effectively. With SFT, you train the model on pairs of input and ideal output responses. With DPO, you provide both a preferred (chosen) and a less preferred (rejected) response and train the model to favor the preferred output. With RL, the model generates an output, receives a reward score based on human or automated feedback, and is updated to improve performance.

You’ll learn the basic concepts, common use cases, and principles for curating high-quality data for effective training. Through hands-on labs, you’ll download a pre-trained model from Hugging Face and post-train it using SFT, DPO, and RL to see how each technique shapes model behavior.

In detail, you’ll:
- Understand what post-training is, when to use it, and how it differs from pre-training.
- Build an SFT pipeline to turn a base model into an instruct model.
- Explore how DPO reshapes behavior by minimizing a contrastive loss: penalizing poor responses and reinforcing preferred ones.
- Implement a DPO pipeline to change the identity of a chat assistant.
- Learn online RL methods such as Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), and how to design reward functions.
- Train a model with GRPO to improve its math capabilities using a verifiable reward.

Post-training is one of the most rapidly developing areas of LLM training. Whether you’re building a high-accuracy context-specific assistant, fine-tuning a model's tone, or improving task-specific accuracy, this course will give you experience with the most important techniques shaping how LLMs are post-trained today.

Please sign up here:

Andrew Ng

125,146 views • 11 months ago
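
The post-training course description above mentions DPO's contrastive loss and GRPO's use of a reward score. As a rough illustration only (not the course's code), the two core quantities might be sketched as below. The function names `dpo_loss` and `grpo_advantages`, the `beta` and `eps` defaults, and the use of the population standard deviation are assumptions made for this sketch; real implementations work on batched tensors and differ in such details.

```python
import math
from statistics import mean, pstdev


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(margin)).

    Inputs are summed log-probabilities of the chosen and rejected
    responses under the policy being trained and under a frozen
    reference model. beta=0.1 is an assumed default for illustration.
    """
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # Numerically stable softplus(-margin), equal to -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages for responses sampled from one prompt.

    Each reward is centered on the group mean and scaled by the group
    standard deviation (population std is an assumption here); eps
    avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # The policy prefers the chosen response more than the reference does,
    # so the loss falls below log(2), its value when the margin is zero.
    print(dpo_loss(-1.0, -5.0, -2.0, -2.0))
    # Verifiable reward: 1.0 for a correct answer, 0.0 otherwise.
    print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))
```

In this sketch, correct answers in a group receive positive advantages and incorrect ones negative advantages, so the update favors responses that beat the group average rather than relying on a separate value model.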