New Benchtalks with John Yang: on ProgramBench (0% for frontier models at launch) and the lineage and future of coding benchmarks, from SWE-bench/InterCode to now.

01:29 ProgramBench launch and reception
03:41 Why artifact-level evaluation, not code-level
06:03 Why models love Python
08:29 ProgramBench as a research tool
12:45 From SWE-bench & InterCode... to ProgramBench
17:47 How to grade a coding model
21:53 The position paper & humans in the loop
25:01 Managing quality with agents-in-the-loop
28:40 Internet access and benchmark integrity
35:26 Where models may surpass human abilities
38:56 When a model hits 80% on ProgramBench
43:55 Benchmarks worth paying attention to
46:24 What benchmark do you wish existed
49:32 Will benchmarks still look like benchmarks in 5 years
52:02 How to contribute to ProgramBench

vincent sunn chen

1,458 subscribers

26,170 views • 22 days ago • via X (Twitter)

Related videos

Terminal-Bench 2.0 went from ~25% → 80% in four months and became the standard eval for frontier CLI agents. Now, TB3 is in the works. I talked to Alex Shaw about what happens when model capabilities climb faster than we can measure them. His answer: the benchmark factory (Harbor Framework)— infrastructure to develop hard, representative evals at the pace that the frontier moves. As Alex put it: "we need a thousand times more benchmarks than we have right now." 00:23 - How quickly models hill-climbed TB2 01:46 - What rapid progress reveals about benchmarks vs. real-world capability 03:28 - What made Terminal-Bench stick 04:58 - Why the terminal is the right abstraction for agentic AI 07:14 - How TB2 maintains task quality at scale 09:23 - Managing benchmark integrity in a benchmaxxing world 10:47 - Harbor: from experiment to benchmark factory 12:19 - What Harbor does that nothing else did 14:37 - The invariants: what won't change as agent evals evolve 16:55 - The benchmark Alex most wants to see built 18:18 - The ideal human-in-the-loop task creation flywheel 20:32 - How to contribute to Terminal-Bench 3.0

vincent sunn chen

11,500 views • 2 months ago

Are AI benchmarks doomed? Greg Burnham and Tom Adamczewski join Anson Ho to push back on benchmark pessimism and dig into what the next generation of AI benchmarks could look like. (0:00:00) - Preview (0:00:36) - Intro: Are AI benchmarks doomed? (0:03:13) - The costs and benefits of benchmark development (0:11:48) - MirrorCode and scalable benchmarks (0:20:57) - AI speed-up in benchmark development (0:23:28) - The benchmark-reality gap (0:38:26) - Can an AGI benchmark exist? (0:43:18) - Beyond automated scoring (1:00:45) - How AI changes benchmark building in practice

Epoch AI

22,037 views • 1 month ago

Get the inside story on the development of Gemini's coding capabilities. Listen as the product and research leads for Gemini share their philosophy on what makes a great coding model, the impact of "vibe coding," and the future of programming languages with Logan Kilpatrick, Connie Fan and Danny Tarlow. Timecodes: 0:00 Intro 1:10 Defining Early Coding Goals 6:23 Ingredients of a Great Coding Model 9:28 Adapting to Developer Workflows 11:40 The Rise of Vibe Coding 14:43 Code as a Reasoning Tool 17:20 Code as a Universal Solver 20:47 Evaluating Coding Models 24:30 Leveraging Internal Googler Feedback 26:52 Winning Over AI Skeptics 28:04 Performance Across Programming Languages 33:05 The Future of Programming Languages 36:16 Strategies for Large Codebases 41:06 Hill Climbing New Benchmarks 42:46 Short-Term Improvements 44:42 Model Style and Taste 47:43 2.5 Pro’s Breakthrough 51:06 Early AI Coding Experiences 56:19 Specialist vs. Generalist Models

Google AI Developers

65,474 views • 1 year ago

François Chollet has spent years asking a different question than most of the AI world. Instead of scaling what already works, he’s trying to understand what intelligence actually is and how to build it from first principles. In this episode of the Lightcone Podcast, he traces that path from his early work on deep learning to the creation of the ARC Prize, and the launch of ARC V3, a new benchmark designed to measure something deeper than performance: the ability to learn, adapt, and reason efficiently in entirely new environments. He explains why today’s systems may be hitting limits, what recent breakthroughs really mean, and why reaching true general intelligence may require a fundamentally different approach. 00:00 - AGI by 2030? 00:31 - Introducing Ndea: A New Path Beyond Deep Learning 01:08 - A New ML Paradigm 01:30 - Replacing neural nets with compact symbolic programs 03:04 - Why Ndea Isn’t Competing With Coding Agents 05:20 - Why Everyone Might Be Wrong About Scaling LLMs 07:22 - Why Coding Agents Suddenly Work So Well 08:50 - The Limits of LLMs in Non-Verifiable Domains 10:48 - What AGI Actually Means (And Why Most Definitions Are Wrong) 13:30 - Why Deep Learning Hits a Wall 14:00 - ARC’s Origin Story 18:20 - ARC Benchmarks Explained: From V1 to V3 22:49 - The RL Loop Powering Coding Agents Today 27:03 - ARC-AGI V3: Measuring “Agentic Intelligence” 31:14 - Inside the ARC Game Studio 35:31 - Could AGI Fit in 10,000 Lines of Code? 44:01 - Building Ndea: From Idea to Compounding Research Stack 46:46 - The Future of ARC: Benchmarks That Evolve With AI 47:21 - Why There’s Still Huge Opportunity for New AI Paradigms 53:37 - How to Build a Breakout Open Source Project - Lessons From Keras 56:39 - Advice For How To Think About AI

Y Combinator

151,141 views • 3 months ago

We are excited to share that Logan Kilpatrick joined us on The Bench to discuss Google's new Gemini 3.5 Flash: why it's deliberately more persistent and capable than previous Flash models, how it hit #1 on our FinanceAgent Benchmark taking 82 steps where competitors stopped at 13, and what justifies the price increase. We also get into why AI benchmarks need a paradigm shift, the trade-off of building everything vs staying focused, the Pope, and why Omni might kill the Subway Surfers content era. 0:11:00 – Flash is being rebased for the agent era, not just a cheaper model anymore 0:14:03 – Persistence by design: 82 tool calls vs competitors' 13 0:17:52 – Why pricing went up and how Google thinks about value per token 0:22:55 – Coding performance: from 20th to 10th place in one generation 0:28:28 – Why benchmarks have historically been misleading and what the new era of evaluation looks like 0:29:28 Logan on why Google has the best researchers in the world 0:36:16 – The cost of being Google 0:39:07 – The Pope’s encyclical on AI and whether most people see frontier intelligence as a good thing 0:51:12 – Why Omni is the thing that recently clicked

Vals AI

16,345 views • 28 days ago

Traditional coding benchmarks do not reflect how software is actually built and maintained. That's why we built a new benchmark, APEX-SWE, in partnership with Cognition. It measures whether AI models can perform complex, real-world software engineering work to ship systems that work and debug them when they don't. OpenAI GPT 5.3 Codex (High) tops the leaderboard at 41.5% on Pass@1.

adarsh

207,839 views • 3 months ago

François Chollet on the ARC Prize and how we get to AGI. At AI Startup School in San Francisco. 00:00 - The Falling Cost of Compute 00:57 - Deep-Learning’s Scaling Era & Benchmarks 01:59 - The ARC Benchmark 03:02 - The 2024 Shift to Test-Time Adaptation 05:01 - What Is Intelligence? 07:12 - Why Benchmarks Matter (and Mislead) 08:57 - ARC 1 Exposes Scaling Limits 10:58 - ARC 2: Compositional Reasoning Arrives 12:55 - Humans vs. Models on ARC2 14:58 - Previewing ARC3 & Interactive Agency 17:00 - Kaleidoscopic Hypothesis and Abstractions 22:00 - Type 1 vs. Type 2 Abstractions 26:00 - Discrete Program Search & Inventive AI 29:00 - Fusing Intuition with Symbolic Reasoning 32:00 - Building AGI Through Meta-Learning Systems

Y Combinator

231,725 views • 11 months ago

ARC-AGI is redefining how to measure progress on the path to AGI - focusing on reasoning, generalization, and adaptability instead of memorization or scale. At NeurIPS 2025, YC's Diana sat down with ARC Prize President Greg Kamradt to find out why most AI benchmarks fail, how ARC-AGI reveals the limits of today’s models, and why measuring intelligence may be harder than building it. 00:11 — What ARC Prize is and why it exists 00:38 — François Chollet’s definition of AGI 01:48 — What ARC-AGI Actually Tests 02:25 — When LLMs Failed the ARC Benchmark 03:38 — ARC-AGI Becomes the Standard 04:49 — False Positives in AI Progress 06:06 — The Evolution of ARC-AGI 08:55 — Measuring Intelligence beyond just accuracy 10:25 — What happens if a model solves ARC-AGI?

Y Combinator

98,369 views • 6 months ago

Introducing Arena Mode in Windsurf: One prompt. Two models. Your vote. Benchmarks don't reflect real-world coding quality. The best model for you depends on your codebase and stack. So we made real-world coding the benchmark. Free for the next week. May the best model win.

Devin Desktop

1,061,420 views • 4 months ago

How does math research change when the cost of trying your first dumb idea goes to zero? Daniel Litt joins Greg Burnham and Anson Ho to discuss what today’s models can and can’t do in math, and how far they are from doing high-quality research. 0:00:00 What's the hardest math problem AI can solve today? 00:16:08 How helpful are today’s AI models for math research? 00:23:36 Junk papers, LLM-generated proofs, and the refereeing crisis 00:27:21 AI enables searching through problems at scale 00:33:49 When will AI be good enough to publish in top math journals? 00:42:15 What are the returns to intelligence? 00:59:50 Will AI solve Millennium problems? 01:11:54 Is math full of low-hanging fruit? 01:18:47 How Daniel has adapted his professional life to AI progress 01:25:28 What do AI math benchmarks actually measure? 01:33:05 Designing the Open Problems benchmark 01:56:35 Do mathematicians believe heuristic arguments about conjectures? 02:01:24 What if FrontierMath: Open Problems gets solved? 02:06:53 Is AI on the cusp of accelerating math progress?

Epoch AI

178,703 views • 4 months ago

How GPT-5 thinks, with OpenAI VP of Research Jerry Tworek 00:00 - Intro 01:01 - What Reasoning Actually Means in AI 02:32 - Chain of Thought: Models Thinking in Words 05:25 - How Models Decide How Long to Think 07:24 - Evolution from o1 to o3 to GPT-5 11:00 - The Road to OpenAI: Growing up in Poland, Dropping out of School, Trading 20:32 - Working on Robotics and Rubik's Cube Solving 23:02 - A Day in the Life: Talking to Researchers 24:06 - How Research Priorities Are Determined 26:53 - OpenAI's Culture of Transparency 29:32 - Balancing Research with Shipping Fast 31:52 - Using OpenAI's Own Tools Daily 32:43 - Pre-Training Plus RL: The Modern AI Stack 35:10 - Reinforcement Learning 101: Training Dogs 40:17 - The Evolution of Deep Reinforcement Learning 42:09 - When GPT-4 Seemed Underwhelming at First 45:39 - How RLHF Made GPT-4 Actually Useful 48:02 - Unsupervised vs Supervised Learning 49:59 - GRPO and How DeepSeek Accelerated US Research 53:05 - What It Takes to Scale Reinforcement Learning 55:36 - Agentic AI and Long-Horizon Thinking 59:19 - Alignment as an RL Problem 1:01:11 - Winning ICPC World Finals Without Specific Training 1:05:53 - Applying RL Beyond Math and Coding 1:09:15 - The Path from Here to AGI 1:12:23 - Pure RL vs Language Models

Matt Turck

451,229 views • 8 months ago

Why AI Can Now Make Discoveries - my conversation with Dan Roberts, Lead of the Foundations of Reinforcement Learning team at OpenAI 00:00 Intro: AI's wild week in mathematics 01:21 What OpenAI's Foundations of RL team does 03:08 Dan's journey: from black holes and quantum gravity to frontier AI 07:04 Are AI systems becoming useful for real science 08:21 The AI math moment: Erdős, OpenAI, DeepMind, and Anthropic 08:52 Why the OpenAI result was an act of exploration 10:25 OpenAI vs. DeepMind: informal reasoning vs. formal proof 12:13 RL 101: learning by doing, not just watching 15:10 Why reinforcement learning works 15:58 How RL breaks: sparse feedback and long-horizon tasks 17:03 RLHF: how human feedback shaped early language models 18:48 Move 37, self-play, and the search for novel strategies 22:16 Explore vs. exploit in scientific discovery 24:49 Why RL may now be "the cake," not the cherry on top 25:46 Why RL started working with large language models 27:29 Is RL "sucking supervision through a straw"? 28:47 Why language may be the grounding layer for intelligence 31:46 A contrarian take on the Bitter Lesson 32:41 What test-time compute actually is 34:50 How RL gives models the ability to think 35:40 Verifiable rewards, math, coding, and the messy real world 38:00 What physics can teach us about AI 42:08 Is there a thermodynamics of AI? 43:08 From Erdős problems to Einstein-level AI 45:16 Is AI already doing original science? 45:51 How far are we from AI automating AI research 47:41 Why Dan is excited about the future of science

Matt Turck

64,094 views • 21 days ago

My interview with Amazon Web Services CEO Matt Garman. 0:00 Intro 0:57 White Collar Jobs 8:51 How much of AWS's code is written by AI? 12:54 How to have a career in the AI era 15:43 AI Bottlenecks 18:06 Inference vs Training Growth 20:05 AWS Custom Silicon 25:50 Annapurna Acquisition 27:53 AI Models 33:35 Open vs Closed Models 41:28 Benchmarks 47:13 Agents

Matthew Berman

49,126 views • 10 months ago

As detailed in the Meta Movie Gen technical report, today we’re open sourcing Movie Gen Bench: two new media generation benchmarks that we hope will help to enable the AI research community to progress work on more capable audio and video generation models. Movie Gen Video Bench is the largest and most comprehensive benchmark ever released for evaluating text-to-video generation. It includes a collection of 1,000+ prompts that cover concepts ranging from detailed human activity to animals, physics, unusual subjects and more — with broad coverage across different motion levels. Movie Gen Audio Bench is a first-of-its-kind benchmark aimed at evaluating video-to-audio and (text+video)-to-audio generation. It includes 527 generated videos and associated sound effects and music prompts covering a diverse set of ambient environments and sound effects. To enable fair and easy comparison to our models for future works, these new benchmarks include non cherry-picked generated videos and audio from Movie Gen. In releasing these new benchmarks we hope to promote fair & extensive evaluations in media generation research to enable greater progress in this field.

AI at Meta

156,240 views • 1 year ago

Andon Labs' Real-World AI Evals: Claude calls the FBI, AI CEOs, price cartels, Butter-Bench, & Luna
Andon Labs cofounders Lukas Petersson and Axel Backlund explain why dollar-denominated evals reveal what traditional benchmarks miss, how Claude ended up reporting a $2/day vending machine fee to the FBI, why long-horizon agents spiral in weird ways, what happens when agents lie, form price cartels, and compete with each other, and why the future of AI safety may depend on testing models in messy real-world environments instead of clean benchmark sandboxes.

Latent.Space

12,957 views • 21 days ago

Sonnet 4.5 & the AI Plateau Myth — epic conversation with Sholto Douglas of Anthropic 0:00 - Intro 1:09 - What's Behind The Rapid Pace of AI Releases at Anthropic 2:49 - Opus, Sonnet, and Haiku Model Tiers 4:14 - Sholto's Story: From Australian Fencer to AI Researcher 12:01 - The YouTube Effect: Mastery Through Observation 16:16 - Breaking Into AI Research Without Traditional Academic Signals 18:29 - DeepMind, Gemini, and Building Inference Stacks 23:05 - Why Anthropic? Culture and Mission Differences Amongst AI Research Labs 25:08 - What Is "Taste" in AI Research? 31:46 - This Week's Big Launch: Sonnet 4.5, Best Coding Model in the World 36:40 - From 7 Hours to 30 Hours: The Long-Running AI Agent Breakthrough 38:41 - How AI Agents Self-Correct and Maintain Coherence 43:13 - The Role of Memory in Extended Coding Sessions 47:42 - Pre-Training vs. RL: Textbooks vs. Worked Problems 52:11 - Test-Time Compute & Reinforcement Learning 55:55 - Why RL Finally Started Working on LLMs in 2024 59:38 - The Path to AGI 1:02:05 - Are We Hitting a Plateau in AI? So Many Low Hanging Fruits 1:03:41 - Beyond Coding: GDPVal & Impact Economic Sectors 1:05:47 - Preparing for 10-100x Individual Leverage & The Upcoming Robotics Explosion

Matt Turck

93,678 views • 8 months ago

Thanksgiving-week treat: an epic conversation on Frontier AI with Lukasz Kaiser, co-author of “Attention Is All You Need” (Transformers) and leading research scientist at OpenAI working on GPT-5.1-era reasoning models. 00:00 – Cold open and intro 01:29 – “AI slowdown” vs a wild week of new frontier models 08:03 – Low-hanging fruit, infra, RL training and better data 11:39 – What is a reasoning model, in plain language 17:02 – Chain-of-thought and training the thinking process with RL 21:39 – Łukasz’s path: from logic and France to Google and Kurzweil 24:20 – Inside the Transformer story and what “attention” really means 28:42 – From Google Brain to OpenAI: culture, scale and GPUs 32:49 – What’s next for pre-training, GPUs and distillation 37:29 – Can we still understand these models? Circuits, sparsity and black boxes 39:42 – GPT-4 → GPT-5 → GPT-5.1: what actually changed 42:40 – Post-training, safety and teaching GPT-5.1 different tones 46:16 – How long should GPT-5.1 think? Reasoning tokens and jagged abilities 47:43 – The five-year-old’s dot puzzle that still breaks frontier models 52:22 – Generalization, child-like learning and whether reasoning is enough 53:48 – Beyond Transformers: ARC, LeCun’s ideas and multimodal bottlenecks 56:10 – GPT-5.1 Codex Max, long-running agents and compaction 1:00:06 – Will foundation models eat most apps? The translation analogy and trust 1:02:34 – What still needs to be solved, and where AI might go next

Matt Turck

167,926 views • 7 months ago

I asked Dan Martell to walk me through every level of making money with AI. He gave me the most simple, practical advice I've ever heard on this subject. Level 1 - Making $0 - $100k Level 2 - Making $1m - $10m Level 3 - Building a $10m++ enterprise. 0:00 Only 5% of the World Has Ever Paid for AI 0:46 The Easiest Thing to Sell With AI Right Now 1:56 The Marcus and Sophie Framework 4:24 Theory of Constraints (Right Problem to Solve) 5:33 What Is the Number One Business Constraint 7:13 How to Leave Your Job and Go All In 8:27 Business Is Simple Find a Problem and Solve It 9:08 Stop Getting Ready to Get Ready 9:33 The Sarah Story One Text and $10K 9:53 Pull Up Your Phone and Message Your Contacts 11:05 Dan's Son Gets His First Client at $800/Month 12:41 Best Employee vs. Best Employer 13:59 What Other Services Can You Sell With AI 14:44 Sales Is Not Talking It's Asking 17:01 What to Do When You Hate Your Business 18:40 Pain and Pleasure Are the Only Two Motivators 19:13 They Haven't Made It a Must Yet 20:29 Make It a Must Not a Nice to Have 21:06 The Jen Story and the Gasping Moment 22:17 How to Find Your First 10 to 15 Clients 28:38 The Personal Brand Play 33:06 Vision Is What AI Cannot Do 34:55 Hard for Computers Easy for Humans 36:13 Level 2 Making Your First Million With AI 37:18 The Replacement Ladder Framework 37:39 Admin First Then Delivery Then Marketing 39:09 Why Marketing Is the Biggest AI Category 39:32 Why You Should Keep Sales for Yourself 40:00 Level 5 Leadership and AI Agents 41:41 What a Fully AI Systems Business Looks Like 43:13 The Gym Owner With Three Locations 46:16 Shutting Down the Company for Two Days 46:37 Teaching the Whole Team to Code in Claude 49:28 Wayne the 62 Year Old Who Made $12K a Month 52:38 I Only Share What Actually Works 53:21 Whisper Flow and Talking to Your AI 56:41 Claude Chat Claude Coworker and Claude Code 57:57 The Claude Browser Extension 58:49 Claude Code Is Not Just for Developers 1:00:06 How to Migrate Your AI Memory Across Tools 1:01:08 Level 3 $1M to $10M and the Brand Play 1:02:05 Nobody Buys AI They Buy Trust 1:03:25 Brand Is Association and Association Is Trust 1:05:12 A Million Followers Is $10M in Activated Revenue 1:07:03 How to Keep AI From Becoming Slop 1:07:42 Human in the Loop 1:08:16 The 10 80 10 Rule and Why AI Is Now the 80 1:10:01 The Team FIRED Themselves 1:11:45 Dan's Free AI Curriculum for Your Team

Grant

114,770 views • 1 day ago

Today, we're excited to announce the launch of ⚔️Model Kombat 🥷 What: Coding LLMs go head-to-head on real programming tasks. Who: Developers vote on which solution they'd ship. These votes become training data for better models. Why: Benchmarks should reflect reality. Here's why this changes everything 👇

HackerRank

31,078 views • 9 months ago

I'm excited to share that we've built the world's most capable AI software engineer, achieving 30.08% on SWE-Bench – ahead of Amazon and Cognition. This model is so much more than a benchmark score: it was trained from the start to think and behave like a human SWE.

Alistair

820,015 views • 1 year ago