New Benchtalks with John Yang: on ProgramBench (0% frontier models at...

Uploaded: 2026-06-03T18:41:24.000Z
Duration: PT3253.458S
Channel: vincent sunn chen

22:17

Terminal-Bench 2.0 went from ~25% → 80% in four months and became the standard eval for frontier CLI agents. Now, TB3 is in the works. I talked to Alex Shaw about what happens when model capabilities climb faster than we can measure them. His answer: the benchmark factory (Harbor Framework), infrastructure to develop hard, representative evals at the pace that the frontier moves. As Alex put it: "we need a thousand times more benchmarks than we have right now."
00:23 - How quickly models hill-climbed TB2
01:46 - What rapid progress reveals about benchmarks vs. real-world capability
03:28 - What made Terminal-Bench stick
04:58 - Why the terminal is the right abstraction for agentic AI
07:14 - How TB2 maintains task quality at scale
09:23 - Managing benchmark integrity in a benchmaxxing world
10:47 - Harbor: from experiment to benchmark factory
12:19 - What Harbor does that nothing else did
14:37 - The invariants: what won't change as agent evals evolve
16:55 - The benchmark Alex most wants to see built
18:18 - The ideal human-in-the-loop task creation flywheel
20:32 - How to contribute to Terminal-Bench 3.0

vincent sunn chen

11,500 views • 2 months ago

0:30

GLM 5.2 BEATS GPT-5.5 ON CODING PERFORMANCE
The model is reported to outperform GPT-5.5 on multiple coding benchmarks while remaining available on a free tier for developers to test and use. Benchmark results show higher scores on SWE-bench Pro at 62.1 compared to 58.6, terminal benchmarks at 81 compared to 74.4, and Frontier-SWE at 74.4, alongside a 1M token context window, open weights, and lower operational cost. Users can access it through free setup routes using API-based tools or run it in production through platforms that allow model selection and long-context coding workflows. The main implication is that high-performance coding models are becoming more accessible, reducing dependency on paid frontier models for many development tasks.

0xMarioNawfal

46,344 views • 11 hours ago

1:05:24

Are AI benchmarks doomed? Greg Burnham and Tom Adamczewski join Anson Ho to push back on benchmark pessimism and dig into what the next generation of AI benchmarks could look like.
(0:00:00) - Preview
(0:00:36) - Intro: Are AI benchmarks doomed?
(0:03:13) - The costs and benefits of benchmark development
(0:11:48) - MirrorCode and scalable benchmarks
(0:20:57) - AI speed-up in benchmark development
(0:23:28) - The benchmark-reality gap
(0:38:26) - Can an AGI benchmark exist?
(0:43:18) - Beyond automated scoring
(1:00:45) - How AI changes benchmark building in practice

Epoch AI

22,037 views • 1 month ago

1:00:27

Get the inside story on the development of Gemini's coding capabilities. Listen as the product and research leads for Gemini share their philosophy on what makes a great coding model, the impact of "vibe coding," and the future of programming languages with Logan Kilpatrick, Connie Fan and Danny Tarlow.
Timecodes:
0:00 Intro
1:10 Defining Early Coding Goals
6:23 Ingredients of a Great Coding Model
9:28 Adapting to Developer Workflows
11:40 The Rise of Vibe Coding
14:43 Code as a Reasoning Tool
17:20 Code as a Universal Solver
20:47 Evaluating Coding Models
24:30 Leveraging Internal Googler Feedback
26:52 Winning Over AI Skeptics
28:04 Performance Across Programming Languages
33:05 The Future of Programming Languages
36:16 Strategies for Large Codebases
41:06 Hill Climbing New Benchmarks
42:46 Short-Term Improvements
44:42 Model Style and Taste
47:43 2.5 Pro’s Breakthrough
51:06 Early AI Coding Experiences
56:19 Specialist vs. Generalist Models

Google AI Developers

65,474 views • 1 year ago

57:23

François Chollet has spent years asking a different question than most of the AI world. Instead of scaling what already works, he’s trying to understand what intelligence actually is and how to build it from first principles. In this episode of the Lightcone Podcast, he traces that path from his early work on deep learning to the creation of the ARC Prize, and the launch of ARC V3, a new benchmark designed to measure something deeper than performance: the ability to learn, adapt, and reason efficiently in entirely new environments. He explains why today’s systems may be hitting limits, what recent breakthroughs really mean, and why reaching true general intelligence may require a fundamentally different approach.
00:00 - AGI by 2030?
00:31 - Introducing Ndea: A New Path Beyond Deep Learning
01:08 - A New ML Paradigm
01:30 - Replacing neural nets with compact symbolic programs
03:04 - Why Ndea Isn’t Competing With Coding Agents
05:20 - Why Everyone Might Be Wrong About Scaling LLMs
07:22 - Why Coding Agents Suddenly Work So Well
08:50 - The Limits of LLMs in Non-Verifiable Domains
10:48 - What AGI Actually Means (And Why Most Definitions Are Wrong)
13:30 - Why Deep Learning Hits a Wall
14:00 - ARC’s Origin Story
18:20 - ARC Benchmarks Explained: From V1 to V3
22:49 - The RL Loop Powering Coding Agents Today
27:03 - ARC-AGI V3: Measuring “Agentic Intelligence”
31:14 - Inside the ARC Game Studio
35:31 - Could AGI Fit in 10,000 Lines of Code?
44:01 - Building Ndea: From Idea to Compounding Research Stack
46:46 - The Future of ARC: Benchmarks That Evolve With AI
47:21 - Why There’s Still Huge Opportunity for New AI Paradigms
53:37 - How to Build a Breakout Open Source Project - Lessons From Keras
56:39 - Advice For How To Think About AI

Y Combinator

151,054 views • 2 months ago

54:53

We are excited to share that Logan Kilpatrick joined us on The Bench to discuss Google's new Gemini 3.5 Flash: why it's deliberately more persistent and capable than previous Flash models, how it hit #1 on our FinanceAgent Benchmark taking 82 steps where competitors stopped at 13, and what justifies the price increase. We also get into why AI benchmarks need a paradigm shift, the trade-off of building everything vs staying focused, the Pope, and why Omni might kill the Subway Surfers content era.
0:11:00 – Flash is being rebased for the agent era, not just a cheaper model anymore
0:14:03 – Persistence by design: 82 tool calls vs competitors' 13
0:17:52 – Why pricing went up and how Google thinks about value per token
0:22:55 – Coding performance: from 20th to 10th place in one generation
0:28:28 – Why benchmarks have historically been misleading and what the new era of evaluation looks like
0:29:28 – Logan on why Google has the best researchers in the world
0:36:16 – The cost of being Google
0:39:07 – The Pope’s encyclical on AI and whether most people see frontier intelligence as a good thing
0:51:12 – Why Omni is the thing that recently clicked

Vals AI

16,345 views • 23 days ago

2:30

Traditional coding benchmarks do not reflect how software is actually built and maintained. That's why we built a new benchmark, APEX-SWE, in partnership with Cognition. It measures whether AI models can perform complex, real-world software engineering work to ship systems that work and debug them when they don't. OpenAI GPT 5.3 Codex (High) tops the leaderboard at 41.5% on Pass@1.

adarsh

207,839 views • 2 months ago

34:47

François Chollet on the ARC Prize and how we get to AGI. At AI Startup School in San Francisco.
00:00 - The Falling Cost of Compute
00:57 - Deep-Learning’s Scaling Era & Benchmarks
01:59 - The ARC Benchmark
03:02 - The 2024 Shift to Test-Time Adaptation
05:01 - What Is Intelligence?
07:12 - Why Benchmarks Matter (and Mislead)
08:57 - ARC 1 Exposes Scaling Limits
10:58 - ARC 2: Compositional Reasoning Arrives
12:55 - Humans vs. Models on ARC2
14:58 - Previewing ARC3 & Interactive Agency
17:00 - Kaleidoscopic Hypothesis and Abstractions
22:00 - Type 1 vs. Type 2 Abstractions
26:00 - Discrete Program Search & Inventive AI
29:00 - Fusing Intuition with Symbolic Reasoning
32:00 - Building AGI Through Meta-Learning Systems

Y Combinator

231,725 views • 11 months ago

11:59

ARC-AGI is redefining how to measure progress on the path to AGI - focusing on reasoning, generalization, and adaptability instead of memorization or scale. At NeurIPS 2025, YC's Diana sat down with ARC Prize President Greg Kamradt to find out why most AI benchmarks fail, how ARC-AGI reveals the limits of today’s models, and why measuring intelligence may be harder than building it.
00:11 - What ARC Prize is and why it exists
00:38 - François Chollet’s definition of AGI
01:48 - What ARC-AGI Actually Tests
02:25 - When LLMs Failed the ARC Benchmark
03:38 - ARC-AGI Becomes the Standard
04:49 - False Positives in AI Progress
06:06 - The Evolution of ARC-AGI
08:55 - Measuring Intelligence beyond just accuracy
10:25 - What happens if a model solves ARC-AGI?

Y Combinator

98,369 views • 6 months ago

0:33

Introducing Arena Mode in Windsurf: One prompt. Two models. Your vote. Benchmarks don't reflect real-world coding quality. The best model for you depends on your codebase and stack. So we made real-world coding the benchmark. Free for the next week. May the best model win.

Devin Desktop

1,061,420 views • 4 months ago

2:10:59

How does math research change when the cost of trying your first dumb idea goes to zero? Daniel Litt joins Greg Burnham and Anson Ho to discuss what today’s models can and can’t do in math, and how far they are from doing high-quality research.
0:00:00 What's the hardest math problem AI can solve today?
00:16:08 How helpful are today’s AI models for math research?
00:23:36 Junk papers, LLM-generated proofs, and the refereeing crisis
00:27:21 AI enables searching through problems at scale
00:33:49 When will AI be good enough to publish in top math journals?
00:42:15 What are the returns to intelligence?
00:59:50 Will AI solve Millennium problems?
01:11:54 Is math full of low-hanging fruit?
01:18:47 How Daniel has adapted his professional life to AI progress
01:25:28 What do AI math benchmarks actually measure?
01:33:05 Designing the Open Problems benchmark
01:56:35 Do mathematicians believe heuristic arguments about conjectures?
02:01:24 What if FrontierMath: Open Problems gets solved?
02:06:53 Is AI on the cusp of accelerating math progress?

Epoch AI

178,703 views • 4 months ago

1:16:04

How GPT-5 thinks, with OpenAI VP of Research Jerry Tworek
00:00 - Intro
01:01 - What Reasoning Actually Means in AI
02:32 - Chain of Thought: Models Thinking in Words
05:25 - How Models Decide How Long to Think
07:24 - Evolution from o1 to o3 to GPT-5
11:00 - The Road to OpenAI: Growing up in Poland, Dropping out of School, Trading
20:32 - Working on Robotics and Rubik's Cube Solving
23:02 - A Day in the Life: Talking to Researchers
24:06 - How Research Priorities Are Determined
26:53 - OpenAI's Culture of Transparency
29:32 - Balancing Research with Shipping Fast
31:52 - Using OpenAI's Own Tools Daily
32:43 - Pre-Training Plus RL: The Modern AI Stack
35:10 - Reinforcement Learning 101: Training Dogs
40:17 - The Evolution of Deep Reinforcement Learning
42:09 - When GPT-4 Seemed Underwhelming at First
45:39 - How RLHF Made GPT-4 Actually Useful
48:02 - Unsupervised vs Supervised Learning
49:59 - GRPO and How DeepSeek Accelerated US Research
53:05 - What It Takes to Scale Reinforcement Learning
55:36 - Agentic AI and Long-Horizon Thinking
59:19 - Alignment as an RL Problem
1:01:11 - Winning ICPC World Finals Without Specific Training
1:05:53 - Applying RL Beyond Math and Coding
1:09:15 - The Path from Here to AGI
1:12:23 - Pure RL vs Language Models

Matt Turck

451,229 views • 8 months ago

49:06

Why AI Can Now Make Discoveries - my conversation with Dan Roberts, Lead of the Foundations of Reinforcement Learning team at OpenAI
00:00 Intro: AI's wild week in mathematics
01:21 What OpenAI's Foundations of RL team does
03:08 Dan's journey: from black holes and quantum gravity to frontier AI
07:04 Are AI systems becoming useful for real science
08:21 The AI math moment: Erdős, OpenAI, DeepMind, and Anthropic
08:52 Why the OpenAI result was an act of exploration
10:25 OpenAI vs. DeepMind: informal reasoning vs. formal proof
12:13 RL 101: learning by doing, not just watching
15:10 Why reinforcement learning works
15:58 How RL breaks: sparse feedback and long-horizon tasks
17:03 RLHF: how human feedback shaped early language models
18:48 Move 37, self-play, and the search for novel strategies
22:16 Explore vs. exploit in scientific discovery
24:49 Why RL may now be "the cake," not the cherry on top
25:46 Why RL started working with large language models
27:29 Is RL "sucking supervision through a straw"?
28:47 Why language may be the grounding layer for intelligence
31:46 A contrarian take on the Bitter Lesson
32:41 What test-time compute actually is
34:50 How RL gives models the ability to think
35:40 Verifiable rewards, math, coding, and the messy real world
38:00 What physics can teach us about AI
42:08 Is there a thermodynamics of AI?
43:08 From Erdős problems to Einstein-level AI
45:16 Is AI already doing original science?
45:51 How far are we from AI automating AI research
47:41 Why Dan is excited about the future of science

Matt Turck

63,471 views • 16 days ago

51:35

My interview with Amazon Web Services CEO Matt Garman.
0:00 Intro
0:57 White Collar Jobs
8:51 How much of AWS's code is written by AI?
12:54 How to have a career in the AI era
15:43 AI Bottlenecks
18:06 Inference vs Training Growth
20:05 AWS Custom Silicon
25:50 Annapurna Acquisition
27:53 AI Models
33:35 Open vs Closed Models
41:28 Benchmarks
47:13 Agents

Matthew Berman

49,126 views • 10 months ago

0:23

As detailed in the Meta Movie Gen technical report, today we’re open sourcing Movie Gen Bench: two new media generation benchmarks that we hope will help to enable the AI research community to progress work on more capable audio and video generation models. Movie Gen Video Bench is the largest and most comprehensive benchmark ever released for evaluating text-to-video generation. It includes a collection of 1,000+ prompts that cover concepts ranging from detailed human activity to animals, physics, unusual subjects and more — with broad coverage across different motion levels. Movie Gen Audio Bench is a first-of-its-kind benchmark aimed at evaluating video-to-audio and (text+video)-to-audio generation. It includes 527 generated videos and associated sound effects and music prompts covering a diverse set of ambient environments and sound effects. To enable fair and easy comparison to our models for future works, these new benchmarks include non cherry-picked generated videos and audio from Movie Gen. In releasing these new benchmarks we hope to promote fair & extensive evaluations in media generation research to enable greater progress in this field.

AI at Meta

156,240 views • 1 year ago

0:23

Andon Labs' Real-World AI Evals: Claude calls the FBI, AI CEOs, price cartels, Butter-Bench, & Luna
Andon Labs cofounders Lukas Petersson and Axel Backlund explain why dollar-denominated evals reveal what traditional benchmarks miss, how Claude ended up reporting a $2/day vending machine fee to the FBI, why long-horizon agents spiral in weird ways, what happens when agents lie, form price cartels, and compete with each other, and why the future of AI safety may depend on testing models in messy real-world environments instead of clean benchmark sandboxes.

Latent.Space

12,957 views • 16 days ago

1:10:03

Sonnet 4.5 & the AI Plateau Myth: epic conversation with Sholto Douglas of Anthropic
0:00 - Intro
1:09 - What's Behind The Rapid Pace of AI Releases at Anthropic
2:49 - Opus, Sonnet, and Haiku Model Tiers
4:14 - Sholto's Story: From Australian Fencer to AI Researcher
12:01 - The YouTube Effect: Mastery Through Observation
16:16 - Breaking Into AI Research Without Traditional Academic Signals
18:29 - DeepMind, Gemini, and Building Inference Stacks
23:05 - Why Anthropic? Culture and Mission Differences Amongst AI Research Labs
25:08 - What Is "Taste" in AI Research?
31:46 - This Week's Big Launch: Sonnet 4.5, Best Coding Model in the World
36:40 - From 7 Hours to 30 Hours: The Long-Running AI Agent Breakthrough
38:41 - How AI Agents Self-Correct and Maintain Coherence
43:13 - The Role of Memory in Extended Coding Sessions
47:42 - Pre-Training vs. RL: Textbooks vs. Worked Problems
52:11 - Test-Time Compute & Reinforcement Learning
55:55 - Why RL Finally Started Working on LLMs in 2024
59:38 - The Path to AGI
1:02:05 - Are We Hitting a Plateau in AI? So Many Low Hanging Fruits
1:03:41 - Beyond Coding: GDPVal & Impact Economic Sectors
1:05:47 - Preparing for 10-100x Individual Leverage & The Upcoming Robotics Explosion

Matt Turck

93,678 views • 8 months ago

1:05:25

Thanksgiving-week treat: an epic conversation on Frontier AI with Lukasz Kaiser, co-author of “Attention Is All You Need” (Transformers) and leading research scientist at OpenAI working on GPT-5.1-era reasoning models.
00:00 – Cold open and intro
01:29 – “AI slowdown” vs a wild week of new frontier models
08:03 – Low-hanging fruit, infra, RL training and better data
11:39 – What is a reasoning model, in plain language
17:02 – Chain-of-thought and training the thinking process with RL
21:39 – Łukasz’s path: from logic and France to Google and Kurzweil
24:20 – Inside the Transformer story and what “attention” really means
28:42 – From Google Brain to OpenAI: culture, scale and GPUs
32:49 – What’s next for pre-training, GPUs and distillation
37:29 – Can we still understand these models? Circuits, sparsity and black boxes
39:42 – GPT-4 → GPT-5 → GPT-5.1: what actually changed
42:40 – Post-training, safety and teaching GPT-5.1 different tones
46:16 – How long should GPT-5.1 think? Reasoning tokens and jagged abilities
47:43 – The five-year-old’s dot puzzle that still breaks frontier models
52:22 – Generalization, child-like learning and whether reasoning is enough
53:48 – Beyond Transformers: ARC, LeCun’s ideas and multimodal bottlenecks
56:10 – GPT-5.1 Codex Max, long-running agents and compaction
1:00:06 – Will foundation models eat most apps? The translation analogy and trust
1:02:34 – What still needs to be solved, and where AI might go next

Matt Turck

167,926 views • 6 months ago

0:55

Today, we're excited to announce the launch of ⚔️Model Kombat 🥷 What: Coding LLMs go head-to-head on real programming tasks. Who: Developers vote on which solution they'd ship. These votes become training data for better models. Why: Benchmarks should reflect reality. Here's why this changes everything 👇

HackerRank

31,078 views • 9 months ago

1:53

Databricks is excited to partner with OpenAI on GPT-5.5, their latest frontier model. GPT-5.5 will be available in Unity AI Gateway on launch. You can use it with coding tools such as Codex, or to power your enterprise agents. GPT-5.5 is state-of-the-art on many benchmarks including OfficeQA Pro, our benchmark for evaluating grounded reasoning on enterprise tasks. We are partnering with OpenAI to co-launch on Databricks. Hear more from our co-founder Patrick Wendell and OpenAI CRO Denise Holland Dresser on GPT-5.5 in Databricks.

Databricks

12,668 views • 1 month ago
