Interactive Reasoning Benchmarks are the next step in frontier evaluations.

Hear Greg Kamradt share why measuring human-like intelligence requires multi-turn environments, including a sneak peek of ARC-AGI-3.

Want to help us build interactive evaluations? We're hiring.

ARC Prize

21,395 subscribers

26,218 views • 1 year ago • via X (Twitter)

Science and Technology, Education

8 comments

ARC Prize • 1 year ago

Calling Python Game Developers to help us create fun and challenging mini-games. This is a contract position for a remote game development role.

Required Skills:
* Strong Python
* 2 years of game development experience

Email [email protected] with your portfolio

ARC Prize • 1 year ago

This presentation was originally given at @aiDotEngineer on June 5, 2025. Slides:

UserInterface • 2 years ago

Unveiling the Future of Prompt Engineering for Better AI Interactions #tech

Alex Zhang • 1 year ago

Not entirely the same since we're not crafting tasks, but we (me + @OfirPress) are also interested in benchmarking the progress of a single agent / model across multiple games.

In the video it's mentioned that we want to avoid data leakage (e.g. in Pokemon) and that this is a factor in why Gemini Plays Pokemon succeeds. This is probably true (although it's hard to prove rigorously), but arguably it is not the primary issue here. I wouldn't be surprised if you hand-crafted a fake version of Pokemon Blue and the Gemini Plays Pokemon scaffold was able to solve it. I'd wager that the reason Gemini Plays Pokemon finishes the game while Claude Plays Pokemon gets stuck has less to do with Gemini > Claude or more data leakage, and more to do with the design of their scaffolds. We also see this in our VideoGameBench paper, where minimizing the available scaffold leads to frequent "stuck" behavior regardless of which frontier VLM you use.

Super excited about this effort though, and perhaps deploying similar agents on this new game benchmark and VideoGameBench will give us more perspective on where we are with embodied agents :)

shawn swyx wang • 1 year ago

@GregKamradt ominous

Chris • 1 year ago

@GregKamradt So exciting how you guys are already on ARC AGI 3. Do you think that will be the last one before we hit AGI? 👀

vmal • 1 year ago

@GregKamradt why arc agi has it wrong

Yehyun • 1 year ago

@GregKamradt This is a needed benchmark, since it will also represent how well these systems track their long-term memory.

Related videos

ARC-AGI-3 Preview Event Recap

Greg Kamradt steps through our Interactive Reasoning Benchmark thesis:
* Why static benchmarks fall short measuring agentic capabilities
* The ARC Prize approach to creating interactive benchmarks

ARC Prize

22,248 views • 10 months ago

ARC-AGI is redefining how to measure progress on the path to AGI - focusing on reasoning, generalization, and adaptability instead of memorization or scale. At NeurIPS 2025, YC's Diana sat down with ARC Prize President Greg Kamradt to find out why most AI benchmarks fail, how ARC-AGI reveals the limits of today’s models, and why measuring intelligence may be harder than building it.

00:11 - What ARC Prize is and why it exists
00:38 - François Chollet’s definition of AGI
01:48 - What ARC-AGI Actually Tests
02:25 - When LLMs Failed the ARC Benchmark
03:38 - ARC-AGI Becomes the Standard
04:49 - False Positives in AI Progress
06:06 - The Evolution of ARC-AGI
08:55 - Measuring Intelligence beyond just accuracy
10:25 - What happens if a model solves ARC-AGI?

Y Combinator

98,369 views • 6 months ago

ARC-AGI-3 Launch Presentation

Greg Kamradt launches ARC-AGI-3 on March 25, 2026 in San Francisco.

Including:
- ARC-AGI as a signal for external progress
- ARC-AGI-3 Game design process
- Announcement of ARC Prize 2026 Competition

ARC Prize

20,097 views • 2 months ago

CEO Dr. Ben Goertzel discusses AGI benchmarks and explains why a system passing François Chollet's Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI) wouldn’t necessarily qualify as human-level AGI.

SingularityNET

26,943 views • 1 year ago

ARC Prize President Greg Kamradt says OpenAI's o3 model has surpassed the human threshold on the ARC-AGI benchmark - scoring 87.5 where the human threshold is 85 in a "major milestone" that is "new territory" in the ARC-AGI world

Tsarathustra

71,466 views • 1 year ago

Hiring RL Engineer! Started off as a curious project at Lossfunk to push the boundaries of LLMs in social reasoning - we are now building RL environments, data, and benchmarks to simulate more real-world scenarios. If you want to train SoTA RL models over multi-GPUs (H200s/B200s) to unlock next AI frontier, this is for you.

Satpal Singh Rathore

45,915 views • 10 months ago

🚀 Introducing Community Benchmarks on Kaggle! As AI evolves at an unprecedented pace, measuring intelligence requires more than a few AI research labs alone – it requires the imagination and collective expertise of the global community. That’s why we’re launching Community Benchmarks. It lets you build, run, and share custom AI benchmarks that are evaluated on the leading AI models with transparent, reproducible results. Learn more:

Kaggle

54,447 views • 5 months ago

ICYMI Cowboy Space Corp. is building a power grid in outer space for artificial intelligence. Join us to help build the future. We're hiring!

Baiju Bhatt

78,110 views • 1 month ago

François Chollet on the ARC Prize and how we get to AGI. At AI Startup School in San Francisco.

00:00 - The Falling Cost of Compute
00:57 - Deep-Learning’s Scaling Era & Benchmarks
01:59 - The ARC Benchmark
03:02 - The 2024 Shift to Test-Time Adaptation
05:01 - What Is Intelligence?
07:12 - Why Benchmarks Matter (and Mislead)
08:57 - ARC 1 Exposes Scaling Limits
10:58 - ARC 2: Compositional Reasoning Arrives
12:55 - Humans vs. Models on ARC2
14:58 - Previewing ARC3 & Interactive Agency
17:00 - Kaleidoscopic Hypothesis and Abstractions
22:00 - Type 1 vs. Type 2 Abstractions
26:00 - Discrete Program Search & Inventive AI
29:00 - Fusing Intuition with Symbolic Reasoning
32:00 - Building AGI Through Meta-Learning Systems

Y Combinator

231,696 views • 11 months ago

This release is fucking huge. It's one of the biggest updates to LMArena this year! Code Arena is our next generation of coding evaluations, beginning with web development tasks. Here you can use models to build interactive websites and share them with your friends. The links are persistent, so you can e.g. build a game and play it whenever you want. Here watch two models -- Claude Haiku and Grok-Code-Fast -- compete to build a galaxy. In this case, I liked the "star-wars" effect of Grok!

Anastasios Nikolas Angelopoulos

38,021 views • 7 months ago

Introducing Odyssey-2 Pro—a frontier world model that generates long-running, interactive simulations in 720p! We're also launching the first world model API, to enable devs to build magical apps. We're now in the GPT-2 era of world models. Let the explosion of apps commence!

Odyssey

165,905 views • 4 months ago

François Chollet has spent years asking a different question than most of the AI world. Instead of scaling what already works, he’s trying to understand what intelligence actually is and how to build it from first principles. In this episode of the Lightcone Podcast, he traces that path from his early work on deep learning to the creation of the ARC Prize, and the launch of ARC V3, a new benchmark designed to measure something deeper than performance: the ability to learn, adapt, and reason efficiently in entirely new environments. He explains why today’s systems may be hitting limits, what recent breakthroughs really mean, and why reaching true general intelligence may require a fundamentally different approach.

00:00 - AGI by 2030?
00:31 - Introducing Ndea: A New Path Beyond Deep Learning
01:08 - A New ML Paradigm
01:30 - Replacing neural nets with compact symbolic programs
03:04 - Why Ndea Isn’t Competing With Coding Agents
05:20 - Why Everyone Might Be Wrong About Scaling LLMs
07:22 - Why Coding Agents Suddenly Work So Well
08:50 - The Limits of LLMs in Non-Verifiable Domains
10:48 - What AGI Actually Means (And Why Most Definitions Are Wrong)
13:30 - Why Deep Learning Hits a Wall
14:00 - ARC’s Origin Story
18:20 - ARC Benchmarks Explained: From V1 to V3
22:49 - The RL Loop Powering Coding Agents Today
27:03 - ARC-AGI V3: Measuring “Agentic Intelligence”
31:14 - Inside the ARC Game Studio
35:31 - Could AGI Fit in 10,000 Lines of Code?
44:01 - Building Ndea: From Idea to Compounding Research Stack
46:46 - The Future of ARC: Benchmarks That Evolve With AI
47:21 - Why There’s Still Huge Opportunity for New AI Paradigms
53:37 - How to Build a Breakout Open Source Project - Lessons From Keras
56:39 - Advice For How To Think About AI

Y Combinator

151,054 views • 2 months ago

Demis Hassabis says AGI is a scientific goal worth pursuing -- a testable frontier of general intelligence. Kate Crawford pushes back: the industry is already optimized for AGI. What it lacks is alignment with goals that serve people and the planet. "When do we build AI that's actually good for everyone? These are the real benchmarks."

vitrupo

29,243 views • 1 year ago

Demis Hassabis: "I think there is a need for some great philosophers. Where are they? The great next philosophers... I think we're going to need that to help navigate society to that next step. AGI and ASI is going to change humanity and the human condition."

Smoke-away

122,048 views • 1 year ago

With the unchecked race to build smarter-than-human AI intensifying, humanity is on track to almost certainly lose control. In "Keep The Future Human", FLI Executive Director Anthony Aguirre explains why we must close the 'gates' to AGI - and instead develop beneficial, safe Tool AI built to serve us, not replace us. We're at a crossroads: continue down this dangerous path, or choose a future where AI enhances human potential, rather than threatening it. 🔗 Read Anthony's full "Keep The Future Human" essay - or explore the interactive summary - at the link in the replies:

Future of Life Institute

32,982 views • 1 year ago

Google DeepMind CEO Demis Hassabis joins Logan Kilpatrick on Release Notes to chat about the momentum of AI development, world models like Genie 3, improved model evaluations through Game Arena with @Kaggle, and the quest for AGI.

Google AI Developers

20,468 views • 10 months ago

Evaluating robot policies is hard. Every lab has a different robot; reproducible evaluations are really challenging. This makes it hard for us to know which methods for learning robot policies are likely to perform the best in real-world scenarios. Taking a page from LLM evaluations like Chatbot Arena, RoboArena aims to address this problem through crowdsourcing evaluations with a network of different evaluators. Watch Episode #34 of RoboPapers, hosted by Michael Cho - Rbt/Acc and Chris Paxton, now to learn more from authors Pranav Atreya and Karl Pertsch!

RoboPapers

15,056 views • 8 months ago

Today we are announcing Genie 3, a general purpose world model by Google DeepMind that can generate dynamic, interactive environments with a single text prompt. World models are AI that understand facets of the world (like Veo's knowledge of intuitive physics or Genie's mastery of new environments), and serve as a key stepping stone on the path to AGI. Genie 3 is our first world model to allow interaction in real-time, while also improving consistency and realism compared to Genie 2. Learn more ➡️

Google AI

102,524 views • 10 months ago

Gemini 3 Flash can analyze high-fidelity images and use multimodal reasoning to determine next steps. See how the model understands complex visuals and generates layers of interactive elements.

Google AI Developers

47,280 views • 5 months ago