Interactive Reasoning Benchmarks are the next step in frontier evaluations.

Hear Greg Kamradt share why measuring human-like intelligence requires multi-turn environments, including a sneak peek of ARC-AGI-3.

Want to help us build interactive evaluations? We're hiring.

26,218 views • 1 year ago • via X (Twitter)

8 comments

ARC Prize • 1 year ago

Calling Python Game Developers to help us create fun and challenging mini-games. This is a contract position for a remote game development role.

Required Skills:
* Strong Python
* 2 years of game development experience

Email [email protected] with your portfolio

ARC Prize • 1 year ago

This presentation was originally given at @aiDotEngineer on June 5, 2025. Slides:

UserInterface • 2 years ago

Unveiling the Future of Prompt Engineering for Better AI Interactions #tech

Alex Zhang • 1 year ago

Not entirely the same since we're not crafting tasks, but we (me + @OfirPress) are also interested in benchmarking the progress of a single agent / model across multiple games.

In the video it's mentioned that we want to avoid data leakage (e.g. in Pokemon) and that this is a factor in why Gemini Plays Pokemon succeeds. This is probably true (although it's hard to prove rigorously), but arguably it is not the primary issue here. I wouldn't be surprised if you hand-crafted a fake version of Pokemon Blue and the Gemini Plays Pokemon scaffold was able to solve it. I'd wager that the reason Gemini Plays Pokemon finishes the game while Claude Plays Pokemon gets stuck has less to do with Gemini > Claude or more data leakage, and more to do with the design of their scaffolds. We also see this in our VideoGameBench paper, where minimizing the available scaffold leads to frequent "stuck" behavior regardless of which frontier VLM you use.

Super excited about this effort though, and perhaps deploying similar agents on this new game benchmark and VideoGameBench will give us more perspective on where we are with embodied agents :)

shawn swyx wang • 1 year ago

@GregKamradt ominous

Chris • 1 year ago

@GregKamradt So exciting how you guys are already on ARC AGI 3. Do you think that will be the last one before we hit AGI? 👀

vmal • 1 year ago

@GregKamradt why arc agi has it wrong

Yehyun • 1 year ago

@GregKamradt This is a needed benchmark, since it will also represent how well these systems track their long-term memory

Related videos

François Chollet has spent years asking a different question than most of the AI world. Instead of scaling what already works, he’s trying to understand what intelligence actually is and how to build it from first principles. In this episode of the Lightcone Podcast, he traces that path from his early work on deep learning to the creation of the ARC Prize, and the launch of ARC V3, a new benchmark designed to measure something deeper than performance: the ability to learn, adapt, and reason efficiently in entirely new environments. He explains why today’s systems may be hitting limits, what recent breakthroughs really mean, and why reaching true general intelligence may require a fundamentally different approach.

00:00 - AGI by 2030?
00:31 - Introducing Ndea: A New Path Beyond Deep Learning
01:08 - A New ML Paradigm
01:30 - Replacing neural nets with compact symbolic programs
03:04 - Why Ndea Isn’t Competing With Coding Agents
05:20 - Why Everyone Might Be Wrong About Scaling LLMs
07:22 - Why Coding Agents Suddenly Work So Well
08:50 - The Limits of LLMs in Non-Verifiable Domains
10:48 - What AGI Actually Means (And Why Most Definitions Are Wrong)
13:30 - Why Deep Learning Hits a Wall
14:00 - ARC’s Origin Story
18:20 - ARC Benchmarks Explained: From V1 to V3
22:49 - The RL Loop Powering Coding Agents Today
27:03 - ARC-AGI V3: Measuring “Agentic Intelligence”
31:14 - Inside the ARC Game Studio
35:31 - Could AGI Fit in 10,000 Lines of Code?
44:01 - Building Ndea: From Idea to Compounding Research Stack
46:46 - The Future of ARC: Benchmarks That Evolve With AI
47:21 - Why There’s Still Huge Opportunity for New AI Paradigms
53:37 - How to Build a Breakout Open Source Project - Lessons From Keras
56:39 - Advice For How To Think About AI

Y Combinator

151,054 views • 2 months ago