Interactive Reasoning Benchmarks are the next step in frontier evaluations.

Hear Greg Kamradt share why measuring human-like intelligence requires multi-turn environments, including a sneak peek of ARC-AGI-3.

Want to help us build interactive evaluations? We're hiring.

26,218 views • 1 year ago • via X (Twitter)

8 comments

ARC Prize • 1 year ago

Calling Python game developers to help us create fun and challenging mini-games. This is a contract position for a remote game development role.

Required skills:
* Strong Python
* 2 years of game development experience

Email [email protected] with your portfolio.

ARC Prize • 1 year ago

This presentation was originally given at @aiDotEngineer on June 5, 2025. Slides:

UserInterface • 2 years ago

Unveiling the Future of Prompt Engineering for Better AI Interactions #tech

Alex Zhang • 1 year ago

Not entirely the same since we're not crafting tasks, but we (me + @OfirPress) are also interested in benchmarking the progress of a single agent / model across multiple games.

In the video it's mentioned that we want to avoid data leakage (e.g. in Pokemon) and that this is a factor in why Gemini Plays Pokemon succeeds. This is probably true (although it's hard to prove rigorously), but it is arguably not the primary issue here. I wouldn't be surprised if you hand-crafted a fake version of Pokemon Blue and the Gemini Plays Pokemon scaffold was able to solve it. I'd wager that the reason Gemini Plays Pokemon finishes the game while Claude Plays Pokemon gets stuck has less to do with Gemini > Claude or more data leakage, and more to do with the design of their scaffolds. We also see this in our VideoGameBench paper, where minimizing the available scaffold leads to frequent "stuck" behavior regardless of which frontier VLM you use.

Super excited about this effort though, and perhaps deploying similar agents on this new game benchmark and VideoGameBench will give us more perspective on where we are with embodied agents :)

shawn swyx wang • 1 year ago

@GregKamradt ominous

Chris • 1 year ago

@GregKamradt So exciting how you guys are already on ARC-AGI-3. Do you think that will be the last one before we hit AGI? 👀

vmal • 1 year ago

@GregKamradt why arc agi has it wrong

Yehyun • 1 year ago

@GregKamradt This is a needed benchmark, since it will also show how well these systems track their long-term memory.

Related videos

François Chollet has spent years asking a different question than most of the AI world. Instead of scaling what already works, he’s trying to understand what intelligence actually is and how to build it from first principles. In this episode of the Lightcone Podcast, he traces that path from his early work on deep learning to the creation of the ARC Prize, and the launch of ARC V3, a new benchmark designed to measure something deeper than performance: the ability to learn, adapt, and reason efficiently in entirely new environments. He explains why today’s systems may be hitting limits, what recent breakthroughs really mean, and why reaching true general intelligence may require a fundamentally different approach.

00:00 - AGI by 2030?
00:31 - Introducing Ndea: A New Path Beyond Deep Learning
01:08 - A New ML Paradigm
01:30 - Replacing neural nets with compact symbolic programs
03:04 - Why Ndea Isn’t Competing With Coding Agents
05:20 - Why Everyone Might Be Wrong About Scaling LLMs
07:22 - Why Coding Agents Suddenly Work So Well
08:50 - The Limits of LLMs in Non-Verifiable Domains
10:48 - What AGI Actually Means (And Why Most Definitions Are Wrong)
13:30 - Why Deep Learning Hits a Wall
14:00 - ARC’s Origin Story
18:20 - ARC Benchmarks Explained: From V1 to V3
22:49 - The RL Loop Powering Coding Agents Today
27:03 - ARC-AGI V3: Measuring “Agentic Intelligence”
31:14 - Inside the ARC Game Studio
35:31 - Could AGI Fit in 10,000 Lines of Code?
44:01 - Building Ndea: From Idea to Compounding Research Stack
46:46 - The Future of ARC: Benchmarks That Evolve With AI
47:21 - Why There’s Still Huge Opportunity for New AI Paradigms
53:37 - How to Build a Breakout Open Source Project - Lessons From Keras
56:39 - Advice For How To Think About AI

Y Combinator

150,846 views • 2 months ago