Interactive Reasoning Benchmarks are the next step in frontier evaluations. Hear Greg Kamradt share why measuring human-like intelligence requires multi-turn environments, including a sneak peek of ARC-AGI-3. Want to help us build interactive evaluations? We're hiring.
26,218 views • 1 year ago • via X (Twitter)
8 comments

Calling Python Game Developers to help us create fun and challenging mini-games. This is a contract position for a remote game development role.
Required Skills:
* Strong Python
* 2 years of game development experience
Email [email protected] with your portfolio

This presentation was originally given at @aiDotEngineer on June 5, 2025. Slides:

Unveiling the Future of Prompt Engineering for Better AI Interactions #tech

Not entirely the same, since we're not crafting tasks, but we (me + @OfirPress) are also interested in benchmarking the progress of a single agent / model across multiple games. The video mentions wanting to avoid data leakage (e.g. in Pokemon), and that leakage is a factor in why Gemini Plays Pokemon succeeds. This is probably true (although it's hard to prove rigorously), but arguably it is not the primary issue here. I wouldn't be surprised if you hand-crafted a fake version of Pokemon Blue and the Gemini Plays Pokemon scaffold was able to solve it. I'd wager that the reason Gemini Plays Pokemon finishes the game while Claude Plays Pokemon gets stuck has less to do with Gemini > Claude or more data leakage, and more to do with the design of their scaffolds. We also see this in our VideoGameBench paper, where minimizing the available scaffold leads to frequent "stuck" behavior regardless of which frontier VLM you use. Super excited about this effort though, and perhaps deploying similar agents on this new game benchmark and VideoGameBench will give us more perspective on where we are with embodied agents :)

@GregKamradt ominous

@GregKamradt So exciting how you guys are already on ARC AGI 3. Do you think that will be the last one before we hit AGI? 👀

@GregKamradt why arc agi has it wrong

@GregKamradt This is a needed benchmark, since it will also show how well these systems track their long-term memory



