Phoenix Wright: Ace Attorney is a popular visual novel known for its complex storytelling and courtroom drama. Like a detective novel, it challenges players to connect clues and evidence, expose contradictions, and reveal the true culprit. In our setup, models are tested on the intense cross-examination stage. It...
29,983 views • 1 year ago • via X (Twitter)
7 comments

When Ilya Sutskever once explained why next-word prediction leads to intelligence, he used an analogy: if you can piece together the clues and deduce the criminal’s name on the last page, you have a real understanding of the story. 🕵️‍♂️ Inspired by that idea, we turned to Ace Attorney to test AI reasoning. It’s the perfect stage: the AI plays detective, collecting clues, exposing contradictions, and uncovering the truth. We put the latest top AI models (GPT-4.1, Gemini 2.5 Pro, Llama-4 Maverick, and more) to the test in Ace Attorney to see if they could shout Objection! ⚖️, turn the case around, and uncover the truth behind the lies.

🔍 Interesting Findings: We tested 4 top multimodal AI models: O1, Gemini 2.5 Pro, Claude 3.7-thinking, and Llama-4 Maverick. 1. O1 and Gemini 2.5 Pro performed best, both reaching Level 4 🏅. Neither managed to clear that level, but O1 had a slight edge over Gemini 2.5 Pro on the toughest cases. 2. GPT-4.1 performed similarly to Claude 3.5. Despite its reported gains over GPT-4o, on this task it is only on par with older models.

🧠 Task Analysis - Why It’s Hard: 1. Long-context Reasoning - Spot contradictions by cross-referencing prior dialogue and evidence. 2. Visual Understanding - Identify the exact image that disproves a false claim, which requires precise grounding. 3. Strategic Decision-Making (Game Design) - Decide when to press, present evidence, or hold back. It’s not just about answers, but about making the right move in a dynamic, evolving case. Thoughts: Game design pushes AI beyond pure textual and visual tasks by requiring it to convert understanding into context-aware actions. It is harder to overfit because success here demands reasoning over a context-aware action space, not just memorization.

When it comes to cost-efficiency, Gemini 2.5 Pro redefines value. ⚡️ With comparable performance, it’s 6 to 15 times cheaper than O1-2024-12-17, depending on the case. 💸 Gemini 2.5 Pro is even slightly cheaper than GPT-4.1 ($1.25 vs $2.00 per 1M input tokens). In our table of models that passed Level 1, O1 made the fewest API calls but still had the highest cost. The call count reflects strategy, not reasoning strength: models that dig deeper into testimony naturally trigger more requests. Beyond Level 1, as conversations get longer, O1’s cost skyrockets. 🚀 In Level 2, which is a really long case, O1 cost over $45.75, while Gemini 2.5 Pro handled it for $7.89. That’s a massive gap! 💸 Note: Gemini’s built-in token counting method counts every image as 258 tokens for the gemini-2.5-pro model, so actual costs may be slightly higher. O1’s output cost may also be underestimated due to variability in its hidden reasoning content.
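For readers who want to reproduce this kind of estimate, here is a minimal sketch of the arithmetic. It only uses the figures quoted above: the per-1M input token prices for Gemini 2.5 Pro and GPT-4.1, the flat 258 tokens per image for gemini-2.5-pro, and the two Level 2 totals. The function names and the token counts in the usage lines are hypothetical, and the sketch covers input cost only, since output prices and O1's rates are not given here.

```python
# Prices in USD per 1M input tokens, as quoted in the post.
INPUT_PRICE_PER_M = {"gemini-2.5-pro": 1.25, "gpt-4.1": 2.00}

# Flat per-image token count the post reports for gemini-2.5-pro.
IMAGE_TOKENS = 258


def estimate_input_cost(model: str, text_tokens: int, num_images: int = 0) -> float:
    """Estimate input cost in USD; every image is counted as IMAGE_TOKENS tokens."""
    total_tokens = text_tokens + num_images * IMAGE_TOKENS
    return total_tokens / 1_000_000 * INPUT_PRICE_PER_M[model]


def cost_ratio(expensive: float, cheap: float) -> float:
    """How many times more expensive the first run was than the second."""
    return expensive / cheap


# Hypothetical request: 200,000 text tokens plus 10 evidence images.
example_cost = estimate_input_cost("gemini-2.5-pro", 200_000, num_images=10)

# Level 2 totals from the post: O1 at $45.75 vs Gemini 2.5 Pro at $7.89.
level2_ratio = cost_ratio(45.75, 7.89)
print(f"Example input cost: ${example_cost:.4f}")
print(f"Level 2 cost ratio: {level2_ratio:.1f}x")
```

Because the flat image count is an approximation, an estimate like this is a lower bound on the real Gemini bill rather than an exact figure.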

We’re committed to building more transparent, robust, and innovative AI benchmarks and would love to hear your ideas. Drop your thoughts about games and evaluations below. We’re always open to new suggestions for advancing AI evaluation! 💡📊 Leaderboard: GitHub Repo: Official Website:


Isn't most of the game already in the training data?


