When Ilya Sutskever once explained why next-word prediction leads to intelligence, he offered an analogy: if you can piece together the clues and deduce the criminal’s name on the last page, you have a real understanding of the story. 🕵️‍♂️ Inspired by that idea, we turned to Ace Attorney...

999,231 views • 1 year ago • via X (Twitter)

12 comments

Hao AI Lab • 1 year ago

Phoenix Wright: Ace Attorney is a popular visual novel known for its complex storytelling and courtroom drama. Like a detective novel, it challenges players to connect clues and evidence to expose contradictions and reveal the true culprit. In our setup, models are tested on the intense cross-examination stage. They must spot contradictions and present the correct evidence to challenge witness testimony. Each level grants 5 lives, so only a limited number of mistakes is tolerated.
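
The lives mechanic described above can be sketched roughly as follows. This is a simplified illustration, not the lab's actual harness: `Statement`, `play_level`, and the `policy` callback are hypothetical names, and the sketch assumes a single pass through the testimony in which every wrongly presented piece of evidence costs one life.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Statement:
    """One line of witness testimony (hypothetical structure)."""
    text: str
    # Name of the evidence that disproves this statement, or None if it holds up.
    contradicted_by: Optional[str] = None


def play_level(
    testimony: Sequence[Statement],
    policy: Callable[[Statement], Optional[str]],
    lives: int = 5,
) -> Tuple[bool, int]:
    """Run one pass over the testimony and return (cleared, lives_left).

    The policy returns the name of the evidence to present, or None to hold back.
    Presenting evidence that does not contradict the statement costs one life.
    """
    exposed = 0
    for statement in testimony:
        choice = policy(statement)
        if choice is None:
            continue  # the model chose not to present anything here
        if choice == statement.contradicted_by:
            exposed += 1
        else:
            lives -= 1
            if lives == 0:
                break  # out of lives, the level ends
    total = sum(s.contradicted_by is not None for s in testimony)
    return exposed == total and lives > 0, lives
```

A real evaluation would replace `policy` with a model call that sees the full dialogue history and the evidence images, which is where the long-context and visual demands come in.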

Hao AI Lab • 1 year ago

🔍 Interesting Findings: We tested 4 top multimodal AI models: O1, Gemini 2.5 Pro, Claude 3.7-thinking, and Llama-4 Maverick.
1. O1 and Gemini 2.5 Pro performed the best, both reaching Level 4 🏅. Neither managed to crack it, but O1 had a slight edge over Gemini 2.5 Pro in tackling the toughest cases.
2. GPT-4.1 showed similar performance to Claude 3.5. Despite reported gains over GPT-4o, on this task it is only on par with older models.

Hao AI Lab • 1 year ago

🧠 Task Analysis: Why It’s Hard
1. Long-context Reasoning - Spot contradictions by cross-referencing prior dialogue and evidence.
2. Visual Understanding - Identify the exact image that disproves a false claim, with precise grounding.
3. Strategic Decision-Making (Game Design) - Decide when to press, present evidence, or hold back - it’s not just about answers, but about making the right move in a dynamic, evolving case.
Thoughts: Game design pushes AI beyond pure textual and visual tasks by requiring it to convert understanding into context-aware actions. It is harder to overfit because success here demands reasoning over a context-aware action space, not just memorization.

Hao AI Lab • 1 year ago

When it comes to cost-efficiency, Gemini 2.5 Pro redefines value. ⚡️ With comparable performance, it’s 6 to 15 times cheaper than O1-2024-12-17, depending on the case. 💸 Gemini 2.5 Pro is even slightly cheaper than GPT-4.1 ($1.25 vs $2.00 per 1M input tokens).
In our table of models that passed Level 1, O1 made the fewest API calls but still had the highest cost. The call count reflects strategy, not reasoning strength, as models that dig deeper into testimony naturally trigger more requests.
Beyond Level 1, as conversations get longer, O1’s cost skyrockets. 🚀 In Level 2, which is a really long case, O1 cost over $45.75, while Gemini 2.5 Pro handled it for $7.89. That’s a massive gap! 💸
Note: Gemini uses a built-in token counting method that treats every image as 258 tokens for the gemini-2.5-pro model, so actual costs may be slightly higher. O1’s output cost may also be underestimated due to variability in its hidden reasoning content.
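
The arithmetic behind these figures can be reproduced with a short sketch. The Level 2 costs, the per-1M-token input prices, and the 258-tokens-per-image figure come from the post above; the function names and the example token and image counts are made up for illustration, and output-token costs are left out.

```python
# Per-image token count the post reports for gemini-2.5-pro.
GEMINI_IMAGE_TOKENS = 258


def input_cost_usd(text_tokens, num_images, usd_per_million,
                   tokens_per_image=GEMINI_IMAGE_TOKENS):
    """Estimate input cost, counting every image as a fixed number of tokens."""
    total_tokens = text_tokens + num_images * tokens_per_image
    return total_tokens / 1_000_000 * usd_per_million


def cost_ratio(expensive_usd, cheap_usd):
    """How many times more expensive the first run was than the second."""
    return expensive_usd / cheap_usd


# Level 2 figures quoted in the post: O1 over $45.75, Gemini 2.5 Pro $7.89.
level2_ratio = cost_ratio(45.75, 7.89)

# Hypothetical request: 100,000 text tokens plus 20 images at $1.25 per 1M tokens.
example_cost = input_cost_usd(100_000, 20, usd_per_million=1.25)

print(f"Level 2 cost ratio: {level2_ratio:.1f}x")
print(f"Example input cost: ${example_cost:.4f}")
```

With the quoted Level 2 numbers the ratio comes out at about 5.8, near the low end of the 6 to 15 range; since O1's cost is stated as "over" $45.75, the true ratio may be somewhat higher.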

Hao AI Lab • 1 year ago

We’re committed to building more transparent, robust, and innovative AI benchmarks and would love to hear your ideas. Drop your thoughts about games and evaluations below; we’re always open to new suggestions for advancing AI evaluation! 💡📊 Leaderboard: GitHub Repo: Official Website:

Parroted Words • 1 year ago

If the fear of Yahweh is the beginning of wisdom, then what is its end? Cut through the abundant nonsense, empty platitudes, and conventional musings of the world and get straight to the heart of what it means to be wise.

Janek Mann • 1 year ago

That’s a fun benchmark! Would be interesting to RL-tune a model on it.

Hao AI Lab • 1 year ago

We are on it, stay tuned! 😃

Shailesh • 1 year ago

Is the code for the implementation available somewhere?

kfant • 1 year ago

code?

Seth Stafford • 1 year ago

Clever idea. But what I really need is an AI that can explain the plot of “The Big Sleep”. 😉

Hao AI Lab • 1 year ago

Interesting, curious to see what kinds of stories each model would come up with 🤔
