52,761 views • 1 month ago • via X (Twitter)

I designed a new test specifically for multimodal models: fill out a paper form. It's much harder than it sounds.

This isn't typing into an electronic field that captures your text. The form is just an image. The model has to place each form element (text, checkmarks) at the correct pixel position on the canvas itself.

Results:
🟢 Kimi K2.6 → done in 3:45, 16.7k output tokens
🟡 Step 3.7 Flash → half the fields, 57k output tokens
🔴 Gemini 3.5 Flash → 489k output tokens, never finished. I had to kill it.

Gemini burned ~29x more output tokens than Kimi on the exact same task, and Kimi's was the only form that actually looked filled out.

The test, a mock application form, contains some challenging parts, such as one-character-per-box fields.

I provided every model the same set of tools:
> get canvas size
> drop probe markers to find coordinates
> add text
> add checkmarks
> move elements
> take a screenshot anytime to check their own work
> ... etc

So it's vision + spatial reasoning + tool use + long context, all at once. Small models (Qwen, Gemma) can't really complete this test, so I skipped them.

What happened:
> Kimi nailed name, DOB, ID, gender, marital status, nationality, email, phone, address, postal code. Placement slightly loose, but content correct. 15 turns. Clean.
> Step got maybe half right: fields dropped, "United States" landed in the email line, data floating outside boxes. Burned 1.24M input tokens doing it (81 turns of re-reading the canvas).
> Gemini almost got there visually... then spiraled. By turn 40 it was issuing a delete_elements call wiping element IDs 365–425, basically erasing its own work. 31 minutes, 489k output tokens, still streaming. Terminated.

The takeaway isn't "Gemini bad." This test is indeed difficult. But token efficiency is capability now. A model that needs ~30x the output tokens and still can't converge is going to cost roughly 30x as much in production. Kimi K2.6 just quietly did the thing.

stevibe

25,304 views • 19 days ago
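
To make the one-character-per-box difficulty concrete, here is a minimal Python sketch of the arithmetic a model has to get right once it has located a boxed field. Everything in it is an assumption for illustration: the post does not describe the tool signatures, so `place_in_boxes`, `Placement`, and the parameters (`left`, `top`, `box_width`, `box_height`, `gap`) are hypothetical names, and the field geometry is assumed to come from something like the probe markers mentioned above.

```python
from dataclasses import dataclass


@dataclass
class Placement:
    """Hypothetical record: one character and the pixel centre to put it at."""
    char: str
    x: float
    y: float


def place_in_boxes(text, left, top, box_width, box_height, gap=0.0):
    """Return one centred placement per character for a row of equal boxes.

    Assumes the boxes are laid out left to right, all the same size, with a
    constant horizontal gap between them. (left, top) is the top-left corner
    of the first box in canvas pixels.
    """
    placements = []
    for i, ch in enumerate(text):
        # Horizontal centre of the i-th box.
        x = left + i * (box_width + gap) + box_width / 2
        # Vertical centre is the same for every box in the row.
        y = top + box_height / 2
        placements.append(Placement(ch, x, y))
    return placements


if __name__ == "__main__":
    # Made-up geometry: first box at (100, 50), boxes 20x30 px, 4 px apart.
    for p in place_in_boxes("AB12", left=100, top=50,
                            box_width=20, box_height=30, gap=4):
        print(p.char, p.x, p.y)
```

Each returned placement would then be passed to whatever "add text" tool the harness exposes. A small error in `left` or `gap` compounds across the row, which may be one reason this kind of field is hard to fill from vision alone.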