Video yükleniyor...

Video Yüklenemedi

Ana Sayfaya Dön

Robotics keeps hitting the same wall. Single task RL works, but... it does not scale to hundreds of tasks or new embodiments. This new paper looks like a real step toward fixing that. The team introduces MMBench, a benchmark with 200 tasks across many domains and robots, and Newt,...

70,090 görüntüleme • 7 ay önce •via X (Twitter)

0 Yorum

Yorum bulunmuyor

Orijinal gönderinin yorumları burada görünecek

Benzer Videolar

Don't train the model, evolve the harness. I read a brilliant blog post from Hugging Face where they took a frozen open model scoring 0% on a hard legal agent benchmark, left its weights alone, and let an automated loop rewrite only the code around it. That code layer is the harness, the runtime wrapper that feeds the model context, runs its tool calls, and decides when a run ends. By the time the loop finished, the system had essentially matched Sonnet 4.6 on the benchmark's headline metric, at roughly 7x lower cost per task. Zero weights changed. The gain existed because of where the model was failing. The judge only grades files saved in the right place under the exact requested filename, and the model kept doing the legal analysis correctly, then saving it under the wrong name, dropping it in a scratch folder, or never writing it at all. So the 0% was never measuring legal reasoning. It was measuring the harness. Hand-tuning that layer is slow and model-specific, so they automated it. A Claude proposer adds exactly one mechanism per iteration, and an outer loop keeps it only if it clearly beats the current best, so accepted mechanisms compound. What the loop discovered says a lot about where agents actually fail. → The biggest single gain was file handling, not intelligence. An automatic step that lands the deliverable exactly where the judge expects it beat every prompt change, with zero extra model tokens. → Code fixes transferred across models, prompt playbooks did not. The same harness lifted a smaller model from the same family by 14 points, but the tuned prompts hurt a different model family on tasks it could already finish. → The harness mattered more than anything else. Same model, same judge, same tasks, and five different harnesses scored anywhere between 3.5% and 80.1%. The gains do eventually flatten, and the remaining misses look like real capability gaps. At some point the wrapper runs out of tricks and the model has to carry the work. But the lesson holds. A benchmark score measures the model and its harness together, and until the harness is fixed, it's impossible to know which one failed. I highly recommend reading this: I also wrote a deep dive on agent harness engineering a while back, covering the orchestration loop, tools, memory, context management, and everything that turns a stateless LLM into a capable agent. The article is quoted below.

Akshay 🚀

227,260 görüntüleme • 2 gün önce