正在加载视频...

视频加载失败

Let me explain the agent loop, simple It's the core of every agentic system, and the part most people overcomplicate It's just this: 1. Send messages to the model 2. Model responds, maybe calls a tool 3. You run the tool 4. Append the result back to messages 5....

12,514 次观看 • 26 天前 •via X (Twitter)

0 条评论

暂无评论

原始帖子的评论将显示在这里

相关视频

Don't train the model, evolve the harness. I read a brilliant blog post from Hugging Face where they took a frozen open model scoring 0% on a hard legal agent benchmark, left its weights alone, and let an automated loop rewrite only the code around it. That code layer is the harness, the runtime wrapper that feeds the model context, runs its tool calls, and decides when a run ends. By the time the loop finished, the system had essentially matched Sonnet 4.6 on the benchmark's headline metric, at roughly 7x lower cost per task. Zero weights changed. The gain existed because of where the model was failing. The judge only grades files saved in the right place under the exact requested filename, and the model kept doing the legal analysis correctly, then saving it under the wrong name, dropping it in a scratch folder, or never writing it at all. So the 0% was never measuring legal reasoning. It was measuring the harness. Hand-tuning that layer is slow and model-specific, so they automated it. A Claude proposer adds exactly one mechanism per iteration, and an outer loop keeps it only if it clearly beats the current best, so accepted mechanisms compound. What the loop discovered says a lot about where agents actually fail. → The biggest single gain was file handling, not intelligence. An automatic step that lands the deliverable exactly where the judge expects it beat every prompt change, with zero extra model tokens. → Code fixes transferred across models, prompt playbooks did not. The same harness lifted a smaller model from the same family by 14 points, but the tuned prompts hurt a different model family on tasks it could already finish. → The harness mattered more than anything else. Same model, same judge, same tasks, and five different harnesses scored anywhere between 3.5% and 80.1%. The gains do eventually flatten, and the remaining misses look like real capability gaps. At some point the wrapper runs out of tricks and the model has to carry the work. But the lesson holds. A benchmark score measures the model and its harness together, and until the harness is fixed, it's impossible to know which one failed. I highly recommend reading this: I also wrote a deep dive on agent harness engineering a while back, covering the orchestration loop, tools, memory, context management, and everything that turns a stateless LLM into a capable agent. The article is quoted below.

Akshay 🚀

229,774 次观看 • 2 天前

The architecture of this new world model is one of the most interesting things I've seen lately: Let me first explain how most world models work: They predict and render one frame at a time. If you are navigating in one of these worlds, and you look left, the model draws whatever looks right in the moment. Every time you change your viewpoint, the model has to imagine what should be there again, so it's very common for these models to "forget" what's in the world. For example, if you put a toy on the table, look away, then look back, the toy might not be there anymore. Tripo AI is releasing its Project Eden model, which works very differently: The model builds the world first, and then renders it based on that map. That map holds the real state of the world: the geometry, every object, where things are, what's already happened. The picture you see on screen gets generated from the map. This architecture flips the whole thing. Now, you get the following: 1. The world stops forgetting. Leave, come back, and the toy is still on the table because it lives in the map, not in the last frame you saw. 2. You can edit the world, and those changes persist for anyone who enters later. 3. Multiple people and AI agents can coexist in the world and see it from different perspectives. This is early research, but it's looking really promising. They just raised nearly $200M across two rounds to build it out. Tripo will be at SIGGRAPH 2026 (July 19–23, Los Angeles Convention Center). If you work in 3D, embodied AI, simulation, or anything spatial, go connect with them there.

Santiago

30,104 次观看 • 10 天前

Fable 5 comes back!It can now build playable game prototypes. I think it is actually a signal for where AI coding is going. Making a game is not just “write some code.” Even a small browser game needs: game loop;character movement;collision logic;scoring system;UI states;physics tuning;visual feedback;bug fixing;playtesting This is why game prototyping is a great test for AI models. A model cannot fake it with a pretty answer. Either the game runs, or it does not. What impressed me about Fable 5 is that it is useful for the messy middle: turning an idea into mechanics, turning mechanics into code, debugging broken interactions, and iterating until the prototype feels playable. But here is the practical part: I would not use the strongest model for every step. For game building, I would split the workflow: 1. Fable 5 for game design + architecture 2. a fast coding model for routine implementation 3. a vision-capable model for screenshot/UI feedback 4. a cheaper model for docs, test cases, and small fixes 5. fallback when latency, cost, or output quality becomes a problem That is the real AI coding stack. Not “one magic model does everything.” More like: the right model, for the right task, at the right cost, with fallback when things break. This is why I’ve been looking at ZenMux ZenMux. ZenMux gives developers one gateway to access multiple leading AI models, with OpenAI / Anthropic / Google Vertex compatible APIs, cost tracking, quality benchmarks, auto-routing, and compensation when output quality, latency, or throughput falls short. If AI can now make games, the next question is not just “which model is strongest?” It is:how do we manage the whole model workflow Fable 5 shows the creative ceiling. ZenMux is closer to the infrastructure layer you need when AI coding becomes a real production habit.

Rachel🥥

57,766 次观看 • 2 天前