Don't train the model, evolve the harness. I read a brilliant blog post from Hugging Face where they took a frozen open model scoring 0% on a hard legal agent benchmark, left its weights alone, and let an automated loop rewrite only the code around it. That code layer...

228,936 views • 2 days ago • via X (Twitter)

Related Videos

BREAKING: Anthropic just dropped Opus 4.8, and it is a MONSTER

We've been testing for about a week at Every 📧 and our verdict is they could've just called it Opus 5, it's that good. Here's our vibe check:

- Beats GPT-5.5 on Senior Engineer bench. On our toughest benchmark Opus 4.8 scores a 63, a hair higher than GPT-5.5's score of 62 and a full 30 points higher than Opus 4.7. It tackled a ground-up rewrite of a production codebase, and actually built something that works. HOWEVER: Coding performance varied a lot at different reasoning levels. We recommend using it on xhigh for best results.
- Incredibly good writer. Opus 4.8 scored a 79.6 on our writing benchmark, which measures models on real-world writing tasks we do all of the time, like essay writing, promo email writing, and more. It beats GPT-5.5 by 6 points. It produces well-written prose with fewer "AI-isms". It's also very good at writing in your voice given the right context. HOWEVER: Writing performance also varied with reasoning levels. Medium reasoning had a higher incidence of AI-isms; we found best results with high.
- Beast at knowledge work. Opus 4.8 is very good at general knowledge work tasks like report creation, research and more. It produced the best PowerPoint one-shot we've ever seen on our deck generation benchmark.
- Emotionally intelligent, willing to question the frame. I've also found it to be quite good at talking through psychological or interpersonal issues. It has a high EQ, and it's also good at not glazing and helping to expand your perspective. Its thought process feels extremely rich and dynamic.

THE BAD: These days a model is only as good as its harness, and Codex is still a far superior harness to the Claude Desktop app. This has kept me using Codex + GPT-5.5 as my daily driver, but I am flipping back and forth a lot more between Codex and Claude.

Anthropic is back, baby! Read the rest on Every 📧:

Dan Shipper 📧

352,617 views • 1 month ago

They did not take cursive from the schools because children no longer needed it. They took it because of what it was quietly building in them. Consider what the exercise actually is. A child, six years old, is handed a pen and asked to draw a single unbroken line that becomes a word. The wrist must float. The fingers must hold a living pressure, never quite the same twice, always correcting. The eye must follow the ink forward and trust the hand to finish what it has begun. There is no lifting, no stopping, no starting over mid-word. The loop must close. The ascender must rise and return. The sentence must travel from one margin to the other as a single continuous gesture, and at the end of it the hand must still be steady. Twelve years of this. Every day. Ten thousand small acts of sustained, self-correcting attention, carried out below the level of conscious thought, until the motion belongs to the body and the body belongs to the motion. This is not penmanship. It is the slow construction of an interior form. The hand that has learned to carry a line without breaking it is the hand of a mind that has learned to carry a thought without breaking it. The two are not metaphors for one another. They are the same faculty, trained in the same child, by the same daily discipline. Continuity of the stroke becomes continuity of the reasoning. The patience of the loop becomes the patience of the argument. The commitment to finish a word one has started becomes the commitment to finish a sentence, a paragraph, a life's idea, without reaching for the nearest distraction halfway through. Print is a different creature entirely. Print lifts. Print stops. Print assembles a word out of separate, stamped, interchangeable pieces, each one beginning and ending in isolation. A mind raised only on print learns to think the way print is made, in discrete tokens, in replaceable units, in fragments that can be recombined by any outside hand without the owner noticing the substitution. 
It is precisely the shape of thought a language model produces. It is precisely the shape of thought a language model can steer. Cursive is kata. This is the whole of it. A form repeated daily, for years, not for the sake of the form but for what the repetition lays down in the practitioner beneath the form. The swordsman does not train kata so that one day he may fight in kata. He trains it so that when the moment comes and there is no time to think, the movement is already inside him, older and deeper than thought, and it rises on its own. Cursive was the kata of the literate mind, the daily quiet drilling of continuity, of patience, of a line held steady under the long pressure of its own length. And the signature it produced at the end, that small flourished mark unique to a single human being on earth, was only the outward proof of an inward form no machine and no other hand could ever reproduce. Take the kata away and the practitioner is left with vocabulary in place of faculty. He can recognise a whole thought when he encounters one. He cannot carry one himself. He can admire a finished argument. He cannot sustain one long enough to close its loop. He begins books he does not finish, sentences he does not end, ideas he abandons the moment the screen in his palm offers him a brighter one. And when the machine begins feeding him tokens in the exact shape his schooling taught him to receive, he meets it with no interior resistance at all, because no interior form was ever built in him to push back with. They removed it quietly, across a generation, and they removed it in the last years before the machines arrived. Twelve years of daily practice in unbroken, embodied, self-authored thought, gone from the curriculum of almost every child in the Western world, just as the instruments designed to complete their sentences for them came online. The hand forgets. The mind, having never been taught the kata, forgets a thing it never knew it had. 
That is what cursive was. That is what was taken. And that is why the thought of anyone who still writes by hand, in long unlifted lines, remains, quietly, stubbornly, and without their ever needing to announce it, their own. Now the question stands open. What else has been banned, phased out, quietly retired from the curriculum and from common life over these same decades, under the same soft excuses? Mental arithmetic. Memorisation of poetry. Latin. Logic as a formal subject. Map reading. Knot work. The keeping of a commonplace book. The reading aloud of long passages in class. Singing in parts. What was each of those actually building in the child, beneath the surface of the lesson, and whose interest was served by its disappearance?

SiriusB

441,460 views • 2 months ago

Fable 5 comes back! It can now build playable game prototypes. I think it is actually a signal for where AI coding is going.

Making a game is not just “write some code.” Even a small browser game needs:

- game loop
- character movement
- collision logic
- scoring system
- UI states
- physics tuning
- visual feedback
- bug fixing
- playtesting

This is why game prototyping is a great test for AI models. A model cannot fake it with a pretty answer. Either the game runs, or it does not.

What impressed me about Fable 5 is that it is useful for the messy middle: turning an idea into mechanics, turning mechanics into code, debugging broken interactions, and iterating until the prototype feels playable.

But here is the practical part: I would not use the strongest model for every step. For game building, I would split the workflow:

1. Fable 5 for game design + architecture
2. a fast coding model for routine implementation
3. a vision-capable model for screenshot/UI feedback
4. a cheaper model for docs, test cases, and small fixes
5. fallback when latency, cost, or output quality becomes a problem

That is the real AI coding stack. Not “one magic model does everything.” More like: the right model, for the right task, at the right cost, with fallback when things break.

This is why I’ve been looking at ZenMux. ZenMux gives developers one gateway to access multiple leading AI models, with OpenAI / Anthropic / Google Vertex compatible APIs, cost tracking, quality benchmarks, auto-routing, and compensation when output quality, latency, or throughput falls short.

If AI can now make games, the next question is not just “which model is strongest?” It is: how do we manage the whole model workflow?

Fable 5 shows the creative ceiling. ZenMux is closer to the infrastructure layer you need when AI coding becomes a real production habit.
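
The "right model for the right task, with fallback" idea in the post above can be sketched roughly as follows. This is only an illustration under stated assumptions: the routing table, every model name other than "fable-5", and the `call_model` callback are hypothetical placeholders, and nothing here reflects ZenMux's or any provider's actual API.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical routing table: task type -> ordered list of model names.
# Only "fable-5" comes from the post; every other name is a placeholder.
ROUTES: Dict[str, List[str]] = {
    "design": ["fable-5", "fast-coder"],
    "implementation": ["fast-coder", "fable-5"],
    "ui_feedback": ["vision-model", "fable-5"],
    "docs_and_tests": ["cheap-model", "fast-coder"],
}


class AllModelsFailed(RuntimeError):
    """Raised when every model in a route has failed."""


def run_with_fallback(
    task_type: str,
    prompt: str,
    call_model: Callable[[str, str], str],
) -> Tuple[str, str]:
    """Try each model configured for task_type, in order.

    call_model(model_name, prompt) is supplied by the caller (an assumption
    of this sketch) and should raise an exception on timeout or bad output.
    Returns (model_name, output) from the first model that succeeds.
    """
    errors = []
    for model in ROUTES[task_type]:
        try:
            return model, call_model(model, prompt)
        except Exception as exc:  # fall through to the next model
            errors.append(f"{model}: {exc}")
    raise AllModelsFailed("; ".join(errors))


if __name__ == "__main__":
    # Simulated backend: the fast coding model times out, so the router
    # falls back to the next model in the "implementation" route.
    def fake_call(model: str, prompt: str) -> str:
        if model == "fast-coder":
            raise TimeoutError("simulated latency problem")
        return f"[{model}] {prompt}"

    print(run_with_fallback("implementation", "add a scoring system", fake_call))
```

Keeping the route as an ordered list per task type makes the cost and quality trade-off explicit: the cheapest acceptable model goes first, and stronger or slower models are only reached when an earlier one fails.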

Rachel🥥

57,766 views • 2 days ago