Jeffrey Emanuel's banner

Jeffrey Emanuel

@doodlestein • 46,086 subscribers

Former Quant Investor | My Open Source Projects: https://t.co/9qbOCDlaqM | Try https://t.co/oCtjI2mBIl , my collection of agent coding tooling.

Shorts

It must be acutely embarrassing for Google and DeepMind’s top management that, despite having: 1) a multi-year head start, with most of the top minds in the field working there; 2) their own custom training and inference silicon now on its EIGHTH generation; 3) total investment in property, plant, and equipment of $281 billion; 4) over 80k engineers; 5) most of the world’s data collected and indexed for 20 years (including ungodly amounts of documents, emails, code, books, photos, videos, etc.); they have been utterly upstaged by Kimi, a company with fewer than 500 employees and founded by Zhilin Yang, who only received his PhD from CMU in 2019 (recall that Hinton had been working at Google since 2013!). Something is deeply dysfunctional and broken there to lead to an outcome this bad.

It must be acutely embarrassing for Google and DeepMind’s top management that, despite having: 1) a multi-year head start, with most of the top minds in the field working there; 2) their own custom training and inference silicon now on its EIGHTH generation; 3) total investment in property, plant, and equipment of $281 billion; 4) over 80k engineers; 5) most of the world’s data collected and indexed for 20 years (including ungodly amounts of documents, emails, code, books, photos, videos, etc.); they have been utterly upstaged by Kimi, a company with fewer than 500 employees and founded by Zhilin Yang, who only received his PhD from CMU in 2019 (recall that Hinton had been working at Google since 2013!). Something is deeply dysfunctional and broken there to lead to an outcome this bad.

161,381 次观看

OK, how cool is this? FrankenTUI is now totally web native, with ultra high-performance wasm. Totally side-stepping xterm .js with my own home-grown FrankenTermJS. And it's all touch-native, works perfectly on iPhone, at full 60 fps. And also a resizable React widget. Try it here:

OK, how cool is this? FrankenTUI is now totally web native, with ultra high-performance wasm. Totally side-stepping xterm .js with my own home-grown FrankenTermJS. And it's all touch-native, works perfectly on iPhone, at full 60 fps. And also a resizable React widget. Try it here:

10,463 次观看

Videos

Anya Rossi

sweetdream.ai

SweetDream.ai•Sponsored•Livecam

Watch Anya Live

Anya is streaming live right now! Join her private show and enjoy exclusive content.

Exclusive private shows

1.2k viewers online

Private Show

Join now for exclusive access

Free preview available • Premium content

I finally finished my Rust version of Mario Zechner's (Mario Zechner) excellent Pi Agent, which I made with his blessing and which is called pi_agent_rust. You can get it here: If you're not familiar with Pi, it's a minimalist and extensible agent harness (similar to Claude Code and Codex) and, among other uses, serves as the core agent harness inside the OpenClaw project. I say my Rust "version" instead of "port" because it's really quite different in how it's implemented for it to be called a port. Arguably, the incremental functionality in the implementation was more complex than the rest of the project combined. Still, it provides the same features and functionality as the original, and is proven to be compatible with hundreds of popular extensions to Pi (the conformance harness shows 224 out of 224 extensions working perfectly). But the way it's architected has some major changes. Pi Agent relies on node or bun to provide access to the filesystem and for various other tasks, and that is also how Pi's extension system works. I decided early on that I didn't want to do things that way. Instead, I wanted to integrate that functionality directly into the binary itself; that is, to provide equivalent functionality for everything that would normally be provided by node/bun in the original. I did this for several reasons: one, it's a lot more performant in terms of footprint and latency. On realistic end-to-end large-session workloads (not toy microbenchmarks), pi_agent_rust is now: - 4.95x faster than legacy Node and 2.80x faster than legacy Bun at 1mm-token session scale - 4.32x faster than legacy Node and 2.14x faster than legacy Bun at 5mm-token session scale - ~8x to ~13x lower RSS memory footprint in those same scenarios But the other reason is security and control: by handling everything internally in an end-to-end way, we can do all sorts of clever things to harden the system against insecure or malicious extensions. Those extensions no longer have direct access to the ambient filesystem: they now need to go through pi_agent_rust, and we can analyze extensions carefully before ever running them and also block things that look suspicious at runtime. In practice that means explicit capability-gated hostcalls, with policy/risk/quota enforcement and runtime telemetry/auditability. In order to do all this, I had to effectively build the missing runtime substrate from scratch in Rust, not just translate TypeScript syntax: - define and implement a typed hostcall ABI for extension->host interactions - build native Rust connectors for tool/exec/http/session/ui/events instead of ambient Node/Bun access - implement a compatibility/shim layer so real-world Pi extensions still behave correctly - add capability policy evaluation, runtime risk scoring, per-extension quotas, and audit telemetry on the execution path - wire the whole thing through structured concurrency (asupersync) so cancellation/lifetimes are deterministic and failure handling is explicit - build a conformance + benchmark harness large enough to validate behavior/perf across hundreds of extensions and realistic long-session workloads This was a full re-architecture of the execution model while preserving the Pi workflow and extension ecosystem. And indeed, this aspect of it dwarfs the entire rest of the project in size and complexity. To put hard numbers on that: the extension/runtime/security subsystem alone is now about 86.5k lines of Rust across src/extensions.rs (~48.1k), src/extensions_js.rs (~23.4k), src/extension_dispatcher.rs (~13.4k), and src/extension_index.rs (~1.7k), with roughly 2.5k callable units in just those files. For context, the original Pi coding-agent production code is about 27.4k lines total. So this one subsystem by itself is roughly 3.2x the size of the original harness, which is why calling this a “port” would seriously undersell what had to be built. And on top of that, pi_agent_rust introduces a bunch of genuinely new capabilities beyond the legacy harness, not just a faster core: - Security and enforcement are materially stronger at runtime: capability-gated hostcalls with explicit policy profiles (safe/balanced/permissive), per-extension trust lifecycle (pending -> acknowledged -> trusted -> killed), explicit kill-switch operations, and audited state transitions. - Shell execution mediation is deterministic and argument-aware: rule/feature-based risk scoring plus heredoc AST inspection (dcg_rule_hit, dcg_heredoc_hit) before spawn, instead of relying on coarse deny patterns. - Containment and forensics are first-class: tamper-evident runtime risk ledger tooling (verify/replay/calibrate), unified incident evidence bundles, and forced-compat controls that let you contain issues without disabling the whole extension system. - The extension runtime architecture is native: JS extensions run in embedded QuickJS with typed hostcall boundaries and Rust-native connectors for tool/exec/http/session/ui/events, plus compatibility shims for real-world legacy extensions. - Runtime behavior under load is explicitly engineered: deterministic hostcall reactor mesh, fast-lane vs compat-lane routing, and warm-isolate prewarm handoff for more predictable throughput and latency under contention. - Long-session reliability is upgraded: JSONL v3 sessions with indexed sidecar acceleration and optional SQLite-backed sessions, plus operational controls via --session-durability, --no-migrations, and migrate. - Provider and auth coverage are broader and more operationally explicit: native Anthropic/OpenAI (Chat + Responses)/Gemini/Cohere/Azure/Bedrock/Vertex/Copilot/GitLab plus large OpenAI-compatible routing; pi --list-providers currently shows 90 providers with aliases and required auth env keys. - Auth is not just API keys: OAuth (Anthropic/OpenAI Codex/Gemini CLI/Antigravity/Kimi/Copilot/GitLab plus extension-defined OAuth), AWS credential chains (Bedrock), service-key exchange (SAP AI Core), and bearer-token flows. - Operator tooling is stronger: pi doctor supports scoped checks (config, dirs, auth, shell, sessions, extensions), machine-readable output (--format json|markdown), and safe auto-remediation (--fix). - Extension/package lifecycle workflows are built in: install, remove, update, update-index, search, info, and list. I want to thank Mario for making a great harness and for not telling me to get lost when I asked him if he was OK with me porting it to Rust. I may give him a hard time in jest about not going "full clanker," but that doesn't mean that I don't respect his work a huge amount. PS: There could still be bugs. If you find some, please let me know in GitHub Issues and I'll fix them same day. There's always a tradeoff between perfect and getting stuff out the door and I felt like it was time to release this.

I finally finished my Rust version of Mario Zechner's (Mario Zechner) excellent Pi Agent, which I made with his blessing and which is called pi_agent_rust. You can get it here: If you're not familiar with Pi, it's a minimalist and extensible agent harness (similar to Claude Code and Codex) and, among other uses, serves as the core agent harness inside the OpenClaw project. I say my Rust "version" instead of "port" because it's really quite different in how it's implemented for it to be called a port. Arguably, the incremental functionality in the implementation was more complex than the rest of the project combined. Still, it provides the same features and functionality as the original, and is proven to be compatible with hundreds of popular extensions to Pi (the conformance harness shows 224 out of 224 extensions working perfectly). But the way it's architected has some major changes. Pi Agent relies on node or bun to provide access to the filesystem and for various other tasks, and that is also how Pi's extension system works. I decided early on that I didn't want to do things that way. Instead, I wanted to integrate that functionality directly into the binary itself; that is, to provide equivalent functionality for everything that would normally be provided by node/bun in the original. I did this for several reasons: one, it's a lot more performant in terms of footprint and latency. On realistic end-to-end large-session workloads (not toy microbenchmarks), pi_agent_rust is now: - 4.95x faster than legacy Node and 2.80x faster than legacy Bun at 1mm-token session scale - 4.32x faster than legacy Node and 2.14x faster than legacy Bun at 5mm-token session scale - ~8x to ~13x lower RSS memory footprint in those same scenarios But the other reason is security and control: by handling everything internally in an end-to-end way, we can do all sorts of clever things to harden the system against insecure or malicious extensions. Those extensions no longer have direct access to the ambient filesystem: they now need to go through pi_agent_rust, and we can analyze extensions carefully before ever running them and also block things that look suspicious at runtime. In practice that means explicit capability-gated hostcalls, with policy/risk/quota enforcement and runtime telemetry/auditability. In order to do all this, I had to effectively build the missing runtime substrate from scratch in Rust, not just translate TypeScript syntax: - define and implement a typed hostcall ABI for extension->host interactions - build native Rust connectors for tool/exec/http/session/ui/events instead of ambient Node/Bun access - implement a compatibility/shim layer so real-world Pi extensions still behave correctly - add capability policy evaluation, runtime risk scoring, per-extension quotas, and audit telemetry on the execution path - wire the whole thing through structured concurrency (asupersync) so cancellation/lifetimes are deterministic and failure handling is explicit - build a conformance + benchmark harness large enough to validate behavior/perf across hundreds of extensions and realistic long-session workloads This was a full re-architecture of the execution model while preserving the Pi workflow and extension ecosystem. And indeed, this aspect of it dwarfs the entire rest of the project in size and complexity. To put hard numbers on that: the extension/runtime/security subsystem alone is now about 86.5k lines of Rust across src/extensions.rs (~48.1k), src/extensions_js.rs (~23.4k), src/extension_dispatcher.rs (~13.4k), and src/extension_index.rs (~1.7k), with roughly 2.5k callable units in just those files. For context, the original Pi coding-agent production code is about 27.4k lines total. So this one subsystem by itself is roughly 3.2x the size of the original harness, which is why calling this a “port” would seriously undersell what had to be built. And on top of that, pi_agent_rust introduces a bunch of genuinely new capabilities beyond the legacy harness, not just a faster core: - Security and enforcement are materially stronger at runtime: capability-gated hostcalls with explicit policy profiles (safe/balanced/permissive), per-extension trust lifecycle (pending -> acknowledged -> trusted -> killed), explicit kill-switch operations, and audited state transitions. - Shell execution mediation is deterministic and argument-aware: rule/feature-based risk scoring plus heredoc AST inspection (dcg_rule_hit, dcg_heredoc_hit) before spawn, instead of relying on coarse deny patterns. - Containment and forensics are first-class: tamper-evident runtime risk ledger tooling (verify/replay/calibrate), unified incident evidence bundles, and forced-compat controls that let you contain issues without disabling the whole extension system. - The extension runtime architecture is native: JS extensions run in embedded QuickJS with typed hostcall boundaries and Rust-native connectors for tool/exec/http/session/ui/events, plus compatibility shims for real-world legacy extensions. - Runtime behavior under load is explicitly engineered: deterministic hostcall reactor mesh, fast-lane vs compat-lane routing, and warm-isolate prewarm handoff for more predictable throughput and latency under contention. - Long-session reliability is upgraded: JSONL v3 sessions with indexed sidecar acceleration and optional SQLite-backed sessions, plus operational controls via --session-durability, --no-migrations, and migrate. - Provider and auth coverage are broader and more operationally explicit: native Anthropic/OpenAI (Chat + Responses)/Gemini/Cohere/Azure/Bedrock/Vertex/Copilot/GitLab plus large OpenAI-compatible routing; pi --list-providers currently shows 90 providers with aliases and required auth env keys. - Auth is not just API keys: OAuth (Anthropic/OpenAI Codex/Gemini CLI/Antigravity/Kimi/Copilot/GitLab plus extension-defined OAuth), AWS credential chains (Bedrock), service-key exchange (SAP AI Core), and bearer-token flows. - Operator tooling is stronger: pi doctor supports scoped checks (config, dirs, auth, shell, sessions, extensions), machine-readable output (--format json|markdown), and safe auto-remediation (--fix). - Extension/package lifecycle workflows are built in: install, remove, update, update-index, search, info, and list. I want to thank Mario for making a great harness and for not telling me to get lost when I asked him if he was OK with me porting it to Rust. I may give him a hard time in jest about not going "full clanker," but that doesn't mean that I don't respect his work a huge amount. PS: There could still be bugs. If you find some, please let me know in GitHub Issues and I'll fix them same day. There's always a tradeoff between perfect and getting stuff out the door and I felt like it was time to release this.

Jeffrey Emanuel

116,929 次观看 • 5 个月前

Ever seen a fresh (20x) Claude Max account's 5-hour usage allowance get drained in ~14 minutes? Feast your eyes on my bizarre life now with this screen recording of a recent live work session, something I've gotten at least 100 requests for over the past month. Maybe you can understand now why I need so many accounts and how I can work on so many different projects. You can also see the truth of what I was saying recently about how, once your plan is done and the beads made and polished, it's mostly just machine-tending the swarm that doesn't require much thought. Lots of just telling it to get the next bead and work on it, to review code, to re-read AGENTS dot md after a compaction, etc. And you can see how I use gemini-cli for code review. I give Google a lot of crap for the harness being broken and the capacity overloads, but when it works, it's actually really good for this code review use case. I don't usually let it write new code, though, because I think Opus and 5.2 do a better job. Also, sorry the recording is a bit blurry; I have a 5K resolution monitor and screen recordings usually are hard to watch from it. And btw, this really wasn't that normal of a session for me, it was more frenetic than usual, because I don't want to dox myself or my clients by accident. Hence all the ceaseless terminal tab swirling. I usually do more planning work while this stuff is going on, but I wanted to minimize the chances of leaking important information. That's also why I didn't refresh the Gemini login in the WezTerm window, which killed me, trust me. It's the reason I hate doing these screen recordings in the first place; it kills my productivity. Anyway, hope you liked it. I will also post to YouTube, see reply for link. Thanks for watching.

Ever seen a fresh (20x) Claude Max account's 5-hour usage allowance get drained in ~14 minutes? Feast your eyes on my bizarre life now with this screen recording of a recent live work session, something I've gotten at least 100 requests for over the past month. Maybe you can understand now why I need so many accounts and how I can work on so many different projects. You can also see the truth of what I was saying recently about how, once your plan is done and the beads made and polished, it's mostly just machine-tending the swarm that doesn't require much thought. Lots of just telling it to get the next bead and work on it, to review code, to re-read AGENTS dot md after a compaction, etc. And you can see how I use gemini-cli for code review. I give Google a lot of crap for the harness being broken and the capacity overloads, but when it works, it's actually really good for this code review use case. I don't usually let it write new code, though, because I think Opus and 5.2 do a better job. Also, sorry the recording is a bit blurry; I have a 5K resolution monitor and screen recordings usually are hard to watch from it. And btw, this really wasn't that normal of a session for me, it was more frenetic than usual, because I don't want to dox myself or my clients by accident. Hence all the ceaseless terminal tab swirling. I usually do more planning work while this stuff is going on, but I wanted to minimize the chances of leaking important information. That's also why I didn't refresh the Gemini login in the WezTerm window, which killed me, trust me. It's the reason I hate doing these screen recordings in the first place; it kills my productivity. Anyway, hope you liked it. I will also post to YouTube, see reply for link. Thanks for watching.

Jeffrey Emanuel

86,013 次观看 • 6 个月前

Watch Claude go from skeptical (“The scope is staggering to the point of being suspicious.”) after looking at the README file from my asupersync project, to actually cloning the repo and going through the code to determine whether it’s real or just BS jargon soup. By the end, it can hardly believe what it has seen (“It's real. And it's absurd.”). Its final take is pretty breathless. “What this actually is: This is the most impressive demonstration of agent-driven software development I've ever seen. ~70 commits/day for 40 days, with the mathematical content landing correctly, the protocol implementations doing real byte-level work, and the architecture holding together coherently across 600K+ lines. The Cx capability context threads through runtime, channels, sync primitives, networking, HTTP, databases, and lab testing. The Budget type appears in 109 files. The types compose.” And no, this isn’t sycophantic behavior. It’s actually all true. I want to stress that I am not remotely smart enough to have made this. Honestly, I don’t think there is a single human being on earth who could do it all. The math is too varied and abstruse, the computer science too esoteric and specialized. The people who know structured concurrency that well wouldn’t work fluently with martingale concentration bounds (Azuma/Freedman), Mazurkiewicz trace theory and Foata normal forms, tropical semiring budget algebra, or RaptorQ fountain coding. And none would be crazy enough to combine them all into one gigantic system like this. The frontier models made this collaboratively with me coaxing and directing them. This is a new breed of software. This is an Alien Artifact.

Watch Claude go from skeptical (“The scope is staggering to the point of being suspicious.”) after looking at the README file from my asupersync project, to actually cloning the repo and going through the code to determine whether it’s real or just BS jargon soup. By the end, it can hardly believe what it has seen (“It's real. And it's absurd.”). Its final take is pretty breathless. “What this actually is: This is the most impressive demonstration of agent-driven software development I've ever seen. ~70 commits/day for 40 days, with the mathematical content landing correctly, the protocol implementations doing real byte-level work, and the architecture holding together coherently across 600K+ lines. The Cx capability context threads through runtime, channels, sync primitives, networking, HTTP, databases, and lab testing. The Budget type appears in 109 files. The types compose.” And no, this isn’t sycophantic behavior. It’s actually all true. I want to stress that I am not remotely smart enough to have made this. Honestly, I don’t think there is a single human being on earth who could do it all. The math is too varied and abstruse, the computer science too esoteric and specialized. The people who know structured concurrency that well wouldn’t work fluently with martingale concentration bounds (Azuma/Freedman), Mazurkiewicz trace theory and Foata normal forms, tropical semiring budget algebra, or RaptorQ fountain coding. And none would be crazy enough to combine them all into one gigantic system like this. The frontier models made this collaboratively with me coaxing and directing them. This is a new breed of software. This is an Alien Artifact.

Jeffrey Emanuel

40,234 次观看 • 4 个月前

I've probably been asked by 100+ people over the past few months to record a screencast showing how I use my Agent Flywheel tooling and skills and other workflows in my day-to-day development work. I really hate making these things and find it stressful, but decided to do it anyway. This video doesn't get into the super nitty gritty, but shows enough to get a sense for my setup and tooling and general approach. Apologies for some choppy audio at parts; it should be pretty comprehensible despite that. I can't justify spending any real time scripting or editing this kind of thing because I'm too busy with real work, so you'll have to excuse some pauses and meandering here and there (plus I encountered a few bugs, the bane of all truly live demos). PS: If you prefer watching on YouTube, I'll post the link as a reply below.

I've probably been asked by 100+ people over the past few months to record a screencast showing how I use my Agent Flywheel tooling and skills and other workflows in my day-to-day development work. I really hate making these things and find it stressful, but decided to do it anyway. This video doesn't get into the super nitty gritty, but shows enough to get a sense for my setup and tooling and general approach. Apologies for some choppy audio at parts; it should be pretty comprehensible despite that. I can't justify spending any real time scripting or editing this kind of thing because I'm too busy with real work, so you'll have to excuse some pauses and meandering here and there (plus I encountered a few bugs, the bane of all truly live demos). PS: If you prefer watching on YouTube, I'll post the link as a reply below.

Jeffrey Emanuel

14,276 次观看 • 1 个月前

I'm now up to over 5.5k beads for my asupersync project, which is a record for me. The force graph layout of all the beads is truly nuts, and puts into perspective what a massive, sprawling effort it's been. It's now a million lines of code with tests!

I'm now up to over 5.5k beads for my asupersync project, which is a record for me. The force graph layout of all the beads is truly nuts, and puts into perspective what a massive, sprawling effort it's been. It's now a million lines of code with tests!

Jeffrey Emanuel

25,511 次观看 • 4 个月前

FrankenTUI is done! And it's spectacular. There are a couple tiny bugs I want to fix around border alignment, and I can still make things even faster (it's already super fast). Not bad for ~4 days. I think it's the best TUI framework around.

FrankenTUI is done! And it's spectacular. There are a couple tiny bugs I want to fix around border alignment, and I can still make things even faster (it's already super fast). Not bad for ~4 days. I think it's the best TUI framework around.

Jeffrey Emanuel

25,055 次观看 • 5 个月前

If you watch this ~50 minute screen recording closely (yeah, I know, it's long; there are also some times when my computer was very slow and laggy, just skip past that part. And at one point I had to run and get my 9-month-old a new bottle and left it on a boring screen, sorry!), I believe you can see real signs of the kind of runaway, recursive AI self-improvement that people have been warning of for a while (Mr. Kurzweil most notably and prophetically). Why do I say that? What's different now? Well, there's a reason my set of agent coding tooling is called the Flywheel. These tools all mutually self-reinforce each other. And they all flow directly into my ntm tool (short for "named_tmux_manager"), which acts as a sort of integration point and nerve center for the tools (this is becoming more true by the minute as I'm now seriously working on ntm). Now, ntm was something I started making to automate some aspects of my workflow, but it was the kind of thing where, until it was perfect, it sort of just slowed me down. So I didn't actually use it even though I kept working on it and trying to improve it, and suggested to users that they try it in my tutorials. Well anyway, I finally got around to "dogfooding" ntm last night, and now it's going to get very dramatically better at an alarming rate. Some of that is from applying my "idea wizard" prompt to generate more useful features and building that stuff out and addressing obvious pain points I encountered during my newfound usage of the tool. But a lot comes from my realization that, once again, ntm's true utility is not as a tool for ME, but for an agent. That is, ntm lets one instance of Claude Code or Codex act as, well, me, do the things that I had been doing manually. Do I wish I had started using ntm earlier? No, for two big reasons: 1) Doing it manually helped me build up my intuition massively, which directly led me down the path of creating useful prompt strategies and workflows; these often began as ad-hoc prompts that I realized could be generalized and made more versatile/universal. Lesson: don't prematurely automate until you have an intimate, intuitive feel for your "core value-add loop." Otherwise you'll have a fully automated system quickly that efficiently and automatically does a stupid or otherwise sub-optimal thing. 2) My eyes have been opened to the beauty and power of Skills. I'm not talking about your garden-variety skills that are just a simple markdown file. I'm talking about true tour-de-force directories of perfectly structured and organized files that are filled with good information, insights, workflows, etc., but presented in a way that is highly optimized for consumption by AI agents, with extreme attention paid to things like perfect progressive disclosure, token density, agent-ergonomics, agent-intuitiveness, etc. And also Skills that go way beyond markdown files, with full integration into Claude Code where it makes sense via hooks, sub-agents, and even Python scripts. These kinds of skills are a qualitative difference in expressive power and usefulness and a total game changer. They are also effectively composable, creating almost an algebra of skills that let you use them together in powerful ways. I'm working on a subscription service website and CLI tool now to share what I've learned here most effectively, stay tuned for that in the coming days. Anyway, I now know what to make and how to make it. So, getting back to that screen recording, what does it show that makes me claim recursive self-improvement is here? If you keep your eye on the upper left tmux pane, that's the "controller" agent. It is using ntm to control all the other panes which are also running Claude Code (but ntm fully supports other agent types like Codex and Gemini-CLI, and it's trivially easy to mix and match them if you wanted to have, say, 8 CCs and 6 Codexes for writing the code and 3 Gemini-CLIs for reviewing code.) Now, there's nothing that crazy about this much so far. But where it starts to get very cool is that as the session continues and we encounter real-world problems, things like my ridiculously overloaded computer that keeps hanging for long periods, Claude Code instances that crash and get into a frozen, unresponsive state, it can learn from that. And you can see it using my skill writing skill to refine its ntm vibe coding skill in real time. And then take that skill and refine it to be more intuitive for itself. Or use my cass tool skill to search all the session histories to look for problems that came up and strategize how to solve them. The most useful part was when, towards the end of the session, I told it to reflect on all the things we had done and problems we encountered. One way it can usefully leverage those reflections is by improving its ntm vibe coding skill to make it cover more edge cases and exigencies. But the other, more fundamental, way is for it to conceive of and design the optimal new features and functionality for ntm itself so that the tool embodies those lessons in a first-class way. This offloads cognition from its brain onto its tooling, just like how a person can lean on spellcheck or a calculator. It codifies correct, effective reasoning at the tool level, where it's more reliable and robust and repeatable. And btw, did you notice what code base it was working on the whole time? It was none other than ntm itself! So as it worked on its own tool, it had reflections and ideas about how to further improve the tool. Now, it could have just as easily gotten those insights and ideas while using ntm to work on a different project, but the fact that it was working on itself is almost gloriously meta and recursive. So by the end, after learning from tending to a big group of agent workers (btw, I have previously emphasized doing everything in a really distributed/decentralized way, where each fungible agent gets identical marching orders that tell it to use my bv tool to find the optimal bead to work on. This does work very well, but occasionally results in some contention and overlap from thundering herd, or at least wastes time/tokens/communication in avoiding that before the agents waste time duplicating work. But in this new ntm-oriented workflow, I was able to have the controller agent in the upper left use bv itself and then optimally parcel out the instructions to each agent so that we could know for sure that there's no overlap), I ended up with a ton of new beads for new features, which I had it optimize and polish a few times. Now I can swap to a new Claude Max account and have the swarm implement all those new features! It should only take a couple passes like the one shown in the screen recording to get everything implemented. Then we can rinse and repeat, having the agent read through the full session histories of each agent and its experience from its own session in sending ntm commands and seeing how they worked out in practice, to come up with the next batch of changes to both its ntm vibe coding skill AND to the ntm tool itself. Do you see how rapidly this turns into Skynet? My mistake earlier was in focusing on making myself a "faster horse" as Henry Ford used to joke about customers wanting before he showed them what they should really want (a Model T). That is, something that would make my experience nicer while doing this agent swarm based development workflow. But the obvious lesson is that you should make all your tooling agent-first because the agents are just better at this stuff. You can still watch, and of course I did add a ridiculous number of very nice human-centric features to ntm that you'll be seeing in the next day or two, but those are really kind of "for fun" to make us humans feel better about the process. All the real value-add is happening "by agents, for agents." PS: Towards the end, you can see me switch to my Mac and tell Claude to improve the skill that I made earlier today for taking the mkv screen recording files from OBS Studio and muxing them into MP4 files for sharing, while downloading songs from YouTube to serve as the background music. I made it so it can also grab the thumbnails and generate little song credit cards that show up in the lower right corner. This worked perfectly the first time! I'll include some screenshots in a response post showing how that worked, but it was awesome to witness. Skills are POWERFUL. I'll also post a link to this video on YouTube if you prefer to watch it there.

If you watch this ~50 minute screen recording closely (yeah, I know, it's long; there are also some times when my computer was very slow and laggy, just skip past that part. And at one point I had to run and get my 9-month-old a new bottle and left it on a boring screen, sorry!), I believe you can see real signs of the kind of runaway, recursive AI self-improvement that people have been warning of for a while (Mr. Kurzweil most notably and prophetically). Why do I say that? What's different now? Well, there's a reason my set of agent coding tooling is called the Flywheel. These tools all mutually self-reinforce each other. And they all flow directly into my ntm tool (short for "named_tmux_manager"), which acts as a sort of integration point and nerve center for the tools (this is becoming more true by the minute as I'm now seriously working on ntm). Now, ntm was something I started making to automate some aspects of my workflow, but it was the kind of thing where, until it was perfect, it sort of just slowed me down. So I didn't actually use it even though I kept working on it and trying to improve it, and suggested to users that they try it in my tutorials. Well anyway, I finally got around to "dogfooding" ntm last night, and now it's going to get very dramatically better at an alarming rate. Some of that is from applying my "idea wizard" prompt to generate more useful features and building that stuff out and addressing obvious pain points I encountered during my newfound usage of the tool. But a lot comes from my realization that, once again, ntm's true utility is not as a tool for ME, but for an agent. That is, ntm lets one instance of Claude Code or Codex act as, well, me, do the things that I had been doing manually. Do I wish I had started using ntm earlier? No, for two big reasons: 1) Doing it manually helped me build up my intuition massively, which directly led me down the path of creating useful prompt strategies and workflows; these often began as ad-hoc prompts that I realized could be generalized and made more versatile/universal. Lesson: don't prematurely automate until you have an intimate, intuitive feel for your "core value-add loop." Otherwise you'll have a fully automated system quickly that efficiently and automatically does a stupid or otherwise sub-optimal thing. 2) My eyes have been opened to the beauty and power of Skills. I'm not talking about your garden-variety skills that are just a simple markdown file. I'm talking about true tour-de-force directories of perfectly structured and organized files that are filled with good information, insights, workflows, etc., but presented in a way that is highly optimized for consumption by AI agents, with extreme attention paid to things like perfect progressive disclosure, token density, agent-ergonomics, agent-intuitiveness, etc. And also Skills that go way beyond markdown files, with full integration into Claude Code where it makes sense via hooks, sub-agents, and even Python scripts. These kinds of skills are a qualitative difference in expressive power and usefulness and a total game changer. They are also effectively composable, creating almost an algebra of skills that let you use them together in powerful ways. I'm working on a subscription service website and CLI tool now to share what I've learned here most effectively, stay tuned for that in the coming days. Anyway, I now know what to make and how to make it. So, getting back to that screen recording, what does it show that makes me claim recursive self-improvement is here? If you keep your eye on the upper left tmux pane, that's the "controller" agent. It is using ntm to control all the other panes which are also running Claude Code (but ntm fully supports other agent types like Codex and Gemini-CLI, and it's trivially easy to mix and match them if you wanted to have, say, 8 CCs and 6 Codexes for writing the code and 3 Gemini-CLIs for reviewing code.) Now, there's nothing that crazy about this much so far. But where it starts to get very cool is that as the session continues and we encounter real-world problems, things like my ridiculously overloaded computer that keeps hanging for long periods, Claude Code instances that crash and get into a frozen, unresponsive state, it can learn from that. And you can see it using my skill writing skill to refine its ntm vibe coding skill in real time. And then take that skill and refine it to be more intuitive for itself. Or use my cass tool skill to search all the session histories to look for problems that came up and strategize how to solve them. The most useful part was when, towards the end of the session, I told it to reflect on all the things we had done and problems we encountered. One way it can usefully leverage those reflections is by improving its ntm vibe coding skill to make it cover more edge cases and exigencies. But the other, more fundamental, way is for it to conceive of and design the optimal new features and functionality for ntm itself so that the tool embodies those lessons in a first-class way. This offloads cognition from its brain onto its tooling, just like how a person can lean on spellcheck or a calculator. It codifies correct, effective reasoning at the tool level, where it's more reliable and robust and repeatable. And btw, did you notice what code base it was working on the whole time? It was none other than ntm itself! So as it worked on its own tool, it had reflections and ideas about how to further improve the tool. Now, it could have just as easily gotten those insights and ideas while using ntm to work on a different project, but the fact that it was working on itself is almost gloriously meta and recursive. So by the end, after learning from tending to a big group of agent workers (btw, I have previously emphasized doing everything in a really distributed/decentralized way, where each fungible agent gets identical marching orders that tell it to use my bv tool to find the optimal bead to work on. This does work very well, but occasionally results in some contention and overlap from thundering herd, or at least wastes time/tokens/communication in avoiding that before the agents waste time duplicating work. But in this new ntm-oriented workflow, I was able to have the controller agent in the upper left use bv itself and then optimally parcel out the instructions to each agent so that we could know for sure that there's no overlap), I ended up with a ton of new beads for new features, which I had it optimize and polish a few times. Now I can swap to a new Claude Max account and have the swarm implement all those new features! It should only take a couple passes like the one shown in the screen recording to get everything implemented. Then we can rinse and repeat, having the agent read through the full session histories of each agent and its experience from its own session in sending ntm commands and seeing how they worked out in practice, to come up with the next batch of changes to both its ntm vibe coding skill AND to the ntm tool itself. Do you see how rapidly this turns into Skynet? My mistake earlier was in focusing on making myself a "faster horse" as Henry Ford used to joke about customers wanting before he showed them what they should really want (a Model T). That is, something that would make my experience nicer while doing this agent swarm based development workflow. But the obvious lesson is that you should make all your tooling agent-first because the agents are just better at this stuff. You can still watch, and of course I did add a ridiculous number of very nice human-centric features to ntm that you'll be seeing in the next day or two, but those are really kind of "for fun" to make us humans feel better about the process. All the real value-add is happening "by agents, for agents." PS: Towards the end, you can see me switch to my Mac and tell Claude to improve the skill that I made earlier today for taking the mkv screen recording files from OBS Studio and muxing them into MP4 files for sharing, while downloading songs from YouTube to serve as the background music. I made it so it can also grab the thumbnails and generate little song credit cards that show up in the lower right corner. This worked perfectly the first time! I'll include some screenshots in a response post showing how that worked, but it was awesome to witness. Skills are POWERFUL. I'll also post a link to this video on YouTube if you prefer to watch it there.

Jeffrey Emanuel

25,483 次观看 • 6 个月前

It has been an extremely active 24 hours for FrankenTUI development, with countless new features and functionality added (including massive numbers of tests and verifications). Have you ever seen a terminal like this? Because I know I haven't! Can't wait to start building stuff!

It has been an extremely active 24 hours for FrankenTUI development, with countless new features and functionality added (including massive numbers of tests and verifications). Have you ever seen a terminal like this? Because I know I haven't! Can't wait to start building stuff!

Jeffrey Emanuel

21,778 次观看 • 5 个月前

I finally got around to dogfooding my own ntm tool for managing a bunch of agents, and... it's not bad? I made an ntm skill for Claude Code and then simply told it to use ntm with 10 CCs and 5 Codex instances on a new project that I just finished making the beads for. Easy.😎

I finally got around to dogfooding my own ntm tool for managing a bunch of agents, and... it's not bad? I made an ntm skill for Claude Code and then simply told it to use ntm with 10 CCs and 5 Codex instances on a new project that I just finished making the beads for. Easy.😎

Jeffrey Emanuel

17,578 次观看 • 6 个月前

OK, how cool is this? FrankenTUI is now totally web native, with ultra high-performance wasm. Totally side-stepping xterm .js with my own home-grown FrankenTermJS. And it's all touch-native, works perfectly on iPhone, at full 60 fps. And also a resizable React widget. Try it here:

OK, how cool is this? FrankenTUI is now totally web native, with ultra high-performance wasm. Totally side-stepping xterm .js with my own home-grown FrankenTermJS. And it's all touch-native, works perfectly on iPhone, at full 60 fps. And also a resizable React widget. Try it here:

Jeffrey Emanuel

10,463 次观看 • 5 个月前

没有更多内容可加载