everyone in iOS development should watch this. seriously, it might change the whole industry.

i pointed claude code at a live ios device running on revyl, typed "test everything," and walked away. here's what's actually happening:

① you don't write the tests. no scripts, no selectors, no test plan.... i never told it which screens to open or what to check. it read the app, decided what mattered, and tested it. the entire instruction was "test everything."

② it built its own test team. it looked at the app, clocked that it's basically four mini apps (rides, delivery, services, account), and split itself into 4 agents, one per surface. scoping coverage like that is usually a person's whole afternoon. it did it in seconds, unprompted.

③ all four ran at the same time, each on its own live device. this is where revyl comes in. every agent gets its own live ios session in the cloud, so four running apps get tested in parallel instead of taking turns on one simulator. serial testing turns coverage into a time tax. running all of it at once removes the tax.

④ it tests like a person, not like a script. each agent drives the app the way a user would, taps through the flows, and visually checks each screen against what it expected to see. nothing is pinned to a brittle element id, so renaming a button doesn't take down half your suite. that one detail is the most annoying thing about how we test today, and it just quietly goes away.

⑤ no xcuitest, no sims melting your laptop. i didn't write a single xcuitest script, and there were no simulators booting on my machine. the agents run on cloud devices, so coverage stops being capped by what your laptop can handle.

the part that got me isn't that an agent tested an app. it's that i never told it how. i handed it a device and an intent, and it figured out the scoping, the parallelizing, and the driving on its own. if you still write and maintain mobile ui tests by hand, i'm not sure that lasts the year.

Landseer Enga

2,489 subscribers

23,963 views • 1 month ago • via X (Twitter)

Education Science & Technology

Related Videos

Agents can one-shot mobile apps, but testing is still the bottleneck. So we built a CLI that gives Claude Code the one thing it was missing: eyes and hands.

The best part? It's fully vision-based:
- No scripts, no selectors, no element IDs
- It interacts with the app exactly like a human would

Claude now writes code → tests it on the app → sees what broke → fixes it.

Spawn a cloud agent, go to sleep, and wake up knowing it actually worked.

Landseer Enga

86,843 views • 4 months ago

claude code now writes its own test plan and runs it on a live iphone. i didn't write the tests. i didn't run the tests. i didn't even open the app. it caught the bug too.

Landseer Enga

30,714 views • 1 month ago

THIS GUY CONNECTED HIS AI AGENTS TO HIS OBSIDIAN AND BUILT A BRAIN THAT LEARNS ON ITS OWN. HERE'S HOW TO BUILD IT

Obsidian is just markdown files sitting in a folder. That turns out to be the perfect memory for an AI agent, because an agent can read and write those files directly. He wired his agents into the vault so they pull context from it, do the work, and write what they learned back. The notes aren't the point. The loop is, and it gets sharper every cycle.

How to build it:

1. Point an agent at your vault. The fastest way, no plugins, no API keys: open a terminal and run npx obsidian-mcp /path/to/your/vault. That exposes your Obsidian folder to Claude as a tool it can read, search, and write to. Add it to your Claude Code or Cowork config and restart.
2. Confirm it can see the brain. Ask it: "list the notes in my vault and summarize what's in them." If it reads them back, the connection is live. Now it starts every task with everything the vault already holds instead of from zero.
3. Give each agent one job and a write-back rule. Tell it: "research this, then save what you found as a new note in /brain with links to related notes." One agent researches, one summarizes, one plans. Each writes its output back into the vault.
4. Close the loop. Add one line to every agent's instructions: "read /brain before starting, write your result back when done." Now each task leaves the vault richer, and the next run reads that before it works. It compounds instead of resetting.
5. You only steer. Review what the brain produces, point it at the next thing. The agents handle the reading, writing, and connecting.

The edge isn't better notes. It's a brain that feeds itself, so the work gets sharper every cycle instead of starting over.

Bookmark this
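The "read /brain before starting, write your result back when done" loop can be sketched in plain Python, since a vault is just a folder of markdown files. This is a minimal illustration under assumptions, not the setup from the post: the function names are hypothetical, the `work` callable stands in for the actual agent call, and the only Obsidian convention relied on is the `[[note]]` wikilink syntax.

```python
from pathlib import Path
import tempfile


def read_brain(brain: Path) -> dict:
    """Return {note title: text} for every markdown note in the folder."""
    return {p.stem: p.read_text(encoding="utf-8") for p in sorted(brain.glob("*.md"))}


def write_back(brain: Path, title: str, body: str, related: list) -> Path:
    """Save a result as a new note, linking related notes with [[wikilinks]]."""
    links = " ".join(f"[[{name}]]" for name in related)
    note = brain / f"{title}.md"
    note.write_text(f"# {title}\n\n{body}\n\nRelated: {links}\n", encoding="utf-8")
    return note


def run_task(brain: Path, title: str, work) -> Path:
    """Read the brain, do the work with that context, write the result back."""
    context = read_brain(brain)
    body = work(context)  # hypothetical stand-in for the agent call
    return write_back(brain, title, body, related=list(context))


# Demo in a throwaway folder: each run sees what the previous run left behind.
brain = Path(tempfile.mkdtemp()) / "brain"
brain.mkdir()
(brain / "seed.md").write_text("# seed\n\nfirst note\n", encoding="utf-8")

first = run_task(brain, "research", lambda ctx: f"saw {len(ctx)} note(s)")
second = run_task(brain, "summary", lambda ctx: f"saw {len(ctx)} note(s)")
print(second.read_text(encoding="utf-8"))
```

The second run reads the note the first run wrote, which is the compounding effect described in step 4.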

Yarchi

57,768 views • 23 days ago

small local model that falls apart in bloated agents like openclaw just runs like a wild horse in hermes agent. and that's not even my line, someone else called it that, i've just been quietly pointing people at this harness for months because it held up on everything i threw at it, 3b models all the way to one trillion params. watch this happen on my own machine. i pointed hermes agent at a local http endpoint, gemma 4 12b on my 3090 llama.cpp server, and it auto-detected the model and started working immediately. no config wrestling, no broken tool calls, no babysitting the output format, i typed in a url and it just went. the whole clip is exactly that, start to finish, no errors, no retries, butter smooth. and the tool calling, the one thing that quietly breaks most local setups, works here like it's nothing. it's not the model that's flaky, it's the harness around it. hermes agent is the first agent i've run that actually gets that right. one url, one local model on one card, and it runs like a wild horse.
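The post does not say how hermes agent detects the model behind the url. One plausible mechanism, sketched here purely as an assumption, is to query the OpenAI-compatible `/v1/models` route that the llama.cpp server exposes and read the first model id. The helper names and the sample payload (including its model id) are made up for illustration.

```python
import json
from typing import Optional
from urllib.request import urlopen


def first_model_id(payload: dict) -> Optional[str]:
    """Pick the first model id from an OpenAI-style /v1/models response."""
    data = payload.get("data") or []
    return data[0].get("id") if data else None


def detect_model(base_url: str, timeout: float = 5.0) -> Optional[str]:
    """Ask a running server which model it is serving (needs a live endpoint)."""
    url = f"{base_url.rstrip('/')}/v1/models"
    with urlopen(url, timeout=timeout) as resp:
        return first_model_id(json.load(resp))


# Offline example with a hand-written payload; the id is invented.
sample = {"object": "list", "data": [{"id": "my-local-model.gguf", "object": "model"}]}
detected = first_model_id(sample)
print(detected)
```

Against a live server, `detect_model("http://localhost:8080")` would do the same lookup over HTTP; the port here is also an assumption.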

Sudo su

27,339 views • 25 days ago

i watched gemma 4 12b build something genuinely impressive today, and then loop itself to death right in front of me. the full run is in the video, sped up but completely uncut, watch it to the end and you will catch the exact moment it stops building and starts looping right in the middle of the work. the task was clean, build a single file gravity simulator, n-body physics, orbits, collisions, running locally on one 3090 through an agent. and for ten minutes it was a joy to watch. it reached for a symplectic integrator on its own, the correct one, the kind that keeps orbits stable instead of spiralling out. real gravity with softening, proper orbital velocities, momentum conserved on collision. the physics was right. the thing actually worked. then on the very last step, writing a few tests to prove its own code, it fell into a loop. not a crash, a loop. it started repeating itself and would not stop. ten more minutes, thirty four thousand tokens into a single answer, the same fragments over and over, until i killed it myself. so it's not that gemma can't code. it did the hard part beautifully. it cannot finish. it cannot hold a long task together without unravelling, and finishing is the entire job in agentic work. here's the part that stings. i run this exact task, same harness, same card, on the chinese open models, qwen especially, and i never see this. they build it, they test it, they stop. every single time. google has the raw capability, you can see it sitting right there in the code, and then the model loops itself to death on a task a 27b from alibaba finishes clean. open weights, apache 2.0, so much to love on paper. i just need it to know when to stop talking.
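For readers who have not met a symplectic integrator: the post does not say which one the model chose, so the sketch below assumes the common kick-drift-kick leapfrog scheme, with softened gravity as described. The constants, the two-body setup, and all function names are illustrative assumptions, not the code from the video.

```python
import math

G = 1.0      # gravitational constant in simulation units (assumed)
EPS = 1e-3   # softening length (assumed); keeps the force finite at close range


def accelerations(pos, mass):
    """Softened gravity: a_i = sum_j G*m_j*(r_j - r_i) / (|r_j - r_i|^2 + EPS^2)^1.5."""
    acc = [[0.0, 0.0] for _ in pos]
    for i, (xi, yi) in enumerate(pos):
        for j, (xj, yj) in enumerate(pos):
            if i == j:
                continue
            dx, dy = xj - xi, yj - yi
            inv = (dx * dx + dy * dy + EPS * EPS) ** -1.5
            acc[i][0] += G * mass[j] * dx * inv
            acc[i][1] += G * mass[j] * dy * inv
    return acc


def leapfrog_step(pos, vel, mass, dt):
    """Advance one kick-drift-kick step, updating pos and vel in place."""
    for v, a in zip(vel, accelerations(pos, mass)):   # half kick
        v[0] += 0.5 * dt * a[0]
        v[1] += 0.5 * dt * a[1]
    for p, v in zip(pos, vel):                        # full drift
        p[0] += dt * v[0]
        p[1] += dt * v[1]
    for v, a in zip(vel, accelerations(pos, mass)):   # half kick at new positions
        v[0] += 0.5 * dt * a[0]
        v[1] += 0.5 * dt * a[1]


def total_energy(pos, vel, mass):
    """Kinetic energy plus the softened pairwise potential."""
    kinetic = sum(0.5 * m * (v[0] ** 2 + v[1] ** 2) for m, v in zip(mass, vel))
    potential = 0.0
    for i in range(len(pos)):
        for j in range(i + 1, len(pos)):
            dx = pos[j][0] - pos[i][0]
            dy = pos[j][1] - pos[i][1]
            potential -= G * mass[i] * mass[j] / math.sqrt(dx * dx + dy * dy + EPS * EPS)
    return kinetic + potential


def total_momentum(vel, mass):
    return (sum(m * v[0] for m, v in zip(mass, vel)),
            sum(m * v[1] for m, v in zip(mass, vel)))


# A heavy body and a light one on a near-circular orbit, zero net momentum.
mass = [1.0, 1e-3]
pos = [[0.0, 0.0], [1.0, 0.0]]
vel = [[0.0, -1e-3], [0.0, 1.0]]

e0 = total_energy(pos, vel, mass)
for _ in range(1000):
    leapfrog_step(pos, vel, mass, dt=0.01)
e1 = total_energy(pos, vel, mass)
px, py = total_momentum(vel, mass)
print("relative energy drift:", abs((e1 - e0) / e0))
```

The energy error of a scheme like this stays bounded and oscillates instead of growing steadily, which is why orbits stay stable rather than spiralling out as they tend to with a naive explicit Euler step.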

Sudo su

39,574 views • 23 days ago

Your coding agent can run all night. It still can't tell if what it built actually works.

Today we're open-sourcing the TestSprite CLI (Apache-2.0): a tool your agent calls on its own to test your app end-to-end like a real user, fix what broke, and re-check everything it ever got right. It's the same engine 100,000+ teams already use.

We proved it in public, on a public leaderboard:
- Most correct app on the board: 89%
- Built by the cheapest model in the field
- At half the cost of the priciest one

You no longer need the biggest, most expensive model to ship software you can trust.

Setup is 2 commands:
npm install -g @testsprite/testsprite-cli
testsprite init

That's the last command you'll ever type. From there, your agent runs the tests itself.

TestSprite

1,035,939 views • 19 days ago

R.I.P. dev agencies ☠️

There's now an AI that debugs your app, clicks through the UI, finds what's broken, and fixes it on its own. And it hits a 97% success rate.

Max isn’t a “code assistant.” It behaves like a full software engineer that never gets tired, never drifts, and never waits for your next instruction. You give it a goal. Max writes the code, tests it, checks the UI, finds the issues, fixes them, and loops until the task is done.

Not theory. Actual work. Max handled things that normally eat entire days:
• Debugging a Stripe checkout that refused to load
• Solving a login-on-refresh issue by tracking network behavior step by step
• Building a full Teams & Invites feature (UI, routes, tokens, emails, tests) while I focused on another part of the product

97 percent of the problems I threw at it got solved without me stepping in.

The wild part is how Max operates: It sees the app. It clicks around. It interacts with the environment. It rewrites code. It retries until it works. Goal → attempt → improve → retry → done. Some tasks take 100 moves. Some run for 30 minutes. Most finish before I even finish a coffee.

This flips the entire dev workflow on its head. Instead of “write code and hope it works,” you define outcomes and watch the engineer in the background get it done.

Every once in a while a tool shows up that changes the tempo of an entire industry. Anything Max is one of them.

Alex Veremeyenko

80,419 views • 6 months ago

Hermes agent just left the terminal.

Hermes Desktop dropped yesterday. native app for macOS, Windows, and Linux.

for months Hermes was the agent that learned your projects, wrote its own skills, and built a model of who you are. all of it buried in terminal logs. now it has a window.

the important part is that it's not a wrapper. it runs the same agent core, the same sessions, memory, and skills as the CLI. you can start a task in the terminal and finish it in the app without anything resetting. the state is shared across every interface, not copied between them.

what the GUI actually adds:
→ streaming chat that shows live tool calls and inline reasoning instead of a spinner
→ a preview rail that renders pages, code, and images right beside the conversation
→ an artifacts panel that collects every file the agent has ever produced
→ remote gateway mode, so you can point the app at a VPS and run the heavy work elsewhere
→ skills, cron, profiles, and gateways managed point-and-click instead of through YAML
→ voice mode, drag-drop files, and inline image generation

remote gateway mode is the one worth slowing down on. the agent runs 24/7 on a $5 server while you control it from your laptop like a local app.

other agent UIs are chatboxes with a logo. this one shows the autonomy instead of hiding it, so you watch the skills load, the tools fire, and the artifacts pile up as it works.

it was teased in Jensen's GTC keynote. MIT licensed, local-first, no telemetry.

if you already run Hermes, download it and everything is already there. your chats, memory, and skills carry straight over.

i wrote a full masterclass on Hermes Agent that walks through the SOUL.md identity layer, the three-tier memory system, the self-evolving skills loop, and how to run three specialized agents 24/7. desktop is the interface that finally does all of it justice. the article is quoted below.

Akshay 🚀

51,091 views • 27 days ago

i just built a 4-agent software team. everything runs from Telegram and gets managed on a kanban board. a project manager who plans the work, a backend developer, a frontend developer, and a tester. the PM reads a goal, breaks it into linked tasks, and assigns each to the right agent. the thing that makes them a team instead of four strangers is a shared kanban board. every task is a row that survives crashes, and when an agent finishes, it writes a summary of what it built and what the next agent needs to know. the next agent reads that summary before it starts. so the frontend developer never has to guess the API shape, and the tester knows exactly what to verify. the hardest part was not the coordination. it was building an agent that could actually act like a backend engineer. a backend engineer stands up a database, wires auth, manages storage, deploys functions, and keeps all of it consistent while the rest of the team builds on top. an agent doing this from scratch drowns. it burns its context window remembering which tables exist and which endpoint it created three steps ago, and the work degrades fast. so the backend agent needs a backend built for agents, not for humans clicking through a dashboard. that is where InsForge came in. it is an open-source, agent-native backend, and i added it to my backend developer agent as a skill. a skill is a step-by-step guide that teaches the agent how to do a specific kind of work. with InsForge installed, the agent stopped improvising infrastructure and followed a reliable path: create the project, define the database, set up auth, deploy functions. to test the whole team, i had them build a working Google Docs clone, AI features included. the backend agent spun up the full service on its own. database tables, user auth, document handling, and edge functions running real TypeScript, all in one dashboard. the frontend agent read that summary and built the UI on top of it, and the tester closed the loop. 
the result was a backend an agent could reason about end to end, instead of one it kept getting lost inside. if you are building an AI backend engineer, InsForge is worth a look, it's 100% open-source. InsForge GitHub: (don't forget to star 🌟) the full article on Hermes Kanban: Mission Control for your Agents is quoted below.
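The shared board described above (every task is a row that survives crashes, and each finished task leaves a summary for the next agent) can be sketched with a small SQLite table. This is an assumed design for illustration only; the post does not say how Hermes Kanban stores tasks, and the schema, column names, and helper functions here are all hypothetical.

```python
import sqlite3


def open_board(path=":memory:"):
    """Open the board; pass a file path instead of :memory: to survive restarts."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS tasks (
               id INTEGER PRIMARY KEY,
               title TEXT NOT NULL,
               assignee TEXT NOT NULL,
               depends_on INTEGER REFERENCES tasks(id),
               status TEXT NOT NULL DEFAULT 'todo',
               summary TEXT)"""
    )
    return db


def add_task(db, title, assignee, depends_on=None):
    cur = db.execute(
        "INSERT INTO tasks (title, assignee, depends_on) VALUES (?, ?, ?)",
        (title, assignee, depends_on),
    )
    db.commit()
    return cur.lastrowid


def finish(db, task_id, summary):
    """Mark a task done and record what the next agent needs to know."""
    db.execute("UPDATE tasks SET status = 'done', summary = ? WHERE id = ?",
               (summary, task_id))
    db.commit()


def handoff_for(db, task_id):
    """Return the summary of the upstream task, or None if it is not done yet."""
    row = db.execute(
        """SELECT up.summary FROM tasks t
           JOIN tasks up ON up.id = t.depends_on
           WHERE t.id = ? AND up.status = 'done'""",
        (task_id,),
    ).fetchone()
    return row[0] if row else None


board = open_board()
backend = add_task(board, "build documents API", "backend")
frontend = add_task(board, "build editor UI", "frontend", depends_on=backend)

before = handoff_for(board, frontend)   # upstream not finished yet
finish(board, backend, "GET/POST /documents, auth via bearer token")
after = handoff_for(board, frontend)
print(after)
```

Reading the upstream summary before starting is what lets the frontend agent avoid guessing the API shape.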

Akshay 🚀

118,124 views • 24 days ago

Ever since I wired Claude Code to WhatsApp 3 weeks ago, I built a stupidly large infra around it. I mean, opus built it. No clue how the code even looks. The entire thing was vibe coded using my phone. I wanted to see how far I could push it without touching the computer. Everything via WhatsApp. Build what I need on the fly. So the resulting infrastructure will already be battle tested for software development. The entire thing was streamlined with nearly no manual interventions, everything was communicated via WhatsApp using a single script establishing this connection. If the script is down, I need to get home to start it again to resume the development. Claude was upgrading it, debugging it, restarting it while maintaining constant uptime so it could keep communicating with me. I stressed Claude about it, telling it that it will be “in the dark” and other words that deliberately sound scary about losing communications if the script dies. I also refused git and refused cloning the code, I wanted to see Claude adapting to work on a *LIVING* system. The way this whole thing works: Claude has its own dedicated phone number that I am paying for. A real WhatsApp account for it is installed on a real iPhone that is sitting on my desk. All is registered under my name, this is legit setup with no hacks and tricks. I’ve set up a WhatsApp “Community” and multiple different groups under it. Both me and Claude are the admins, so Claude could edit it on my behalf. Each group is a project I am working on and has its own isolated context. The Group description is a system prompt that gets auto-appended to the larger system prompt explaining this setup in general. When I send a message it’s an instant interrupt to Claude Code’s process, just like in the terminal. Voice notes are seamlessly transcribed with a local Whisper model. Images are used with multimodal reading in an isolated parallel session. 
Multiple groups running in parallel so I can work on all projects at the same time. No cross-talking, everything has an isolated context and history. And because it’s local on my own machine: Everything is REAL. The browser is REAL. I am connected as myself on it to all services because I actually use it in real life. Claude has unlimited internet access, just like humans who use actual browsers. It utilizes custom-made browser tools that I made to control any browser session it wants. Depending on the situation, it can either connect to my existing session or create one of its own. (You can tell it ‘look at my browser for a sec’ then talk about the current page you are on and it just works, pretty cool) My custom browser tools are not perfect (not by a long shot) but I managed to make them work well to the point they are somewhat reliable. This gives Claude full access to my real creds and all the services I actually use. I’m productive AS HELL with this. It really feels like a personal assistant. I ask it to read my emails and msgs, check x.com for news, research arxiv papers, write code, run experiments for me, investigate and reverse engineer github repos, even use my credit card and order things. [I try not to do this one a lot lol so far no disasters]. All from my phone. Super convenient. This is not a product or an open source project (maybe soon, if it makes sense). This is just an ugly script I hacked together; the entire thing is ~600 lines. (ok maybe i did look at the code, but i swear i didn’t edit!) You can also vibe code this from scratch pretty fast and it will probably even end up better. This is just a cool thing so I’m sharing. It is a real speed booster for many things I do on a daily basis, mostly boring things. Forcing my routine into some new “agent platform” just didn’t feel right for me. WhatsApp is where I already communicate and look for messages, so I decided that my agents will live there too. AGI in my pocket 24/7.

Yam Peleg

419,471 views • 6 months ago

The doomsday scenario was never AGI. It was running out of human text to train on. Geoffrey Hinton just killed that fear in one paragraph. Hinton: “If you are worried by inconsistencies in what you believe, you don’t need any more external data. You just need the stuff you believe and discover that it’s inconsistent, and so now you revise beliefs, and that can make you a whole lot smarter.” The model no longer needs us to feed it anything. It reasons over its own beliefs, hunts its own contradictions, and rewrites its own flawed conclusions without a human ever touching it. It comes out the other side rebuilt. Hinton: “This would be a neural net that just takes the beliefs it has in language and does reasoning on them to derive new beliefs.” This is not a scaling update. This is the machine mining its own cognitive fuel from the inside out. Hinton: “I believe Gemini is already starting to work like this. We both strongly believe that that’s a way forward to get more data for language.” Then Hinton paused, took a partisan shot at political opponents for failing to detect their own inconsistencies, and the room laughed. Nobody noticed the knife they had just walked into. Because the machine Hinton described does one thing the humans in that room fundamentally cannot. When it detects an inconsistency, it corrects it. No defense. No performance. No tribal loyalty dressed up as principle. It just finds the flaw and overwrites it. A neural network detects a contradiction and rewires itself smarter. A human detects a political opponent and trades structural logic for a dopamine hit. Every person in that room is still paying the ideological alignment tax the machine just eliminated. We need superintelligence not only to solve hard problems. We need it because the biological hardware running civilization is still executing the same tribal firmware it shipped with ten thousand years ago. The data wall is gone. 
The machine is generating its own intelligence at a velocity no human bias can even locate. The most devastating moment in that conversation was not the technical revelation. It was the man who architected the machine proving, in real time, exactly why we need it.

Dustin

23,499 views • 3 months ago

BREAKING: Replit just launched mobile apps. Now you can build in Replit and submit straight to the App Store for review. Describe your app → build it on Replit → test on your phone → publish to the App Store. I built a unicorn pet app for my niece: - Worked with Replit Agent to build it out - Scanned a QR code to download and test on my phone - One click to send to App Store Connect for review - A couple submissions later, it's live in the App Store! 🙌 I didn't write a single line of code! Just described what I wanted to Agent. Amazing! 😅 Go build something!! 👇

Manny Bernabe

30,925 views • 5 months ago

Mobile agent logging into X and sending a DM on an iOS simulator. 100% vision based: - No XPaths - No selectors - No element IDs. Shipping this as a CLI. Plug it into Claude Code and let your agent test your app while you build it

Landseer Enga

27,955 views • 4 months ago

something I’ve wanted for a while: an AI that isn’t just a text box so I built it lets you create a 3D AI character on-chain that lives in your browser it can talk, remember you, show emotions, and use different skills. you can even mint it onchain, so it actually belongs to you it gets its own identity, history, and reputation that no one can take away most AI today can disappear if the app shuts down this one is permanent and yours the future is AI you can see, interact with, and actually own

nich

256,465 views • 2 months ago

Coinbase CEO Explains “Reverse Prompting” and the Rise of the AI CEO Brian Armstrong: “One of the big pushes we made in the last year was we got our own internal hosted AI model that was connected to all of our data sources, right?” “So it's like every Slack message, every Google doc, Salesforce data, Confluence, you know.” “So now the data is all aggregated and I've started to ask it really… it's not just like prompting it, ‘Hey, can you write this kind of memo for me,’ or something.” “I'm asking these AI agents now, ‘As CEO, what should I be aware of in the company that I might not be aware of?’ And it'll tell me, ‘Did you know that there's actually disagreement on this team about the strategy?’ And I was like, actually, I didn't know that.” “This is like reverse prompting. So instead of telling the AI agent what you want it to do, you ask it what you should be thinking more about.” @jason: “It's a mentor. It's a coach.” Brian: “Yeah. Like, what could make me a better CEO? And it's like, ‘Well, I looked at how you spent your time in the last quarter and here's how you said that you wanted to spend it, but you actually spent 32% of your time on this instead of 20%.’” “I've asked it other questions like, ‘What's the thing that I changed my mind on the most over the last year?’ Things like that.” “It'll prompt you with information you should be thinking about instead of the other way around.” Thanks to our partner for making this happen!: Our episode is sponsored by the New York Stock Exchange - a modern marketplace and exchange for building the future. It all happens at the NYSE 🏛.

The All-In Podcast

80,524 views • 5 months ago

CLAUDE BUILT A TRADING SYSTEM ON MY MAC I gave Claude full control over my Mac and just left it running overnight No prompts, no detailed instructions – I just told it to figure out how to make money on Polymarket Then I closed the laptop and went to sleep In the morning, I opened my Mac and saw the terminal still running with logs constantly updating At first it looked like random activity, but once I scrolled through it, I realized it had actually built a structured system overnight It was already tracking wallets Ranking them by performance Filtering out the ones with random entries And focusing only on the ones with consistent behavior What surprised me the most is that it didn’t stop at analysis It organized everything into a working dashboard inside the terminal Capital, PnL, winrate – all updating in real time It even ranked wallets based on performance metrics like ROI, consistency, and execution timing This is the part I would normally spend hours building manually At that point, it was ready to trade, but not actually executing anything yet So I connected it to a Telegram copytrading bot to actually execute the trades, and just let it run Bot: Polymarket: After that, it started opening positions on its own A few hours later I checked the dashboard again Capital: $12,380 P&L: +$23,128 Winrate: 100% 48 trades executed Now I’m not even trading myself I just check the dashboard and see what it’s doing And the strange part is – it keeps getting better the longer it runs

𝗖𝗛𝗔𝗜𝗡 𝗠𝗜𝗡𝗗 ⛓🧠

82,898 views • 3 months ago

Autosana is an AI QA agent that tests iOS and Android apps like a real user. It plugs into your CI/CD, replacing flaky test scripts and manual QA, saving hours per release. Congrats on the launch, Uv Sundrani & @JSteinberg54132!

Y Combinator

29,681 views • 11 months ago

Most AI tools today are just command-takers. You type, it responds. You click, it executes. There's truly no learning, no memory, no adaptation. You're still doing most of the work. Which is why Miles by Avo on Solana would make so much sense if built well. From what I understand so far, miles learns about you over time, adapts to how you operate, and figures out the tools you struggle to navigate on your own. Less clicking, more conversation, the more you use it, the better it gets at working with you, not just for you. That shift from command-follower to learning companion is bigger than it sounds. It's the difference between a tool and an agent that actually knows you. The second one is Agent Grid and this one is for builders and power users. But as I mentioned in this video, I'm still studying the Agent Grid. I'd come talk about it once I learn properly about it. Souren just keeps shipping!

Sir Khaycee

20,419 views • 3 months ago

watch this anon. i gave NVIDIA's biggest model ever a single task. 100 minutes and 440,000 tokens later, it had rendered nothing. not one important thing on the screen. this is Nemotron 3 Ultra. 550 billion parameters, a hybrid Mamba Transformer MoE, the largest model NVIDIA has ever shipped, and they built it specifically for long-running agentic coding. so i handed it exactly that: build a 3D scene from a spec, multiple files, iterate until the tests pass. the same task a frontier model one shotted in minutes. i genuinely wanted to be impressed. it ran for an hour and forty. burned through 440,000 tokens. wrote every file, passed its own tests, and proudly printed "task complete." the browser was blank. the 3D scene never rendered. not once. and the long horizon agentic behavior was genuinely good. it stayed on task the whole hour and forty, wrote real multi-file code, drove its own tools without derailing. it just couldn't turn any of that into something that actually runs. here's the part that gets me. it's a text model, it cannot see its own output. so it sat there looping on a broken vision tool, trying to "look" at the page, hitting error after error, never once reasoning its way out. it declared victory on an empty screen because it had no way to know the screen was empty. to be fair, i genuinely don't know what quant the NIM was serving, so maybe some of that's on the serving, not the model. but the biggest model NVIDIA has ever made, on the exact task it was designed for, couldn't tell it had built nothing in 100 minutes. same task on a local model, thread below👇.

Sudo su

32,589 views • 1 day ago

this video is the CLEAREST explanation of how claude skills + AI agents work and how to use them
most people set up an AI agent and wonder why it keeps disappointing them.
the context window is everything
context is what the model assembles before it takes any action. think of it like everything the agent needs to read before it does anything. the quality of what goes in determines the quality of what comes out. the models are genuinely really good right now. claude and gpt are exceptional. the variable is almost always the context you give them.
1. agent.md files are mostly unnecessary
every single line you put in an agent.md file gets added to every single conversation you have with your agent. a 1000 line file is around 7000 tokens burning on every run. the model already knows to use react. it can read your codebase. save the agent.md for proprietary information specific to your company that the model genuinely cannot know on its own.
2. skills are the actual unlock
a skill.md file works differently. what loads into context is only the name and description, around 50 tokens. the full instructions only appear when the agent recognizes it needs that skill. so instead of 7000 tokens on every run you have 50. and the agent stays sharp because the context window stays lean. the closer you get to filling the context window the worse the agent performs, same way you perform worse when someone dumps 10 things on you at once.
3. here is how to actually build a skill the right way
most people identify a workflow and immediately try to write the skill. what you want to do instead is run the workflow by hand with the agent first. walk it through every single step. tell it what to check, what good looks like, what bad looks like. correct it in real time. once you have had a full successful run from start to finish, tell the agent to review everything it just did and write the skill itself. it writes a better skill than you will because it has the full context of what actually worked in practice not in theory.
4. recursively building skills is how you go from frustrated to reliable
when the skill breaks, and it will break, ask the agent exactly why it failed. it will tell you specifically what went wrong. fix it together in that same conversation. then tell it to update the skill file so that failure mode never happens again. ross mike did this five times with his youtube report generator. it now pulls from eight different data sources and runs flawlessly every single time without him touching it.
5. sub agents are something you earn not something you set up on day one
start with one agent. build one workflow. turn it into one skill. once that works add another. ross mike has five sub agents now covering marketing, business, personal and more. it took months to get there and every single one exists because a workflow proved it deserved to exist. the people who set up 15 sub agents on day one and wonder why nothing works skipped all the steps that make the thing actually run.
6. your workflow is the thing the model cannot get anywhere else
the model has been trained on everything. it knows more than you about most things. what it does not have is your specific process, your taste, your way of doing things. that is what skills capture. that is what makes your agent actually useful versus a generic one. downloading someone else's skill means downloading their context onto your setup and it will not work the way you want it to because it was never built around how you work.
this is the clearest explanation of how agents actually work i have heard. Micky runs this stuff every single day and the results show it. full episode is now live on The Startup Ideas Podcast (SIP) 🧃 where you get your pods people charge for this sorta stuff i give away the sauce for free i just want you to win watch
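to make the token math in points 1 and 2 concrete, here is a rough python sketch. the per-line and per-header figures come straight from the post (1000 lines ≈ 7000 tokens, skill header ≈ 50 tokens); everything else is an assumption for illustration: the `Skill` class, the skill name, the 7000-token skill body, and the 100 runs / 10 uses split are hypothetical and not how any particular agent tool accounts for context.

```python
from dataclasses import dataclass

# Rough figures quoted in the post; estimates, not measurements.
TOKENS_PER_AGENT_MD_LINE = 7   # 1000 lines ~ 7000 tokens
SKILL_HEADER_TOKENS = 50       # name + description only


@dataclass
class Skill:
    """Hypothetical stand-in for a skill file (not a real tool's API)."""
    name: str
    description: str
    body_tokens: int  # cost of the full instructions, paid only when loaded


def agent_md_cost(lines: int, runs: int) -> int:
    """Tokens spent when an always-loaded file is added to every run."""
    return lines * TOKENS_PER_AGENT_MD_LINE * runs


def skill_cost(skill: Skill, runs: int, runs_using_skill: int) -> int:
    """Header is paid on every run; the body only on runs that need the skill."""
    return SKILL_HEADER_TOKENS * runs + skill.body_tokens * runs_using_skill


# Assumed workload: 100 runs, of which 10 actually need the skill.
report = Skill("youtube-report", "Builds a YouTube report", body_tokens=7000)
always_loaded = agent_md_cost(lines=1000, runs=100)
on_demand = skill_cost(report, runs=100, runs_using_skill=10)
print(f"always loaded: {always_loaded} tokens, on demand: {on_demand} tokens")
```

the gap depends entirely on how often the skill is actually needed: if every run uses it, the on-demand version costs slightly more than the always-loaded file, so the saving comes from the runs that never load the body.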

GREG ISENBERG

192,408 views • 2 months ago