Karpathy's Agentic Engineering finally has proper tooling! (built by Google)

Karpathy defined agentic engineering as the discipline that separates production agent work from vibe coding. The core skills he listed were spec design, eval loops, and security oversight.

The problem has been that practicing this still requires a different tool for every phase:

- an editor for code
- a terminal for scaffolding
- a browser for testing
- a cloud console for deployment
- a separate framework for evals

Every transition is a context switch.

The solution to production-grade Agentic Engineering is now actually implemented in Google’s Agents CLI. It covers the entire workflow in one place for scaffolding, evaluating, and deploying ADK agents. One setup command injects 7 ADK-specific skills into a coding agent's context, which lets it handle scaffolding, evals, deployment, and enterprise registration through natural language.

I tested this end-to-end by building a RAG agent from scratch using Claude Code. It scaffolded the full project from the ADK agentic_rag template, generated 20 eval scenarios with LLM-as-judge scoring, and returned a quantitative scorecard. Finally, it also deployed everything to Agent Runtime and registered the agent to Gemini Enterprise, so the entire org can discover and use it.

The video below shows this in action, and I worked with the Google Cloud team to put this together.

Agents CLI GitHub repo → (don't forget to star it ⭐ )

I wrote up the full build covering all six steps from install to enterprise registration. It includes the eval scorecard, the instruction loophole the eval caught before deployment, and what the deployment process actually looks like end-to-end. Read it below.
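To make the "eval scenarios with LLM-as-judge scoring" idea concrete, here is a minimal sketch of how per-scenario judge scores could roll up into a scorecard. It is not the Agents CLI or ADK implementation: `Scenario`, `run_evals`, `toy_agent`, `toy_judge`, and the 0.7 pass threshold are all hypothetical names and values chosen for illustration, and the judge is a simple string check standing in for a real model call.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass
class Scenario:
    """One eval case: a prompt plus a reference answer (hypothetical shape)."""
    name: str
    prompt: str
    expected: str


def run_evals(
    scenarios: list[Scenario],
    agent: Callable[[str], str],
    judge: Callable[[str, str, str], float],
    threshold: float = 0.7,  # assumed pass mark, not taken from the post
) -> dict:
    """Run every scenario through the agent, score it with the judge,
    and aggregate the results into a small scorecard."""
    rows = []
    for s in scenarios:
        answer = agent(s.prompt)
        score = judge(s.prompt, answer, s.expected)
        rows.append({"scenario": s.name, "score": score, "passed": score >= threshold})
    return {
        "rows": rows,
        "mean_score": mean([r["score"] for r in rows]),
        "pass_rate": sum(r["passed"] for r in rows) / len(rows),
    }


# Stand-ins: a real setup would call the agent under test and a judge model.
def toy_agent(prompt: str) -> str:
    return "Paris" if "France" in prompt else "I don't know"


def toy_judge(prompt: str, answer: str, expected: str) -> float:
    return 1.0 if expected.lower() in answer.lower() else 0.0


scenarios = [
    Scenario("capital", "What is the capital of France?", "Paris"),
    Scenario("unknown", "What is the capital of Atlantis?", "no such place"),
]
scorecard = run_evals(scenarios, toy_agent, toy_judge)
```

Keeping the agent and judge as plain callables means the aggregation logic can be tested offline with stubs before a real model is wired in.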

Akshay 🚀

279,919 subscribers

242,901 views • 2 days ago •via X (Twitter)

Education Science & Technology

Related Videos

The Android team did it again. They released Android CLI: A command line tool to let agents build and interact with Android apps. This joins the Android Skills as a new initiative to make Android fully ready for the agentic era 🔥💯

Jorge Castillo

21,038 views • 2 months ago

Excited to launch a new way to upskill with AI agents. This is how we are making it possible for anyone to learn to build with coding agents.

To start, we are launching 4 new hands-on labs on the following topics:

- Agent Skills
- Agentic Image Generation
- 30 Days of Hermes Agents
- Prompt Engineering with Agents

I am confident that with our new DAIR.AI platform, anyone can learn to become a top AI builder by building and acquiring highly-demanded AI skills. And there is a lot more landing in the coming weeks.

elvis

17,141 views • 22 days ago

OpenAI's AgentKit will be so insane, build every step of agents on one platform. These visual agent builders make the whole process of iterating and launching agents far more efficient. It sits on top of the Responses API and unifies the tools that were previously scattered across SDKs and custom orchestration. It lets developers create agent workflows visually, connect data sources securely, and measure performance automatically without coding every layer by hand. The core of AgentKit is the Agent Builder, a drag-and-drop canvas where each node represents an action, guardrail, or decision branch. Developers can link these nodes into multi-agent workflows, preview results instantly, and version each setup. It supports inline evaluation so that developers can see how changes affect output before deploying. The Connector Registry is a single admin panel that manages how data and tools connect across the OpenAI ecosystem. It centralizes integrations like Google Drive, SharePoint, Dropbox, and Microsoft Teams. Large organizations can govern access and flow of data between agents securely under one global console. ChatKit provides a ready-to-use chat interface for embedding agents inside apps or websites. It manages streaming, message threads, and model reasoning displays automatically. Developers can skin the interface to match their product without writing custom front-end code. Under the hood, all these blocks use the same execution core that runs agent reasoning through OpenAI’s APIs. Workflows in Agent Builder compile down to structured instructions for the Responses API, which handles model calls, tool use, and context passing. Connector Registry handles authentication and routing for external tools, while Evals and RFT provide feedback loops that improve agents over time. This integration means developers no longer need to handle orchestration logic, model evaluation pipelines, or safety layers separately. 
Everything runs natively within OpenAI’s control plane with managed security, automatic versioning, and built-in testing. In short, AgentKit standardizes the entire life cycle of an AI agent—from visual design to deployment and performance tuning—inside a single unified system.

Rohan Paul

178,460 views • 8 months ago

ANTHROPIC JUST TURNED AI AGENTS INTO GIT REPOS

Anthropic shipped "ant" - a CLI that runs every Claude API endpoint straight from your terminal. The headline isn't the terminal access. It's that you can now version-control an AI agent as YAML in Git and have CI sync it to the Claude Platform, the same way you ship code.

- Every API resource is a subcommand: messages, models, files, agents, sessions
- Define an agent in a YAML file, check it into your repo, and keep it in sync with one update command
- Spin up a session, send it an event, then pull every event and tool call back from the same CLI
- Claude Code knows how to drive ant out of the box - it shells out and reads the results with no glue code

Agents just stopped being prompts you babysit and became infrastructure you deploy.

BuBBliK

200,080 views • 29 days ago

🚀 LangSmith for Startups Spotlight: Cogent Security

Cogent is building AI agents that protect the world's largest organizations from cyberattacks. One of the hardest problems in cybersecurity is going from finding a vulnerability to actually fixing it. Cogent is automating that entire process from end to end.

Cogent is already working with dozens of Fortune 1000 and Global 2000 enterprise customers such as major universities, hospitality brands, and consumer retailers.

Cogent uses LangSmith for production tracing and monitoring of their agents. Their team leverages execution traces for usage insight and use-case categorization, self-refinement loops to diagnose eval failures, and online evaluators to flag undesired behavior.

Join their team if you want to build frontier AI for mission-critical problems 🤝

LangChain

18,513 views • 3 months ago

HTML Artifacts are a big part of how I work with agents now. Artifacts can be more than just static files. When combined with agents, they can take action or help you take action. This unlocks all kinds of interesting ways to work with agents. This is clearly the future. Check out this writing and scheduler artifact I built in a few minutes. It uses a bit of HTML and JS. All the data is in markdown (Obsidian vaults), so the agent can access and modify it at any time. No DB needed. No sophisticated functionalities. The agent decides all that for me based on the skills, context, and memory it has access to. The best part about this simple stack is that all the important information stays with me. This has allowed me to build a recursive self-improving system and automations that can better tap into coding agents like Codex or Claude Code. I could have paid or built an entire app for scheduling posts, and there are so many of them out there. But I don't need to. I've realized a simple artifact does the job. And the simplicity of it is actually an advantage. Very little maintenance for very high returns on personalization, time, and efficiency. The other benefit of this is that I can add features as I please. That level of personalization feels magical, and we should all be pursuing more of it. All of this just keeps compounding. Of course, this example is just about writing. But I have similar artifacts for research, design, experimentation, evaluation, and so much more. And no, I didn't actually publish the post example I shared in the clip. It was just for demonstration purposes. I actually spend more time than this when writing together with agents. Lastly, having built my own agent orchestrator tool has made me realize that simplifying the tool stack is a superpower. If you are curious about how all this works, I will do a live session next week:

elvis

18,374 views • 1 month ago

Simplicity is at the heart of great software. This is one of the reasons why Claude Code has been sticky for me. As a builder, I love planning and brainstorming, and this is now a key focus of Claude Code. I use Shift + Tab a lot to cycle between brainstorming, planning, and execution. This functionality provides the appropriate interface for me to either be very involved or less involved as I please. This works particularly well when building out new and complex features or entire new projects. This saves a huge amount of time. It allows me to tune Claude Code to execute and build more effectively. It also builds a loop of trust, and I often (surprisingly) find Claude Code asking for clarifications when it's confused. Coding agents don't normally do that. I have shared before on the power of brainstorming with AI for longer times. Try it and you will not be disappointed. Vibe coding is fun, but pair it with intentional development cycles, and you watch how far you can take a project with coding agents today.

elvis

81,765 views • 8 months ago

warp code feels like a combination of a cli agent and cursor-style ux design

it's a cli that looks like an ide because it gives you:

- editor code view
- project explorer
- one-click to view command output
- switch between agent/cli
- context/credit spend tracking
- task lists
- shared context with warp drive

there is a learning curve because it's a different workflow, but the agent was top of terminal bench until recently and i can see why

would love to see them add:

- subagents
- an agent sdk
- sidebar fonts increasing with cmd +/-

not being paid to post this, btw (feel like i have to add that these days 😉)

i have been using warp for a long while as a terminal and sometimes agent on the $15/mo plan

Ian Nuttall

32,665 views • 8 months ago

Replit, Vercel, and OpenAI have built very cool agent-native applications, but nobody else has passed the demo stage.

Building agents that work is complex. Teams aren't shipping agents because we don't have good tooling yet (and most of us don't know how to do this well).

A couple of days ago, the CopilotKit🪁 team announced a collaboration with . You can now use LangGraph with CoAgents to build agent-native applications, and here is everything you need to know about that:

CoAgents is fully open-source, and you can use it to do the following:

• Human-in-the-loop to steer and correct the agent
• Stream intermediate agent state
• Real-time state sharing between the agent and the application
• Agentic generative UI to build trust that the agent is on the right path

Star this GitHub Repository:

Thanks to the team for giving me early access and collaborating with me on this post.

Santiago

63,071 views • 1 year ago

Devin for Terminal is the first CLI agent with its own dedicated virtual machine. You can start a project locally and hand it off to Devin to test it end-to-end. Devin for Terminal is super performant, and it works with all frontier models. Try it out today.

Cognition

3,744,423 views • 2 months ago

Devin for Terminal is the first CLI agent with its own dedicated virtual machine. You can start a project locally and hand it off to Devin to test it end-to-end. Devin for Terminal is super performant, and it works with all frontier models. Try it out today.

Cognition

7,009,239 views • 2 months ago

Very pleasantly surprised to discover Cursor cloud agents can playtest the godot game I built. See the (sped up) video below of the agent playtesting the game.

As I was watching it play the game, I could see the agent slowly learn how the game works and familiarise itself with the game's UI. I also realised that the agent is a very 'safe' player, choosing to play very safely and retreating from a battle if it foresees it can't win. Very interesting to see. I wonder if I could simulate different game playtester behaviours that mimic different types of real-world player archetypes.

With agentic playtesting, the agents are able to provide actual gameplay feedback and suggestions to improve the game, having played the game themselves. This unlocks a whole lot of possibilities for AI-assisted game dev, since it closes the playtest loop.

This feels like the future of recursive game development, where agents can now recursively build > playtest > improve the games they are working on.

Thanks edwin for letting me know that these agents can actually playtest games, not just software! Very excited to dig deeper to see what I can do with these agents with computer access!

Danny Limanseta

52,093 views • 4 months ago

THIS DEVELOPER USED OPENCLAW AGENTS TO RUN HIS B2B BUSINESS VIA TELEGRAM AND MADE $15,000/MONTH

he doesn't write prompts from scratch or use generic browser interfaces. he runs a multi-agent framework through a mobile chat. the agents write code, test deployments, and update sites in real-time while he just hits approve

the setup is straightforward:

- spin up Coolify on a free cloud instance to host your own self-hosted agent panels
- link the agent loop to a Telegram gateway to approve code edits from your phone
- deploy specialized skill files directly to limit token waste and context decay
- containerize the terminal execution using Docker to prevent security breaches

if you are still running local agents without container safety, you are leaving money on the table. read the 30-day battle between OpenClaw and Hermes Agent to see who actually wins in production

Full breakdown and migration playbook ↓

marfin

26,654 views • 20 days ago

Finally redesigned my personal site and went with a warmer, more personal aesthetic I turned the entire CSS spec and Apple’s Human Interface Guidelines into hundreds of Agent Skills, fed it to Claude Opus 4.5 + my direction and it implemented the redesign for me

Jane Manchun Wong

134,699 views • 5 months ago

Claude Code Scheduled Tasks is now available... here's a solid idea to connect it with Telegram

Save this so you don't forget to set it up!

First, ask Claude to add a simple Telegram messaging module to your repo. You can use the Telegram Bot Builder Skill from Link:

Install command: npx claude-code-templates@latest --skill enterprise-communication/telegram-bot-builder

Once the module is in your project, grab your bot credentials from BotFather and add the bot ID to your .env file

That's it! ✅

Now every Scheduled Task you create should end with an instruction for Claude to send the task result to Telegram using that module. Claude will handle the delivery automatically on every task it runs
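For a sense of what such a messaging module might boil down to, here is a minimal sketch built on the Telegram Bot API's `sendMessage` method. It is not the code the Telegram Bot Builder Skill generates. The function names and the `TELEGRAM_BOT_TOKEN` and `TELEGRAM_CHAT_ID` environment variable names are assumptions for illustration, and the sketch assumes the values from your .env file have already been exported into the environment.

```python
import json
import os
import urllib.request


def build_send_message_request(token: str, chat_id: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a Bot API sendMessage request."""
    url = f"https://api.telegram.org/bot{token}/sendMessage"
    payload = json.dumps({"chat_id": chat_id, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_task_result(text: str) -> dict:
    """Send a task result to Telegram.

    Reads credentials from the environment; the variable names below are
    hypothetical and should match whatever your .env file defines.
    """
    token = os.environ["TELEGRAM_BOT_TOKEN"]  # token issued by BotFather
    chat_id = os.environ["TELEGRAM_CHAT_ID"]  # chat that should receive the message
    request = build_send_message_request(token, chat_id, text)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

Splitting request construction from sending keeps the network call in one place, so the payload can be checked without contacting Telegram.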

Daniel San

91,123 views • 3 months ago

Claude Code + Google Stitch 2.0 is f*cking cracked 🤯

Google just dropped a free AI design agent that solves Claude Code's biggest weakness: frontend design. One screenshot of a high-converting landing page → a production-ready site for your brand in minutes. All inside Google Stitch + Claude Code.

Perfect for DTC brands and agencies who are building advertorial pages and product launch pages for Meta but burning days on designer back-and-forth.

If you're running Meta ads and need 5-10 different landing pages testing different hooks, angles, and offers (each one targeting a different audience and pain point), you know the bottleneck isn't the ads. It's the pages. Briefing designers, waiting for revisions, paying $2-5K per page.

Stitch eliminates the design bottleneck:

→ Find a high-converting advertorial that's scaling on Meta
→ Screenshot it and drop it into Stitch (powered by Gemini 3.1)
→ Stitch redesigns it with your brand's colors, fonts, and imagery using Nano Banana 2
→ Edit sections visually (headlines, CTAs, layouts) without touching code
→ Export the code and paste it into Claude Code
→ Claude builds the full production site and deploys to Vercel or Netlify in 60 seconds

No designer. No $3K per landing page. No Claude Code frontend that looks like a template from 2019.

What you get:

→ Designer-quality landing pages and advertorials built in minutes, not weeks
→ Visual editing so you actually see the design before you code it
→ Nano Banana 2 generating on-brand product imagery and hero shots
→ A repeatable system: new angle, new page, same pipeline

Built 100% with Google Stitch 2.0 + Claude Code.

I put together a full playbook showing the exact workflow: how to find winning pages, redesign them in Stitch, and deploy with Claude Code.

Want it for free?

> Like this post
> Comment "STITCH"

And I'll send it over (must be following so I can DM)

Mike Futia

125,355 views • 3 months ago

The Visual Studio Code insiders version that just shipped and will ship in the next few days will come with an insane amount of new capabilities.

A few highlights:

- You can now run sub-agents in parallel. Yes, really. I even attached a video.
- Major UX improvements for sub agents, especially visible in the chat window
- A new search tool wrapped as a sub-agent that iteratively runs multiple search tools: semantic_search, file_search, grep_search. Which connects nicely to the point above: multiple searches running in parallel, efficiently and fast
- Anthropic’s Message API is now enabled by default
- You can choose the model for the cloud agent (three available, all premium)
- Extended thinking support when using the Claude cloud agent. This is part of the broader multi-vendor cloud support under AgentsHQ I wrote about a few weeks ago
- Tasks sent to the background agent (basically the CLI tool) now always run in isolation, each with its own git worktree
- In a multi-repo workspace, assigning a task to a cloud agent prompts you to choose the target repo. Same behavior when opening an empty workspace with no repo
- Support for building an external index for files not supported by GitHub’s default indexing
- UI/UX improvements for starting new sessions and switching between local / background / cloud agents
- Skills are now first-class citizens, just like prompt files, with better UX indicating when a skill is loaded
- Improved API for dynamic contribution of prompt files. New V2 includes skills as part of the model. Curious to see the extensions that will leverage this
- Finally, initial support for showing context usage percentage per session
- Skills are enabled by default
- Resizable chat window and session view. Small thing, but it was driving me crazy 😁
- A new integrated browser meant to replace the old simple browser. Maybe the beginning of real browser use?
- Better UI/UX for token streaming in chat
- Ability to index external files not supported by GitHub

There’s a lot more. Some of it hasn’t fully landed yet, but everything that has is already in Insiders. The next stable release should drop in early February.

As usual, I’m just shocked by the volume of features this team ships every month. After the holiday slowdown, this one is shaping up to be a wild release.


Oren Melamed

29,555 views • 5 months ago

Unpopular opinion: Most agent evals are theatre. You run them once before deployment. Each one takes 800ms+ because another LLM is judging your LLM. The most annoying part: no one tells you where in the chain things went wrong. I wasted a lot of time in this loop. Then I came across Future AGI, which brings 5 different tools under one umbrella. Best part: they open sourced their entire platform, and the eval layer is noticeably different. It is multimodal and works on text, image, audio, and PDF. It is not an LLM-as-judge adding latency but an agent with memory and tools. The biggest win is the learned classifiers, trained on actual production failure patterns, which run evals at low cost. It also runs across the full reasoning chain, not just the final response.
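As a rough illustration of evaluating the whole chain instead of only the final response, the Python sketch below checks every step of a trace and reports the first one that fails. It does not represent Future AGI's API: `Step`, `check_step`, and `first_failure` are hypothetical names, and the string check is only a placeholder for whatever cheap classifier a real system would use.

```python
from dataclasses import dataclass

@dataclass
class Step:
    name: str    # e.g. "retrieve", "plan", "answer"
    output: str  # what the agent produced at this step

def check_step(step):
    """Hypothetical cheap check; a trained classifier could be swapped in here."""
    return bool(step.output.strip()) and "ERROR" not in step.output

def first_failure(trace):
    """Return the name of the first failing step, or None if every step passes."""
    for step in trace:
        if not check_step(step):
            return step.name
    return None

# Example trace where the failure happens mid-chain, not in the final answer.
trace = [
    Step("retrieve", "3 documents found"),
    Step("plan", "ERROR: tool timeout"),
    Step("answer", "I could not find that information."),
]
failed_at = first_failure(trace)
```

Here `failed_at` names the step where things went wrong, which is the information a final-response-only eval cannot give you.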


Swapna Kumar Panda

49,557 views • 2 months ago

Building AI agents is finally simple, and Airia is leading the way. I’ve been testing Airia AI, an enterprise AI orchestration platform that unifies every model, workflow, and data source into one secure environment. Whether you’re a developer, analyst, creator, or enterprise leader, Airia makes it incredibly easy to build powerful AI agents without wrestling with multiple tools or complex integrations. Using the no-code builder, you can drag-and-drop actions, connect data, choose your LLM, and launch an agent in minutes. Then run it live, publish it, and even share it with the Airia Community, home to 2,500+ pre-built agents you can use or remix. If you want to automate workflows, prototype faster, or explore real enterprise AI use cases, Airia is the place to start. 👉 Build your first agent today: 👉 Explore the community: #Airia #AgenticAI #AIOrchestration #AIAgents #AIWorkflow #DigitalTransformation


Adarsh Chetan

268,444 views • 7 months ago

Background Terminal is a big unlock for Codex. It opens up so many possibilities by giving the agent more autonomy. In this video, the agent spins up a server in the background and debugs it while it keeps running. Try it out in the experimental features!


Ahmed

64,165 views • 6 months ago