Before Fable got released (and pulled), Mozilla was quietly testing Claude Mythos against Firefox's 10M-line codebase. The result? Over 400 security bug fixes, including ones that had been hiding in the codebase for over a decade. Brian Grinstead, distinguished engineer at Mozilla, walked me through the agentic bug-finding...

claire vo 🖤

56,832 subscribers

247,719 views • 11 days ago • via X (Twitter)

Education, Health & Wellness, Science & Technology

Related videos

Claude Code Unpacked! Visual walkthrough of the entire 500k-line leaked codebase.

What happens when you type a message:
- the agent loop
- 50+ tools
- multi-agent orchestration
- unreleased features

Want to understand the internals or build your own agent harness? Start here:

Akshay 🚀

24,408 views • 3 months ago

The founder of LangChain says both models and harnesses have gotten really good between December and now.

According to Harrison Chase, the core idea of an agent before Christmas was a model running in a loop and calling tools. This had been the north star for 3 years.
- LangChain had this when it launched
- AutoGPT was the same idea
- OpenClaw is kind of a future version of it

Then about a year ago, they started getting really good. Claude Code, Manus, and Deep Research all launched around the same time. All of them use the same pattern: running in a loop with harnesses (planning tools, file systems, code execution, etc.).

Harness engineering became a thing. Then Opus came out in November and really unlocked it.
- the harness let the model do more and more
- less hardcoded logic
- way more control

Then everyone went on vacation, played around, and realized that the model and the harness finally worked reliably.

Ivan Burazin

33,948 views • 2 months ago
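The pattern described in the post above, a model running in a loop and calling tools, can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not any framework's actual implementation: `TOOLS`, `fake_model`, and `run_agent` are hypothetical names, and the "model" is a deterministic stub standing in for a real LLM call.

```python
from typing import Callable

# Hypothetical tool registry; the tool names and behavior are illustrative only.
TOOLS: dict[str, Callable[[str], str]] = {
    "add": lambda arg: str(sum(int(x) for x in arg.split())),
    "upper": lambda arg: arg.upper(),
}


def fake_model(history: list[str]) -> dict:
    """Stand-in for a real model call: ask for one tool, then give a final answer."""
    tool_results = [h for h in history if h.startswith("tool:")]
    if not tool_results:
        return {"type": "tool_call", "name": "add", "arg": "2 3"}
    answer = tool_results[-1][len("tool:"):]
    return {"type": "final", "text": f"The sum is {answer}"}


def run_agent(task: str, model=fake_model, max_steps: int = 5) -> str:
    """The loop itself: call the model, run the tool it asks for, feed the result back."""
    history = [f"user:{task}"]
    for _ in range(max_steps):
        action = model(history)
        if action["type"] == "final":
            return action["text"]
        result = TOOLS[action["name"]](action["arg"])
        history.append(f"tool:{result}")
    raise RuntimeError("step budget exhausted without a final answer")
```

With the stub model, `run_agent("add 2 and 3")` returns `"The sum is 5"` after one tool call. The step budget is a design choice in this sketch: it is one simple way to keep a loop like this from running indefinitely.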

Claude Code is the OpenClaw alternative you already had. I've been telling people this in person for three months. Now it's a piece on Every 📧.

When OpenClaw was at peak hype a few months ago, everyone was going crazy over it, for good reasons. But almost nobody made the obvious comparison: Claude Code already does the thing you love about OpenClaw. In a lot of cases, better.

Here's why people missed it. You have to understand the difference between a model and a harness. OpenClaw, at the end of the day, is just a harness on top of AI models. And Claude Code is too; it just got marketed as a coding tool, an alternative to Cursor. Until Cowork launched, almost nobody looked at it as a tool for non-programmers. Dan Shipper 📧 was one of the earliest to make that claim, and he was damn right.

Let Claude Code run from the home folder of your computer with free access, and you get exactly what made OpenClaw exciting.

Instead of just arguing it, here's a video walkthrough of a couple of my AI employees I built on the Claude Code harness. And here's the full article:

Nityesh

21,964 views • 7 days ago

The best *code embedding* model on the market right now was just released: Qodo-Embed-1.

There are two flavors: a lite model with 1.5B parameters and a medium model with 7B parameters (Hugging Face links below).

If you want to index a large codebase (supports 10M+ lines of code), this is the model you want.
1. Index your repositories
2. Ask anything (including test and code generation)

The models are optimized to answer natural language questions or code-to-code questions. The video here shows the model indexing 90 repositories (!!!!!) and letting the user ask questions about them.

The simplest way to use the model is through the Qodo Gen AI extension in Visual Studio Code, Cursor, or JetBrains (see link below).

Santiago

56,584 views • 1 year ago

This is one of the coolest open-source AI agent projects I've seen in a while: 'Understand Anything'.

It's a plugin for Claude Code, Codex, OpenCode, etc. that analyzes your codebase and turns it into a knowledge base that you can interact with. It explains the codebase to you, rather than showing you the structure.

It seems like it's designed for code, but I opened my Obsidian vault of podcast highlights in Claude Code, then ran /understand. The result is a searchable knowledge graph of highlights from 888 podcast episodes and 144K lines of markdown text.

Dan McAteer

171,686 views • 13 days ago

I've long said that o3 is the best coding model - but if you're using an agent harness - Claude is just better at navigating your codebase. Enter the Repo Prompt pair programmer mode - it's the best of both worlds, as Claude coordinates with o3 to plan and apply edits for you!

eric provencher

93,571 views • 11 months ago

FREE CLAUDE FABLE 5 and almost nobody knows this trick

Anthropic recently released their most capable model ever - the first Mythos-class model available to the public

normally it's $10/$50 per million tokens

but there's a hack: GitLab just added Fable 5 to Duo Agent Platform - across ALL tiers, including free trial

the setup:
1/ register a GitLab account:
2/ start the free GitLab Duo trial
3/ Fable 5 is live through their AI Gateway
4/ done - Mythos-class model, $0

why Fable 5 is worth the hype:
1/ 80.3% SWE-Bench Pro - 11 points ahead of the next best model
2/ multi-DAY autonomous runs without losing the plot
3/ single-pass implementations of systems that took days of iteration
4/ Karpathy called it 'a major-version-bump-deserving step change'

also free on Claude Pro/Max/Team until June 22 - after that it's usage credits

you have a limited window to run the most capable public model on the planet for FREE

kaize

23,867 views • 22 days ago

In the future, you’ll be able to accomplish a goal by just giving Claude an outcome and a budget. That’s the direction Anthropic is building in with its new Managed Agents features, announced at this week’s Code with Claude developer event.

The basic idea: Claude, wrapped in a computer in the cloud, that you can spin up, scale, and manage as needed. Anthropic is taking on the infrastructure that kills most agent products, and making sure that it scales to meet the needs of agents running 24/7.

On this week’s AI & I from Every 📧, I talk with Angela Jiang, head of product for the Claude platform, and Katelyn Lesse, head of engineering for the Claude platform, about what Anthropic is building and what it takes to make agents reliable in production.

We get into:
- Why the "build a generic harness, hot-swap any model behind it" playbook is already outdated. Angela points to eval data on Memory where the same task performed drastically differently across different harnesses.
- The infrastructure wall every team hits in production, and why Katelyn thinks “my sandbox died and took the agent with it” is the real reason internal agents don't ship.
- Why Anthropic is so bullish on using file systems and skills within Claude, including Angela's argument that those early design choices can compound for years.

This is a must-watch for anyone trying to take an agent past the demo and into production. Watch below!

Timestamps:
00:01:48 How the Claude platform evolved from API to agents
00:04:09 The primitives that make up Claude Managed Agents
00:10:37 Why the harness and the model are becoming a single unit
00:18:49 The infrastructure wall that kills most agent projects in production
00:24:49 Why team agents need a different shape than individual productivity tools
00:26:36 How Anthropic's legal team uses an agent to review marketing copy
00:34:24 Using multi-agent orchestration for advisor strategies, adversarial pairs, and swarms
00:35:50 How to measure agent success with outcome and budget as the end state
00:39:11 What the platform looks like a year from now, when Claude writes its own harness

Dan Shipper 📧

66,339 views • 1 month ago

Anthropic engineer James Brady: "Every agent in production lies. We measured it. The good ones lie less, the great ones catch the lie before the user does." In 29 minutes, he walks through the verification stack he built and the patterns the Claude Code team adopted to keep agents honest at scale. Watch the full talk, then save the config below👇

rody

340,927 views • 27 days ago

Karpathy said something you'll regret ignoring: "We have to keep the AI on the leash. I'm still the bottleneck. I have to make sure this thing isn't introducing bugs and that there's no security issues."

He said it at a YC talk last year, when the worry was reliability. The models hallucinated and made mistakes no human would, so the leash meant keeping yourself in the loop and checking the output before trusting it. The models are far better now, and the line still holds, for a reason he was not focused on back then.

Even a model that writes flawless code today still has no idea who is allowed to run it. Correctness and authorization are different problems, and only correctness improves as the model improves. A perfect agent still hands you a tool where anyone can do anything, because permission was never part of the task.

I actually tested this in practice with Claude Code. I asked it to build a small internal tool with a button that issues account credits. It worked first try, and running it locally, the credit applied the instant I clicked. Nothing decided who was allowed to click it. The agent wrote the right logic and displayed a success notification. It never checked whether the caller had the right, whether it should pause for a human, or whether anything was logged.

And this is not a bug a smarter model can outgrow, because the leash was never in the code. Identity, permissions, and audit live in the system that runs the app, not in what the agent generates.

To solve this, I took the exact same bundle and hosted it on Retool. The credit write that fired silently on my laptop now stopped at an approval gate, resolved to a real identity through SSO, and landed in an audit log. I wrote none of it. The app inherited the entire boundary the moment it was deployed, and the video shows the before and after. You can try it yourself here:

I also wrote a detailed breakdown of the whole thing in my recent article, and I worked with the team to put this together. It walks through the build, the exact moment the credit write went through on my laptop with nobody checking, and then what changed when the same app ran on Retool. It also covers why this is a property of the runtime and not something a better model fixes, which is why devs typically miss this. The article is quoted below.

Akshay 🚀

42,511 views • 11 days ago
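The three things the post above says were missing from the generated tool (a permission check, an approval gate, and an audit record) can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: `Caller`, `issue_credit`, the `"credits:issue"` role string, and the threshold of 100 are hypothetical, and this is not how Retool or any other platform implements these controls.

```python
from dataclasses import dataclass, field
from typing import Optional

APPROVAL_THRESHOLD = 100  # hypothetical limit above which a human must approve


@dataclass
class Caller:
    user_id: str
    roles: set[str] = field(default_factory=set)


def issue_credit(
    caller: Caller,
    account: str,
    amount: int,
    balances: dict[str, int],
    audit_log: list[dict],
    approved_by: Optional[str] = None,
) -> str:
    """Apply a credit only if the caller is authorized; record every attempt."""
    entry = {"user": caller.user_id, "account": account, "amount": amount}

    # 1. Authorization: who is allowed to run this at all?
    if "credits:issue" not in caller.roles:
        audit_log.append({**entry, "outcome": "denied"})
        raise PermissionError(f"{caller.user_id} may not issue credits")

    # 2. Approval gate: large credits pause until a human signs off.
    if amount > APPROVAL_THRESHOLD and approved_by is None:
        audit_log.append({**entry, "outcome": "pending_approval"})
        return "pending_approval"

    # 3. The business logic itself, which is the only part the agent wrote.
    balances[account] = balances.get(account, 0) + amount
    audit_log.append({**entry, "outcome": "applied", "approved_by": approved_by})
    return "applied"
```

In this sketch only step 3 corresponds to the "right logic" the agent produced; steps 1 and 2 and the log entries are the boundary the post argues belongs to the runtime rather than to the generated code.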

"The model is no longer the product. Codex, Perplexity Computer, or Claude Code - all are orchestration systems. It takes a model and pairs it with an agent harness. What is an agent harness? The rules for how the agent loops around" - Aravind Srinivas

Rohan Paul

119,676 views • 12 days ago

Anthropic admitted they built an AI so capable they were scared to release it, and the number that explains why is 250.

Anthropic's CFO Krishna Rao described in this clip what happened when they ran Mythos against an open source codebase that a previous frontier model had already analyzed. The prior model found 22 security vulnerabilities; Mythos found 250, in the same codebase that the previous model had already reviewed and flagged as relatively clean. That number, more than 11 times as many vulnerabilities discovered, is not just a benchmark improvement. It is a signal that there is an entire layer of software infrastructure that humanity has assumed was secure, and that assumption may no longer hold.

The UK AI Security Institute independently evaluated Mythos Preview and confirmed what the internal numbers suggested. On expert-level capture-the-flag challenges that no model could complete before April 2025, Mythos succeeded 73% of the time, and it became the first model ever to complete a complex end-to-end attack range from start to finish, autonomously, without human guidance.

The World Economic Forum called this a new security-driven era for AI, the Governor of the Bank of England publicly warned that Anthropic may have found a way to unlock the entire cyber-risk landscape, and the European Central Bank began quietly contacting financial institutions to assess their security posture.

The response from Anthropic is what makes this story genuinely important. Rather than shelving the model or publishing it as a standard API release, Rao described a phased approach: restricting access to a controlled group, focusing specifically on how the cyber capabilities can be used defensively rather than offensively, and treating that framework as a template for how to release powerful but dangerous models in the future.

The broader context makes that framing even more significant. AI-generated code is already creating ten times more security vulnerabilities than human-written code, 63% of organizations reported experiencing an AI-driven cyberattack in the past 12 months, and traditional signature-based security tools were built for a threat model that no longer describes the attack surface companies are defending against.

Mythos represents a genuine leap in what autonomous security reasoning can do, and it cuts both ways. The model that can find 250 vulnerabilities in a codebase a prior model rated as mostly clean is also, in the wrong hands, the model that can exploit those 250 vulnerabilities before a human defender has even finished reading the report.

Anthropic's phased release strategy is not just a legal or PR decision; it is the most honest signal yet from a frontier lab that safety governance and capability development can no longer be treated as separate workstreams. The question is not whether this technology gets deployed. It is whether the institutions using it defensively stay ahead of the ones who will eventually use it offensively, and whether the labs building it can keep those two timelines from inverting.

Milk Road AI

24,356 views • 1 month ago

THIS GUY CONNECTED HIS AI AGENTS TO HIS OBSIDIAN AND BUILT A BRAIN THAT LEARNS ON ITS OWN. HERE'S HOW TO BUILD IT

Obsidian is just markdown files sitting in a folder. That turns out to be the perfect memory for an AI agent, because an agent can read and write those files directly.

He wired his agents into the vault so they pull context from it, do the work, and write what they learned back. The notes aren't the point. The loop is, and it gets sharper every cycle.

How to build it:

1. Point an agent at your vault. The fastest way, no plugins, no API keys: open a terminal and run npx obsidian-mcp /path/to/your/vault. That exposes your Obsidian folder to Claude as a tool it can read, search, and write to. Add it to your Claude Code or Cowork config and restart.

2. Confirm it can see the brain. Ask it: "list the notes in my vault and summarize what's in them." If it reads them back, the connection is live. Now it starts every task with everything the vault already holds instead of from zero.

3. Give each agent one job and a write-back rule. Tell it: "research this, then save what you found as a new note in /brain with links to related notes." One agent researches, one summarizes, one plans. Each writes its output back into the vault.

4. Close the loop. Add one line to every agent's instructions: "read /brain before starting, write your result back when done." Now each task leaves the vault richer, and the next run reads that before it works. It compounds instead of resetting.

5. You only steer. Review what the brain produces, point it at the next thing. The agents handle the reading, writing, and connecting.

The edge isn't better notes. It's a brain that feeds itself, so the work gets sharper every cycle instead of starting over.

Bookmark this

Yarchi

57,768 views • 26 days ago
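The "read /brain before starting, write your result back when done" rule from the post above can be sketched with plain file operations, since the vault is just markdown files in a folder. This is a minimal illustration under assumptions, not the MCP server the post mentions: `read_brain`, `write_note`, and `run_task` are hypothetical helpers, the `brain` subfolder name follows the post's /brain, and the `work` callback stands in for whatever an agent would actually do. The `[[name]]` link form is Obsidian's wikilink syntax.

```python
from pathlib import Path
from typing import Callable


def read_brain(vault: Path) -> dict[str, str]:
    """Load every markdown note in <vault>/brain, keyed by note name."""
    brain = vault / "brain"
    if not brain.is_dir():
        return {}
    return {p.stem: p.read_text(encoding="utf-8") for p in sorted(brain.glob("*.md"))}


def write_note(vault: Path, title: str, body: str, related: list[str]) -> Path:
    """Save a result as a new note that links back to the notes it was built on."""
    brain = vault / "brain"
    brain.mkdir(parents=True, exist_ok=True)
    links = "\n".join(f"- [[{name}]]" for name in related)
    path = brain / f"{title}.md"
    path.write_text(f"# {title}\n\n{body}\n\n## Related\n{links}\n", encoding="utf-8")
    return path


def run_task(vault: Path, title: str, work: Callable[[dict[str, str]], str]) -> Path:
    """One cycle of the loop: read the vault, do the work, write the result back."""
    context = read_brain(vault)
    body = work(context)
    return write_note(vault, title, body, list(context))
```

Because each call to `run_task` writes a note that the next call reads, the second run in a fresh vault sees one existing note and links to it. That is the compounding behavior the post describes, reduced to its file-system core.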

🚨 ANTHROPIC JUST REVEALED CLAUDE MYTHOS ABILITIES

Anthropic just formally announced "Claude Mythos Preview" and launched "Project Glasswing" to deploy it for cybersecurity defense. The models are unlocking completely new, autonomous behaviors.

This isn't about slightly better benchmark scores. This is about what the model can do.

Here are the direct quotes from Anthropic’s research team (including Dario) on exactly what Mythos is capable of:

• Chaining Exploits: "It has the ability to chain together vulnerabilities... this model is able to create exploits out of three, four, sometimes five vulnerabilities that in sequence give you some kind of very sophisticated end outcome."
• The Professional Standard: "The model that we're experimenting with is, by and large, as good as a professional human at identifying bugs."
• Unprecedented Autonomy: "It's just generally better at pursuing really long-range tasks that are kind of like the tasks that a human security researcher would do throughout the course of an entire day."

The Reality Check: Dario Amodei flat out said: "There's a kind of accelerating exponential... Claude Mythos Preview is a particularly big jump along that point."

Because this model has become so capable at identifying zero-days, they are restricting its release to top tech partners to try to patch the world's software before these capabilities leak out.

The autonomous researcher era has officially arrived. It’s over 💀

Chris

46,077 views • 2 months ago

The Claude Code SDK is now the Claude Agent SDK.

Why? Because we realized the Claude Code agent harness is useful for much more than coding. In fact, we're moving to using it to power most of our own agent loops at Anthropic.

Thariq

206,976 views • 9 months ago

an OpenAI engineer just showed how he gets agents to do his whole job: code, debug and more, using loops

29 minutes from the engineer who coined "harness engineering"

he writes the rules, agents write the code, a reviewer agent loops until it's right

the winners won't have the smartest model, they'll have the best loop around it

watch it, then read the full guide on loops below

Anatoli Kopadze

172,960 views • 11 days ago

Claude Security is now in public beta for Claude Enterprise customers. Claude scans your codebase for vulnerabilities, validates each finding to cut false positives, and suggests patches you can review and approve.

Claude

4,899,263 views • 2 months ago

We just crossed $10M in ARR at Chatbase! 🎉 🎉 And today, we're launching Chatbase as the full harness for customer-facing AI agents. Similar to how Claude Code is a harness for coding agents, Chatbase is the harness for customer experience agents. That means we give the model the context, tools, workflows, guardrails, and human-in-the-loop systems to be the best ambassador for your brand. It goes beyond just solving issues and gives your customers the best experiences across every channel. This is a milestone I have been thinking about and obsessed with since day 1, and I am super excited to bring my vision for customer-facing agents to life with Chatbase. Thank you to every one of our customers and to the amazing Chatbase team for getting us here! Next stop: $100M ARR

Yasser

746,867 views • 1 month ago

🔥 💻 🎥 How to provide o1 Pro with your FULL CODEBASE through the ChatGPT Cursor connection

yesterday I recorded a video connecting o1 Pro to Cursor through the ChatGPT Desktop app

more than one person commented on how limiting it is for o1 Pro to only have access to a single open file in Cursor

here's a walkthrough of how you can provide o1 Pro with your FULL CODEBASE as context instead of just a single file:

1. Write a Python function that concatenates full codebase into a single snapshot in a .txt file (link in comment)
2. Open that "codebase_snapshot.txt" in a separate Cursor pane
3. Go to the ChatGPT app and ask it what context it has access to through Cursor - should say it has two files with one of them being the codebase snapshot

and boom there ya go. o1 Pro has access to everything in your codebase

full walkthrough here 👇

Dan McAteer

99,719 views • 1 year ago
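
Step 1 of the post above mentions a Python function that concatenates a codebase into one snapshot file. The author's own function is linked elsewhere and is not reproduced here; what follows is a minimal sketch of how such a function might look. The name `snapshot_codebase`, the skipped directories, and the list of file suffixes are all assumptions chosen for illustration.

```python
from pathlib import Path

# Assumed defaults; adjust for the project at hand.
SKIP_DIRS = {".git", "node_modules", "__pycache__", ".venv"}
TEXT_SUFFIXES = {".py", ".js", ".ts", ".md", ".json", ".toml", ".txt"}


def snapshot_codebase(root: Path, out_file: Path) -> int:
    """Concatenate text files under `root` into `out_file`.

    Returns the number of files written into the snapshot.
    """
    count = 0
    out_resolved = out_file.resolve()
    with out_file.open("w", encoding="utf-8") as out:
        for path in sorted(root.rglob("*")):
            if not path.is_file() or path.suffix not in TEXT_SUFFIXES:
                continue
            rel = path.relative_to(root)
            if SKIP_DIRS & set(rel.parts):
                continue  # skip dependency and VCS directories
            if path.resolve() == out_resolved:
                continue  # never include the snapshot in itself
            out.write(f"\n===== {rel} =====\n")  # header marks each file
            out.write(path.read_text(encoding="utf-8", errors="replace"))
            out.write("\n")
            count += 1
    return count
```

Calling `snapshot_codebase(Path("."), Path("codebase_snapshot.txt"))` from a project root would produce the file that step 2 opens in a separate pane. For large repositories the snapshot may exceed what a model can take as context, so the suffix and skip lists are the first things to tune.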

I tested Genspark's new AI Developer: an L4 agent that builds a working app from an idea. You can choose any model (including Claude), and it runs in your browser and in the app, including planning, code, testing, and fixes. The output looks like it was done by a senior developer. I used a prompt to build a functional newsflash page that creates and curates a newsletter for me by adding a few links! It's revolutionary. Prompt in the comments. Genspark

Chubby♨️

235,727 views • 10 months ago