📣 Introducing SWE-PolyBench: A new open-source multilingual benchmark for evaluating #AI coding agents. SWE-PolyBench is the first benchmark to evaluate AI coding agents' ability to understand complex codebases, helping advance AI performance in the real world. Learn more. 👉

Amazon Web Services

2,232,005 subscribers

10,866 views • 1 year ago • via X (Twitter)

Science & Technology #AI

3 Comments

Anthony Jia Sides • 1 year ago

Lol

BlockseBlock • 1 year ago

SWE-PolyBench is a great step for AI coding agents

solitarycyclist90 • 1 year ago

Why do techies feel they all need to dress up like Steve Jobs when presenting?

Related Videos

Today we're excited to introduce Devin, the first AI software engineer. Devin is the new state-of-the-art on the SWE-Bench coding benchmark, has successfully passed practical engineering interviews from leading AI companies, and has even completed real jobs on Upwork. Devin is an autonomous agent that solves engineering tasks through the use of its own shell, code editor, and web browser. When evaluated on the SWE-Bench benchmark, which asks an AI to resolve GitHub issues found in real-world open-source projects, Devin correctly resolves 13.86% of the issues unassisted, far exceeding the previous state-of-the-art model performance of 1.96% unassisted and 4.80% assisted. Check out what Devin can do in the thread below.

Cognition

31,436,462 views • 2 years ago

Robotics: coding agents’ next frontier. So how good are they? We introduce CaP-X: an open-source framework and benchmark for coding agents, where they write code for robot perception and control, execute it on sim and real robots, observe the outcomes, and iteratively improve code reliability. From NVIDIA, Berkeley AI Research, CMU Robotics Institute, and Stanford AI Lab 🧵

Max Fu

167,781 views • 2 months ago

10× AI performance is the new benchmark. Dr. Lisa Su introduces AMD Helios and MI455, delivering up to 10x more AI performance to power stronger models, smarter agents and next-gen applications.

CES

10,693 views • 4 months ago

MiniMax M2.1 is officially live🚀 Built for real-world coding and AI-native organizations — from vibe builds to serious workflows. A SOTA 10B-activated OSS coding & agent model, scoring 72.5% on SWE-multilingual and 88.6% on our newly open-sourced VIBE-bench, exceeding leading closed-source models like Gemini 3 Pro and Claude 4.5 Sonnet. The most powerful OSS model for the agentic era is here.

MiniMax (official)

1,073,282 views • 5 months ago

What is the best LLM for agentic software engineering? Today, we're releasing The OpenHands Index to answer this question. It's the first broad-coverage benchmark for AI coding agents, comparing them on accuracy, cost, and runtime across 5 task domains.

OpenHands

31,591 views • 4 months ago

Today, we’re introducing Decide Agent, a frontier-level AI agent for Excel analysis. Decide Agent scored 82.50% on SpreadsheetBench, the public standard benchmark for Excel AI agents, ranking #4 globally. This is the same leaderboard used by Claude, OpenAI, and Microsoft to evaluate their spreadsheet agents. We’re here to push the frontier in Excel and data analytics. Decide Agent is now generally available at trydecide . ai

Ab.

40,286 views • 4 months ago

Anthropic CPO, Mike Krieger: Dario Amodei predicted the coding benchmark (SWE-bench) would reach 90% by the end of the year. I’ve started taking AI timelines more seriously after seeing the progress. "mid-2025 now feels much closer than 2027"

Haider.

46,067 views • 1 year ago

Traditional coding benchmarks do not reflect how software is actually built and maintained. That's why we built a new benchmark, APEX-SWE, in partnership with Cognition. It measures whether AI models can perform complex, real-world software engineering work to ship systems that work and debug them when they don't. OpenAI GPT 5.3 Codex (High) tops the leaderboard at 41.5% on Pass@1.

adarsh

206,554 views • 2 months ago

Shoutout to the xAI team for hustling at 2:00 am to help bring this over the finish line. Introducing Agent Runner: the first open-source agent harness run with real users to create a live benchmark of real-world coding. We trace tool-calls, reprompting, and multifile edits, starting with the best from OpenAI, xAI, Google DeepMind, Anthropic, Mistral AI, Z.ai, and Kimi.ai.

Grace Li

119,237 views • 6 months ago

This is unironically an excellent benchmark for AI voice agents

Justine Moore

266,335 views • 6 months ago

🕵‍Jules is an AI coding agent that can handle real coding challenges, improve and understand large codebases, and asynchronously tackle tasks to help you work more efficiently.

Google AI Developers

36,897 views • 1 year ago

AI coding tools are everywhere. Yet they all fall short when scaling to 100s of developers and complex codebases. We’re solving this problem at Augment Code. Introducing Augment Code, the first developer AI purpose-built for teams.🧵

Scott Dietzen

91,168 views • 1 year ago

Introducing Traces: a new way to share and discover traces from coding agents, and a small step to make AI more multiplayer. Here's how it works:

Tarun Sachdeva

54,523 views • 3 months ago

Introducing Helmor: the open-source, local-first answer to Conductor. A more refined, faster GUI for orchestrating coding agents. No cloud. One-click import from Conductor. AI made coding faster. Helmor is about finishing the rest of the loop: orchestration, workspaces, review, testing, and merge. We believe the next generation of GUI agent orchestration should be built in the open, by the community.

Caspian 東澔

115,775 views • 1 month ago

✅ Announcing our partnership with Manta Network (🔱,🔱) to accelerate on-chain #AI content and on-chain AI agents. Orbofi AI, the most adopted AI engine in web3, is thrilled to announce its partnership with Manta Network (🔱,🔱), the modular and fastest-growing L2, as we jointly empower consumers and developers to create AI assets and AI agents using the Orbofi engine and tokenize them on Manta in a few clicks. Orbofi AI is acting as a factory engine for on-chain AI assets and AI agents for the Manta community and the overall web3 ecosystem, powered by open-source and distributed AI models, setting a new benchmark for on-chain AI applications. 🪙 Join the Creative campaign, create your first AI NFT on Manta, and earn rewards: 📙 Learn more about the partnership:

Orbofi

99,545 views • 2 years ago

I'm excited to share that we've built the world's most capable AI software engineer, achieving 30.08% on SWE-Bench – ahead of Amazon and Cognition. This model is so much more than a benchmark score: it was trained from the start to think and behave like a human SWE.

Alistair

819,953 views • 1 year ago

Introducing the new AI first vibe coding experience in Google AI Studio! Built to take you from prompt to production with Gemini, and optimized for AI app creation. Start building AI apps for free : ) More updates and features to come!

Logan Kilpatrick

1,580,520 views • 7 months ago

AI coding agents aren't just about autocorrect, but how do you get the best coding experience with an AI-powered coding agent? In our new short course, Build Apps with Windsurf’s AI Coding Agents, you'll learn how to build, debug, and deploy applications with an agentic AI-powered integrated development environment (IDE). AI coding agents, like Codeium's @windsurf, don’t just suggest code; they analyze your codebase, track changes, retrieve relevant information, and apply updates across multiple files. They can help debug, refactor, and even modernize legacy frameworks. But to use them effectively, you need the right approach. This new course shows you how to: 🛠️ Use AI agents to build and refine applications, like a Wikipedia analysis app. 🐞 Debug and refactor JavaScript with AI-assisted automation. 🔍 Understand how search and retrieval power AI coding agents. 🤖 Guide an AI agent effectively: prompting, iterating, and correcting when needed. Taught by Anshul Ramachandran, this course gives you hands-on coding experience, insights into how these AI systems work under the hood, and best practices to improve your development workflow. 🔗 Enroll for free:

DeepLearning.AI

23,184 views • 1 year ago

Hoping your coding agents could understand you and adapt to your preferences? Meet TOM-SWE, our new framework for coding agents that don’t just write code, but model the user's mind persistently (ranging from general preferences to small details) arxiv: ❓Motivation: Most coding agents today can plan, edit, run, and test code. But they still fail at a key part of real-world development, understanding the user! Underspecified, shifting, or context-dependent instructions can easily break them. You must have those moments when coding agents were running for 10 minutes and ended up producing things largely misaligned. (1/)

Xuhui Zhou

36,211 views • 7 months ago