📣 Introducing SWE-PolyBench: A new open-source multilingual benchmark for evaluating #AI coding agents. SWE-PolyBench is the first benchmark to evaluate AI coding agents' ability to understand complex codebases, helping advance AI performance in the real world. Learn more. 👉

Amazon Web Services

2,232,005 subscribers

10,866 views • 1 year ago • via X (Twitter)

Science and Technology #AI

Comments: 3

Anthony Jia Sides • 1 year ago

Lol

BlockseBlock • 1 year ago

SWE-PolyBench is a great step for AI coding agents

solitarycyclist90 • 1 year ago

Why do techies feel they all need to dress up like Steve Jobs when presenting?

Related videos

Today we're excited to introduce Devin, the first AI software engineer. Devin is the new state-of-the-art on the SWE-Bench coding benchmark, has successfully passed practical engineering interviews from leading AI companies, and has even completed real jobs on Upwork. Devin is an autonomous agent that solves engineering tasks through the use of its own shell, code editor, and web browser. When evaluated on the SWE-Bench benchmark, which asks an AI to resolve GitHub issues found in real-world open-source projects, Devin correctly resolves 13.86% of the issues unassisted, far exceeding the previous state-of-the-art model performance of 1.96% unassisted and 4.80% assisted. Check out what Devin can do in the thread below.

Cognition

31,438,898 views • 2 years ago

Robotics: coding agents’ next frontier. So how good are they? We introduce CaP-X: an open-source framework and benchmark for coding agents, where they write code for robot perception and control, execute it on sim and real robots, observe the outcomes, and iteratively improve code reliability. From NVIDIA, Berkeley AI Research, CMU Robotics Institute, and Stanford AI Lab. 🧵

Max Fu

168,956 views • 2 months ago

10× AI performance is the new benchmark. Dr. Lisa Su introduces AMD Helios and MI455, delivering up to 10× more AI performance to power stronger models, smarter agents, and next-gen applications.

CES

10,693 views • 5 months ago

Today we’re releasing Ramp SWE-Bench: a private, production-grounded coding benchmark created from real engineering problems we've faced at Ramp.

Ramp Labs

169,234 views • 3 days ago

MiniMax M2.1 is officially live🚀 Built for real-world coding and AI-native organizations — from vibe builds to serious workflows. A SOTA 10B-activated OSS coding & agent model, scoring 72.5% on SWE-multilingual and 88.6% on our newly open-sourced VIBE-bench, exceeding leading closed-source models like Gemini 3 Pro and Claude 4.5 Sonnet. The most powerful OSS model for the agentic era is here.

MiniMax (official)

1,073,384 views • 5 months ago

What is the best LLM for agentic software engineering? Today, we're releasing The OpenHands Index to answer this question. It's the first broad-coverage benchmark for AI coding agents, comparing them on accuracy, cost, and runtime across 5 task domains.

OpenHands

31,780 views • 4 months ago

Anthropic CPO Mike Krieger: Dario Amodei predicted the coding benchmark (SWE-bench) would reach 90% by the end of the year. I’ve started taking AI timelines more seriously after seeing the progress. "mid-2025 now feels much closer than 2027"

Haider.

46,067 views • 1 year ago

Today, we’re introducing Decide Agent, a frontier-level AI agent for Excel analysis. Decide Agent scored 82.50% on SpreadsheetBench, the public standard benchmark for Excel AI agents, ranking #4 globally. This is the same leaderboard used by Claude, OpenAI, and Microsoft to evaluate their spreadsheet agents. We’re here to push the frontier in Excel and data analytics. Decide Agent is now generally available at trydecide . ai

Ab.

40,286 views • 4 months ago

Traditional coding benchmarks do not reflect how software is actually built and maintained. That's why we built a new benchmark, APEX-SWE, in partnership with Cognition. It measures whether AI models can perform complex, real-world software engineering work to ship systems that work and debug them when they don't. OpenAI GPT 5.3 Codex (High) tops the leaderboard at 41.5% on Pass@1.

adarsh

207,664 views • 2 months ago

Shoutout to the xAI team for hustling at 2:00 am to help bring this over the finish line. Introducing Agent Runner: the first open-source agent harness run with real users to create a live benchmark of real-world coding. We trace tool calls, reprompting, and multifile edits, starting with the best from OpenAI, xAI, Google DeepMind, Anthropic, Mistral AI, Z.ai, Kimi.ai

Grace Li

119,237 views • 6 months ago

🕵‍Jules is an AI coding agent that can handle real coding challenges, improve and understand large codebases, and asynchronously tackle tasks to help you work more efficiently.

Google AI Developers

36,897 views • 1 year ago

This is unironically an excellent benchmark for AI voice agents

Justine Moore

266,416 views • 6 months ago

AI coding tools are everywhere. Yet they all fall short when scaling to 100s of developers and complex codebases. We’re solving this problem at Augment Code. Introducing Augment Code, the first developer AI purpose-built for teams.🧵

Scott Dietzen

91,325 views • 1 year ago

Introducing Traces. A new way to share and discover traces from coding agents, and a small step to make AI more multiplayer. Here's how it works:

Tarun Sachdeva

55,614 views • 3 months ago

Introducing Helmor. The open-source, local-first answer to Conductor. A more refined, faster GUI for orchestrating coding agents. No cloud. One-click import from Conductor. AI made coding faster. Helmor is about finishing the rest of the loop: orchestration, workspaces, review, testing, and merge. We believe the next generation of GUI agent orchestration should be built in the open, by the community.

Caspian 東澔

116,076 views • 1 month ago

✅Announcing our partnership with Manta Network (🔱,🔱) to accelerate on-chain #AI content and on-chain AI agents. Orbofi AI, the most adopted AI engine in web3, is thrilled to announce its partnership with Manta Network (🔱,🔱), the modular and fastest-growing L2, as we jointly empower consumers and developers to create AI assets and AI agents using the Orbofi engine and tokenize them on Manta in a few clicks. Orbofi AI is acting as a factory engine for on-chain AI assets and AI agents for the Manta community and the overall web3 ecosystem, powered by open-source and distributed AI models, setting a new benchmark for on-chain AI applications. 🪙Join the Creative campaign, create your first AI NFT on Manta, and earn rewards: 📙Learn more about the partnership:

Orbofi

99,545 views • 2 years ago

Building trusted AI agents is easier on Hedera. The open-source Hedera AI Agent Kit lets developers create AI-powered applications that understand natural language and perform real on-chain actions. Learn more: Get started:

Hedera

17,788 views • 7 days ago

I'm excited to share that we've built the world's most capable AI software engineer, achieving 30.08% on SWE-Bench – ahead of Amazon and Cognition. This model is so much more than a benchmark score: it was trained from the start to think and behave like a human SWE.

Alistair

819,999 views • 1 year ago

AI coding agents aren't just about autocorrect, but how do you get the best coding experience with an AI-powered coding agent? In our new short course, Build Apps with Windsurf’s AI Coding Agents, you'll learn how to build, debug, and deploy applications with an agentic AI-powered integrated development environment (IDE). AI coding agents, like Codeium's @windsurf, don’t just suggest code; they analyze your codebase, track changes, retrieve relevant information, and apply updates across multiple files. They can help debug, refactor, and even modernize legacy frameworks. But to use them effectively, you need the right approach. This new course shows you how to: 🛠️ Use AI agents to build and refine applications, like a Wikipedia analysis app. 🐞 Debug and refactor JavaScript with AI-assisted automation. 🔍 Understand how search and retrieval power AI coding agents. 🤖 Guide an AI agent effectively: prompting, iterating, and correcting when needed. Taught by Anshul Ramachandran, this course gives you hands-on coding experience, insights into how these AI systems work under the hood, and best practices to improve your development workflow. 🔗 Enroll for free:

DeepLearning.AI

23,255 views • 1 year ago

Hoping your coding agents could understand you and adapt to your preferences? Meet TOM-SWE, our new framework for coding agents that don’t just write code, but model the user's mind persistently (ranging from general preferences to small details) arxiv: ❓Motivation: Most coding agents today can plan, edit, run, and test code. But they still fail at a key part of real-world development, understanding the user! Underspecified, shifting, or context-dependent instructions can easily break them. You must have those moments when coding agents were running for 10 minutes and ended up producing things largely misaligned. (1/)

Xuhui Zhou

39,728 views • 7 months ago