Traditional coding benchmarks do not reflect how software is actually built and maintained. That's why we built a new benchmark, APEX-SWE, in partnership with Cognition. It measures whether AI models can perform complex, real-world software engineering work to ship systems that work and debug them when they don't. OpenAI...

adarsh

7,046 subscribers

207,142 views • 2 months ago • via X (Twitter)

Education Science & Technology

Related Videos

Wave 9 is here: a frontier model built for software engineering. Introducing our new family of models: SWE-1, SWE-1-lite, and SWE-1-mini. Based on internal evals, it has performance nearing that of frontier models from the foundation labs. Available now, only in Windsurf!

Windsurf

742,465 views • 1 year ago

I'm excited to share that we've built the world's most capable AI software engineer, achieving 30.08% on SWE-Bench – ahead of Amazon and Cognition. This model is so much more than a benchmark score: it was trained from the start to think and behave like a human SWE.

Alistair

819,953 views • 1 year ago

Are AI agents ready to be your virtual coworker? Can they write your emails, build Excel models, and create slide decks? Introducing APEX-Agents, a frontier benchmark that tests how well AI agents complete real, long-horizon professional services deliverables in Google Workspace. Mercor APEX-Agents Pass@1 leaderboard: 🥇 Google Gemini 3 Flash (High): 24.0% 🥈 OpenAI GPT-5.2 (High): 23.0% 🥉 Anthropic Claude Opus 4.5 (High): 18.4%

Brendan (can/do)

145,326 views • 4 months ago

When the way software is built changes, we help companies adopt it. For over a decade, that meant bringing React Native to production across mobile, web, desktop, and TV. Now we are expanding that same mission to AI-native engineering: helping teams ship software with agents, safely and at scale. This is our new manifesto.

Callstack Engineers

16,312 views • 14 days ago

Introducing GPT-5.5. A new class of intelligence for real work and powering agents, built to understand complex goals, use tools, check its work, and carry more tasks through to completion. It marks a new way of getting computer work done. Now available in ChatGPT and Codex.

OpenAI

13,104,931 views • 1 month ago

FULL INTERVIEW: Sam Altman joins TBPN to discuss GPT-5.3-Codex, AI agents, Anthropic's Super Bowl ads, and more.
00:00 GPT-5.3-Codex
02:27 AI agents and the future of work
03:20 The role of forward-deployed engineers in AI
05:42 AI benchmarks
07:29 Emotional attachment to chatbots
10:40 On data and compute being the 'new oil'
12:56 Is software dead?
17:48 Codex Desktop and the rise of the general-purpose work agent
25:00 OpenAI’s last Super Bowl ad and the Anthropic ads

TBPN

639,180 views • 4 months ago

📣 Introducing SWE-PolyBench: A new open-source multilingual benchmark for evaluating #AI coding agents. SWE-PolyBench is the first benchmark to evaluate AI coding agents' ability to understand complex codebases, helping advance AI performance in the real world. Learn more. 👉

Amazon Web Services

10,866 views • 1 year ago

Building a star on Earth is a big engineering challenge. To tackle it, we’ve partnered with two industry powerhouses to create a high-fidelity digital twin of our #SPARC fusion machine. Here’s how this partnership is coming to life: ⚡️ Siemens: Data and Design — #Siemens software like Teamcenter and Designcenter NX let us manage SPARC’s complex design, housing data for our digital twin and handling traditional computer-aided design work. ⚡️ NVIDIA: AI Accelerator — NVIDIA’s Omniverse library provides a visually intuitive, 3D mirror of real-world SPARC that could improve planning and operations. And #NVIDIA software tools can help us build AI models of physics simulations that run enormously faster than traditional software. ⚡️ CFS: The Fusion Frontier — Integrating real-world and simulation data will let us benefit from AI-boosted physics software and compare its results with real-world data from how SPARC actually operates. Dive deeper into the details here and learn more about how our power trio is accelerating commercial fusion: #FusionEnergy #DigitalTwin #AI

Commonwealth Fusion Systems

13,796 views • 5 months ago

Today, we’re introducing Spawnlabs. We built it to push the limits of what software can be spawned, from simple tools to complex systems. Software has always been a way to extend what we can do and create, and now we're building Spawnlabs to automate the ability to spawn software that solves harder problems and builds deeper systems. Live now.

teddy.

67,010 views • 5 months ago

Can AI actually automate jobs? Scale AI and are launching the Remote Labor Index (RLI), the first benchmark and public leaderboard that test how well AI agents can complete real, paid freelance work in domains like software engineering, design, architecture, data analysis, and more. Early results show the limits of today’s models. The top AI agent completed just 2.5% of real freelance jobs better than humans. AI is powerful, but not yet reliable enough to replace skilled labor. RLI gives us a transparent way to track progress over time and bring clarity to the future of work.

Bing Liu

424,350 views • 7 months ago

Today we're excited to introduce Devin, the first AI software engineer. Devin is the new state-of-the-art on the SWE-Bench coding benchmark, has successfully passed practical engineering interviews from leading AI companies, and has even completed real jobs on Upwork. Devin is an autonomous agent that solves engineering tasks through the use of its own shell, code editor, and web browser. When evaluated on the SWE-Bench benchmark, which asks an AI to resolve GitHub issues found in real-world open-source projects, Devin correctly resolves 13.86% of the issues unassisted, far exceeding the previous state-of-the-art model performance of 1.96% unassisted and 4.80% assisted. Check out what Devin can do in the thread below.

Cognition

31,437,691 views • 2 years ago

Cognition is partnering with Mercedes-Benz to accelerate software engineering across their global engineering teams, representing one of the most extensive deployments of AI software engineering in the automotive industry to date. Scott Wu sat down with Katrin Lehmann, Mercedes-Benz CIO, to discuss the work:

Cognition

152,931 views • 1 month ago

The new Factory software AI agents just combined OpenAI’s Codex, Devin, and Cursor into one product. They are fully autonomous agents called Droids that perform deep code research, technical docs, and do autonomous end-to-end coding. Here’s how it works (with real examples)👇:

Alvaro Cintas

93,225 views • 1 year ago

GPT-4.1 is now my senior software architect / product manager. @openai GPT-4.1 is great at instruction following, so I asked it to work with me to build an app. It's really good for iterating: great for planning and brainstorming! See how we built a plan together 👇🧵

Melvin Vivas

93,141 views • 1 year ago

Software Engineering AI Agent on your machine connects with your apps and tools to automate engineering tasks. It can work with OpenAI, Claude Sonnet 3.5, Gemini and local Llama 3. 100% open source and free.

Shubham Saboo

169,111 views • 1 year ago

OpenAI's Greg Brockman: GPT-5.5 is not an endpoint, it's a beginning. It is an early step toward stronger models arriving in the coming months, with larger gains across many capabilities. The focus is not just on better benchmarks, but usefulness in the real world, for real users and real-world work.

Haider.

40,201 views • 1 month ago

🚨 OpenAI's own engineers just showed how to actually use OpenAI Codex properly. 60 minutes. free. built by the people who made it. watch the masterclass. bookmark it. worth more than every $900 coding course you almost bought. you’ve been using Codex like a simple coding tool… while it’s actually a full software engineering system. watch this, it could be the best 62 minutes of your life:

ZARA

289,995 views • 1 month ago

Most engineering metrics are broken. They count PRs and commits, not real impact. Developer 360 from CodeAnt AI (YC W24) changes that. Built on top of their AI code reviews, it shows what’s actually being built, fixed, or improved, with real visibility into engineering work.

Y Combinator

16,374 views • 8 months ago

Andrej Karpathy co-founded OpenAI, led AI at Tesla, and coined "vibe coding." In 4 minutes he explains why software is changing - and why Claude Skills, MCP servers, and AI agents aren't hype anymore. They're the foundation of how software gets built from now on. Imo, worth every second (I've added subtitles)👇

darkzodchi

453,635 views • 2 months ago

Here’s the problem: software interfaces are abstract, hardware is instinctive. We don’t learn objects, we recognize them. Buttons, wheels, friction, position: they map to how we exist in the real world. Menus, layers, hidden states: they fight against it. This is why this feels different. It’s not an app pretending to be a recorder. It’s a recorder that happens to exist in software. Built in wabi.

Tykra

30,245 views • 2 months ago