Traditional coding benchmarks do not reflect how software is actually built and maintained. That's why we built a new benchmark, APEX-SWE, in partnership with Cognition. It measures whether AI models can perform complex, real-world software engineering work to ship systems that work and debug them when they don't. OpenAI...

adarsh

7,236 subscribers

212,522 views • 3 months ago • via X (Twitter)

Education Science & Technology

Related Videos

Wave 9 is here: a frontier model built for software engineering. Introducing our new family of models: SWE-1, SWE-1-lite, and SWE-1-mini. Based on internal evals, it has performance nearing that of frontier models from the foundation labs. Available now, only in Windsurf!

Devin Desktop

742,703 views • 1 year ago

Vibe coding is causing the death of SaaS... because we want better software: more features, more workflows, more EVERYTHING. We're not happy with existing software, so we vibe code something ourselves. But what if existing software came built in with an AI vibe coding layer? If it did, that software could sell more, beat AI-first companies, and reduce churn. We've already proven this works: that's how we added $1,000,000 in sales and prevented $120,000 in churn for our Series B customers. Gigacatalyst embeds a vibe coding layer inside your software. See how it works: If you could improve any daily software that you use, what would you change?

Namanyay

19,446 views • 1 month ago

When the way software is built changes, we help companies adopt it. For over a decade, that meant bringing React Native to production across mobile, web, desktop, and TV. Now we are expanding that same mission to AI-native engineering: helping teams ship software with agents, safely and at scale. This is our new manifesto.

Callstack Engineers

23,043 views • 1 month ago

Are AI agents ready to be your virtual coworker? Can they write your emails, build Excel models, and create slide decks? Introducing APEX-Agents, a frontier benchmark that tests how well AI agents complete real, long-horizon professional services deliverables in Google Workspace. Mercor APEX-Agents Pass@1 leaderboard: 🥇 Google Gemini 3 Flash (High): 24.0% 🥈 OpenAI GPT-5.2 (High): 23.0% 🥉 Paul Jankura Claude Opus 4.5 (High): 18.4%

Brendan (can/do)

145,326 views • 6 months ago

Building a star on Earth is a big engineering challenge. To tackle it, we’ve partnered with two industry powerhouses to create a high-fidelity digital twin of our #SPARC fusion machine. Here’s how this partnership is coming to life: ⚡️ Siemens: Data and Design — #Siemens software like Teamcenter and Designcenter NX let us manage SPARC’s complex design, housing data for our digital twin and handling traditional computer-aided design work. ⚡️ NVIDIA: AI Accelerator — NVIDIA’s Omniverse library provides a visually intuitive, 3D mirror of real-world SPARC that could improve planning and operations. And #NVIDIA software tools can help us build AI models of physics simulations that run enormously faster than traditional software. ⚡️ CFS: The Fusion Frontier — Integrating real-world and simulation data will let us benefit from AI-boosted physics software and compare its results with real-world data from how SPARC actually operates. Dive deeper into the details here and learn more about how our power trio is accelerating commercial fusion: #FusionEnergy #DigitalTwin #AI

Commonwealth Fusion Systems

13,875 views • 6 months ago

Can AI actually automate jobs? Scale AI and are launching the Remote Labor Index (RLI), the first benchmark and public leaderboard that test how well AI agents can complete real, paid freelance work in domains like software engineering, design, architecture, data analysis, and more. Early results show the limits of today’s models. The top AI agent completed just 2.5% of real freelance jobs better than humans. AI is powerful, but not yet reliable enough to replace skilled labor. RLI gives us a transparent way to track progress over time and bring clarity to the future of work.

Bing Liu

424,454 views • 8 months ago

Today we're excited to introduce Devin, the first AI software engineer. Devin is the new state-of-the-art on the SWE-Bench coding benchmark, has successfully passed practical engineering interviews from leading AI companies, and has even completed real jobs on Upwork. Devin is an autonomous agent that solves engineering tasks through the use of its own shell, code editor, and web browser. When evaluated on the SWE-Bench benchmark, which asks an AI to resolve GitHub issues found in real-world open-source projects, Devin correctly resolves 13.86% of the issues unassisted, far exceeding the previous state-of-the-art model performance of 1.96% unassisted and 4.80% assisted. Check out what Devin can do in the thread below.

Cognition

31,446,907 views • 2 years ago

GPT-4.1 is now my senior software architect / product manager. @openai GPT-4.1 is great at instruction following, so I asked it to work with me to build an app. It's really good for iterating, and great for planning and brainstorming! See how we built a plan together 👇🧵

Melvin Vivas

93,141 views • 1 year ago

The new Factory software AI agents just combined OpenAI’s Codex, Devin, and Cursor into one product. They are fully autonomous agents called Droids that perform deep code research, write technical docs, and do autonomous end-to-end coding. Here’s how it works (with real examples)👇:

Alvaro Cintas

94,866 views • 1 year ago

OpenAI's Greg Brockman: GPT-5.5 is not an endpoint, it's a beginning. It is an early step toward stronger models arriving in the coming months, with larger gains across many capabilities. The focus is not just on better benchmarks, but on usefulness in the real world, for real users and real-world work.

Haider.

40,201 views • 2 months ago

“AI agents will outperform humans at almost all jobs by 2026–2027.” - The forecast is everywhere. So we built the exam to test that claim, on real labor-market aligned work. On the hardest tier, top agents pass 2.6%. Meet Agents' Last Exam (ALE), a rolling benchmark measuring whether agents can actually do real jobs. 🧵👇

Yiyou Sun

95,103 views • 1 month ago

Here’s the problem: software interfaces are abstract, hardware is instinctive. We don’t learn objects, we recognize them. Buttons, wheels, friction, position: they map to how we exist in the real world. Menus, layers, hidden states: they fight against it. This is why this feels different. It’s not an app pretending to be a recorder. It’s a recorder that happens to exist in software. Built in wabi.

Tykra

30,343 views • 3 months ago

I built dev tools at GitHub for 7 years before AI coding became the default. AI made code gen fast, but software still depends on the human mind knowing which details matter. AI only made that harder. That's why I built Nuanced, the AI coding app I want. Nuanced is a macOS app for building software with agents around a living spec, keeping intent and implementation connected as the agent builds. Try it:

ayman nadeem

12,552 views • 26 days ago

AI Productivity Index (APEX) is the first benchmark that measures how well AI models perform real-world, economically valuable work. We partnered with former Treasury Secretary Larry Summers, former McKinsey managing partner Dominic Barton, legal scholar Cass Sunstein, cardiologist Dr. Eric Topol, and dozens of other experts to evaluate model performance on real deliverables across law, finance, consulting, and medicine. Check it out:

Mercor

630,943 views • 9 months ago

Imagine a computer where you don’t need to learn 10 apps to get work done. You just tell it what you want, and it adapts to how you work. I tested Happycapy with a real use case and created an image and a video. No coding, no complex software. Do this: build automations that run on schedule, and deploy agent teams that work for you. AI molds to how you work, not the other way around. If you can use a computer, you can make anything happen. Start your first Automation and Agent teams today.

Aaliya

17,042 views • 5 months ago

Databricks is excited to partner with OpenAI on GPT-5.5, their latest frontier model. GPT-5.5 will be available in Unity AI Gateway on launch. You can use it with coding tools such as Codex, or to power your enterprise agents. GPT-5.5 is state-of-the-art on many benchmarks including OfficeQA Pro, our benchmark for evaluating grounded reasoning on enterprise tasks. We are partnering with OpenAI to co-launch on Databricks. Hear more from our co-founder Patrick Wendell and OpenAI CRO Denise Holland Dresser on GPT-5.5 in Databricks.

Databricks

12,707 views • 2 months ago

🚨 OpenAI just launched Codex, a brand-new autonomous coding agent that can build features and fix bugs on its own. We’ve been using it at Every 📧 for a few days, and I’m impressed. I invited Alexander Embiricos (ben davies), a member of the product staff responsible for Codex, to demo Codex and talk about it live on a special edition of AI & I:
What Codex is and how it works
Codex is designed to be used by senior engineers; it performs coding tasks like adding features or fixing bugs autonomously. It's built to allow you to start many sessions at once, so you can have multiple agents working in parallel.
Codex is built to have "taste"
OpenAI trained Codex to have the taste of a senior software engineer. It knows how big codebases work, how to write a good PR, and uses clean, minimal code.
Why an “abundance mindset” is best for interacting with agents
Codex is designed to allow users to delegate many tasks at once without getting caught up in the details. This lets you point an abundance of agents at a specific task like a difficult bug; it’s worth it even if only one of them succeeds.
How OpenAI is thinking about agents
Codex is one piece of a unified super-assistant OpenAI wants to eventually build: an agent that helps users easily get things done by selecting the right tools for them behind the scenes.
OpenAI’s vision for the future of programming
In the future developers will probably spend less time writing routine code and more time guiding agents, reviewing their work, and making strategy decisions. Programming will become more social, letting teams easily delegate multiple tasks at once, allowing people to focus on ideas and collaboration instead of routine coding.
Watch below!

Dan Shipper 📧

145,487 views • 1 year ago

The new bar for software: "can I vibecode this?" We’re heading into a world of personal software, built for one person and their needs, and premium software, designed with such depth that you’re buying a solution, not a tool. Most software today sits in the middle. And the middle is death. As the market catches up to our new reality, it won’t just be the AI bubble popping; it’ll be a lot of traditional SaaS getting wiped out. (From my recent Kinference talk.)

Carl Rivera

98,551 views • 8 months ago

Chips are the foundation of every AI experience. That's why understanding the hardware behind the software has never mattered more. When you ask a chatbot a question or generate an image, there's a physical chip somewhere doing the heavy lifting. For most of computing history, that was a CPU—the "brain" of a computer—great for general tasks needed to run software and operating systems. AI is more complex: for workloads called training and inference, AI needs to perform trillions of calculations in parallel. That's where AI accelerators come in. Purpose-built accelerators can deliver significantly better performance and efficiency than general-purpose chips. Amazon Web Services Trainium chips are an example—purpose-built for AI training and inference. New chips for a new era. ⬇️

Amazon

27,151 views • 1 month ago