
adarsh
@adarsh_exe • 7,010 subscribers
founder / co-ceo @mercor_ai, prev @harvard
Videos

Traditional coding benchmarks do not reflect how software is actually built and maintained. That's why we built a new benchmark, APEX-SWE, in partnership with Cognition. It measures whether AI models can do complex, real-world software engineering work: shipping systems that work and debugging them when they don't. OpenAI GPT 5.3 Codex (High) tops the leaderboard with a Pass@1 of 41.5%.
adarsh • 206,236 views • 2 months ago