
vincent sunn chen
@vincentsunnchen • 1,457 subscribers
research & founding team @SnorkelAI. previously, @StanfordAILab, @hazyresearch.
Videos

New Benchtalks with John Yang: on ProgramBench (0% for frontier models at launch) and the lineage and future of coding benchmarks, from SWE-bench/InterCode to now.
01:29 ProgramBench launch and reception
03:41 Why artifact-level evaluation, not code-level
06:03 Why models love Python
08:29 ProgramBench as a research tool
12:45 From SWE-bench & InterCode to ProgramBench
17:47 How to grade a coding model
21:53 The position paper & humans in the loop
25:01 Managing quality with agents-in-the-loop
28:40 Internet access and benchmark integrity
35:26 Where models may surpass human abilities
38:56 When a model hits 80% on ProgramBench
43:55 Benchmarks worth paying attention to
46:24 What benchmark do you wish existed
49:32 Will benchmarks still look like benchmarks in 5 years
52:02 How to contribute to ProgramBench
vincent sunn chen • 26,128 views • 17 days ago

Terminal-Bench 2.0 went from ~25% → 80% in four months and became the standard eval for frontier CLI agents. Now, TB3 is in the works. I talked to Alex Shaw about what happens when model capabilities climb faster than we can measure them. His answer: the benchmark factory (Harbor Framework), infrastructure to develop hard, representative evals at the pace that the frontier moves. As Alex put it: "we need a thousand times more benchmarks than we have right now."
00:23 - How quickly models hill-climbed TB2
01:46 - What rapid progress reveals about benchmarks vs. real-world capability
03:28 - What made Terminal-Bench stick
04:58 - Why the terminal is the right abstraction for agentic AI
07:14 - How TB2 maintains task quality at scale
09:23 - Managing benchmark integrity in a benchmaxxing world
10:47 - Harbor: from experiment to benchmark factory
12:19 - What Harbor does that nothing else did
14:37 - The invariants: what won't change as agent evals evolve
16:55 - The benchmark Alex most wants to see built
18:18 - The ideal human-in-the-loop task creation flywheel
20:32 - How to contribute to Terminal-Bench 3.0
vincent sunn chen • 11,500 views • 2 months ago