Yuchen Zeng

@yzeng58 · 1,198 subscribers

Researcher @MSFTResearch, AI Frontiers Lab | Reasoning, Agent | Previously @Meta @MSFT_GSL @MITIBMLab @WisconsinCS

💻 Tired of running slow, expensive benchmark evals on every checkpoint? Try ✨BenchPress✨: provide a few benchmark scores, then get predictions for the remaining ~100 benchmarks, with trust probabilities and calibrated 90% prediction intervals.

How does this work? In his original post, Dimitris Papailiopoulos first tried the idea as a fun question: collect model-by-benchmark scores into a matrix, find its low-rank structure, and use matrix completion to predict missing benchmark scores from a few observed ones.

We expanded this into a full system: a fully audited 84-model x 133-benchmark score matrix, an optimized matrix-completion predictor, and a reliability layer for trust probabilities and 90% prediction intervals.

Beyond predicting missing scores, we also suggest practical seed benchmark sets. The five-probe set {GPQA-D, HLE, Codeforces, MMLU-Pro, ARC-AGI-1} recovers the rest of a model's public score profile with a median absolute error (MedAE) of 3.93 points. A lower-cost set {GPQA-D, MMLU-Pro, Aider Polyglot, MATH-500, AIME 2026} reaches a MedAE of 4.55 points.

See more details below 🧵1/7

This work is with Dimitris Papailiopoulos at AI Frontiers, a boutique research lab inside Microsoft Research.
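The matrix-completion idea in the post can be illustrated with a minimal sketch. This is not the BenchPress predictor. It assumes a generic iterative truncated-SVD imputation on synthetic data, and the function names (`complete_low_rank`, `median_abs_error`), the rank, and the iteration count are all hypothetical choices made for illustration.

```python
import numpy as np


def complete_low_rank(scores, rank=2, n_iters=500):
    """Fill NaN entries of a model-by-benchmark matrix via iterative truncated SVD.

    Illustrative only: a generic imputation scheme, not the BenchPress method.
    """
    observed = ~np.isnan(scores)
    # Initialise missing entries with each benchmark's mean observed score.
    col_means = np.nanmean(scores, axis=0)
    filled = np.where(observed, scores, col_means)
    for _ in range(n_iters):
        u, s, vt = np.linalg.svd(filled, full_matrices=False)
        low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank]
        # Keep observed scores fixed; update only the missing entries.
        filled = np.where(observed, scores, low_rank)
    return filled


def median_abs_error(pred, truth, mask):
    """Median absolute error (MedAE) over the entries selected by `mask`."""
    return float(np.median(np.abs(pred[mask] - truth[mask])))


# Synthetic stand-in for a score matrix: 30 "models" x 12 "benchmarks", exactly rank 2.
rng = np.random.default_rng(0)
skills = rng.uniform(0.0, 1.0, size=(30, 2))      # hypothetical latent model skills
loadings = rng.uniform(10.0, 50.0, size=(2, 12))  # hypothetical benchmark loadings
truth = skills @ loadings

# Hide roughly 20% of the entries to play the role of unreported scores.
hidden = rng.random(truth.shape) < 0.2
partial = truth.copy()
partial[hidden] = np.nan

pred = complete_low_rank(partial, rank=2)
print("MedAE on hidden entries:", median_abs_error(pred, truth, hidden))
```

On real leaderboards the score matrix is only approximately low rank and much sparser, so the rank and any regularisation would need to be chosen by held-out validation rather than fixed as above.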

27,665 views