
Russ Salakhutdinov
@rsalakhu • 112,224 subscribers
CSO @ Sooth Labs, Professor @ CMU, President-Elect of the ICML Board, ex-VP of Research @ Meta (Multimodal LLMs, AI Agents), ex-Director of AI @ Apple.
Videos

How well do today’s frontier models handle long-horizon, multi-step web agent tasks, such as identifying the top 25 U.S. CS PhD programs with ML/AI faculty likely accepting students and compiling the results into a structured sheet?

Check out our new work on Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks

Paper: Leaderboard:

We introduce Odysseys, a benchmark of 200 long-horizon tasks derived from real browsing sessions and evaluated on the live Internet. We show that binary pass/fail is inadequate in this setting and propose rubric-based evaluation, which better aligns with human judgment and provides more informative signals.

Across leading models, the best achieves only 44.5% success, leaving substantial headroom. We further introduce a Trajectory Efficiency metric (rubric score per step) and find efficiency remains extremely low (1.15%), highlighting a key bottleneck.

Odysseys provides a realistic benchmark for measuring progress toward web agents capable of sustained, efficient, real-world operation. See a more detailed thread by Jing Yu Koh.
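To make the "rubric score per step" idea concrete, here is a minimal sketch in Python. It is not the benchmark's actual scoring code. It assumes that a rubric is a list of pass/fail items, that the rubric score is the fraction of items satisfied, and that Trajectory Efficiency is that score divided by the number of agent steps. The names `Trajectory`, `rubric_met`, `num_steps`, and the example numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    """Hypothetical record of one agent run on one task."""
    rubric_met: list[bool]  # assumption: one pass/fail flag per rubric item
    num_steps: int          # assumption: number of actions the agent took


def binary_success(t: Trajectory) -> bool:
    # Pass/fail view: the task counts only if every rubric item is met.
    return all(t.rubric_met)


def rubric_score(t: Trajectory) -> float:
    # Assumed definition: fraction of rubric items satisfied, in [0, 1].
    if not t.rubric_met:
        raise ValueError("rubric must contain at least one item")
    return sum(t.rubric_met) / len(t.rubric_met)


def trajectory_efficiency(t: Trajectory) -> float:
    # Assumed definition: rubric score earned per step taken.
    if t.num_steps <= 0:
        raise ValueError("num_steps must be positive")
    return rubric_score(t) / t.num_steps


# Made-up example: 3 of 4 rubric items met after 50 steps.
run = Trajectory(rubric_met=[True, True, True, False], num_steps=50)
print(binary_success(run))                  # False: one item is unmet
print(rubric_score(run))                    # 0.75
print(f"{trajectory_efficiency(run):.2%}")  # 1.50%
```

Under these assumptions, the example shows why binary pass/fail loses information: the run fails outright even though it satisfied three of four rubric items, and the per-step figure is small because the partial credit is spread over many steps.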
Russ Salakhutdinov • 22,518 views • 1 month ago