Vals AI

@ValsAI • 13,757 subscribers

Public LLM Evaluation // https://t.co/FjWabQY2jk @8vc @BloombergBeta @pearvc

Videos

When will MiniMax have a Fable-level model? Olive Song, research lead at MiniMax (official), joins The Bench to answer that question, plus M3, open source models closing the gap on Vibe Code Bench, and the truth about 996 culture. Full episode out now!

7:19 - Inside MiniMax's research culture: AI agents tracking papers since early last year
18:01 - The headline of M3: native multimodality and one million tokens of context
21:10 - The bottleneck of building models is people, not compute
22:48 - Does MiniMax work 996?
25:47 - Olive's prediction and the 30 point gap on Vibe Code Bench
32:02 - Model capabilities Olive is most excited about
32:35 - The lab Olive most respects outside of MiniMax
36:24 - Will more labs go open source a year from now?

40,769 views • 3 days ago

The verdict is in! Frontier models can pass the bar, yet they struggle on comprehensive legal research.

Today we're releasing Legal Research Bench, a benchmark that measures models’ ability to solve realistic legal research tasks across eight areas of U.S. law.

Instead of awarding partial credit, Legal Research Bench measures whether a model can conduct exhaustive legal analysis. We grade against a strict, all-pass rubric written by practicing lawyers. A model only receives full credit if every required legal element is correct.

Claude Opus 4.8 leads with 43.8% all-pass accuracy, followed by GPT 5.5 (40.4%) and Claude Sonnet 4.6 (38.5%). While top models score around 80% with partial credit, none exceed 44% when every required legal element must be correct.

The gap between partial and all-pass accuracy shows how difficult it remains for AI to produce complete, reliable legal research. We hope that Legal Research Bench helps better measure, and ultimately close, that gap.

Lots of exciting work happening in Legal AI from Harvey and Crosby. Excited for the legal research benchmarks ahead!

17,133 views • 27 days ago
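
The difference between partial-credit and all-pass scoring described in the post above can be illustrated with a minimal sketch. The function names and the rubric outcomes below are hypothetical and invented for illustration; they are not taken from Legal Research Bench, and the benchmark's actual grading pipeline may differ.

```python
from typing import List


def partial_credit(results: List[List[bool]]) -> float:
    """Mean fraction of rubric elements passed per task (assumed definition)."""
    return sum(sum(task) / len(task) for task in results) / len(results)


def all_pass(results: List[List[bool]]) -> float:
    """Fraction of tasks where every rubric element passed."""
    return sum(all(task) for task in results) / len(results)


# Hypothetical rubric outcomes for four tasks (made-up data).
example = [
    [True, True, True, True],
    [True, True, True, False],
    [True, True, False, True],
    [True, True, True, True],
]

print(f"partial credit: {partial_credit(example):.3f}")
print(f"all-pass:       {all_pass(example):.3f}")
```

A single missed element barely moves the partial-credit score for a task but drops that task's all-pass score to zero, which is why the two numbers can sit far apart.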

AI is creating problems it still can’t solve. The same technology poised to automate millions of jobs still can’t reliably help people navigate SNAP, the food assistance program 40 million Americans depend on. We built the first benchmark to measure that.

Partnering with Center for Civic Futures and Code for America, we scored models on SNAP question scenarios that users would have to navigate, with expected response rubrics validated by policy experts to match practice considerations.

The best model only scored 62%. Models handle federal questions like appeals and recertification reasonably well, but fall short on state-specific ones like replacing an EBT card. Benefits are administered by the states, meaning models are weakest where people need them most.

The same technology poised to automate millions of jobs should at least help strengthen the social safety net for the people it could displace. As AI adoption proliferates into public services, governments need a reliable way to test these tools before deploying them.

20,737 views • 1 month ago

We’re releasing our Code Migration benchmark, and we managed to get Fable tested in time.

Code migration carries real economic weight. COBOL powers banks, payrolls, government services, and underpins nearly 95% of US ATM transactions. The danger with any migration is that a model ships code that looks right but quietly drops essential behaviors.

We evaluated models in three ways: modern-to-modern migrations, legacy-to-modern migrations, and overall code quality. Each model rebuilds the program in an offline sandbox, then is scored on a hidden behavior test with anti-cheat checks that catch anything wrapping the original, copying reference files, or staying in the source language.

Fable 5 leads overall at 55%, but costs $115.43 per test, while Opus 4.8 (47%) and GPT 5.5 (45%) cost $30.51 and $6.44, respectively, making GPT 5.5 the most cost-efficient model. Kimi K2.6 is the #1 open-weight model (28%) priced at $5.12, ranking above some frontier models.

16,686 views • 1 month ago
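
The cost-efficiency comparison in the post above can be checked with a short sketch. The scores and per-test costs are the ones quoted in the post; the metric itself (accuracy points per dollar) is an assumption made here for illustration, since the post does not say how cost efficiency was defined.

```python
# Accuracy (%) and cost per test (USD) as quoted in the post above.
results = {
    "Fable 5": (55, 115.43),
    "Opus 4.8": (47, 30.51),
    "GPT 5.5": (45, 6.44),
    "Kimi K2.6": (28, 5.12),
}


def points_per_dollar(accuracy_pct: float, cost_usd: float) -> float:
    """Hypothetical efficiency metric: accuracy points per dollar spent."""
    return accuracy_pct / cost_usd


efficiency = {
    model: points_per_dollar(acc, cost) for model, (acc, cost) in results.items()
}
# Rank models from most to least efficient under the assumed metric.
ranked = sorted(efficiency, key=efficiency.get, reverse=True)

for model in ranked:
    print(f"{model}: {efficiency[model]:.2f} points per dollar")
```

Under this assumed metric GPT 5.5 ranks first, which is consistent with the post's claim, and Kimi K2.6 comes second despite its lower absolute score.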

Stop vibe checking your vibe code! We just released Vibe Code Bench: the first benchmark that tests whether AI models can actually build complete web applications from scratch. Featured today in Inc. (1/6)

57,126 views • 8 months ago

We are excited to share that Logan Kilpatrick joined us on The Bench to discuss Google's new Gemini 3.5 Flash: why it's deliberately more persistent and capable than previous Flash models, how it hit #1 on our FinanceAgent Benchmark taking 82 steps where competitors stopped at 13, and what justifies the price increase. We also get into why AI benchmarks need a paradigm shift, the trade-off of building everything vs staying focused, the Pope, and why Omni might kill the Subway Surfers content era.

0:11:00 - Flash is being rebased for the agent era, not just a cheaper model anymore
0:14:03 - Persistence by design: 82 tool calls vs competitors' 13
0:17:52 - Why pricing went up and how Google thinks about value per token
0:22:55 - Coding performance: from 20th to 10th place in one generation
0:28:28 - Why benchmarks have historically been misleading and what the new era of evaluation looks like
0:29:28 - Logan on why Google has the best researchers in the world
0:36:16 - The cost of being Google
0:39:07 - The Pope’s encyclical on AI and whether most people see frontier intelligence as a good thing
0:51:12 - Why Omni is the thing that recently clicked

16,486 views • 1 month ago

Finance Agent Benchmark v2 is here. Finance is one of the most lucrative applications of AI where much of the busy work could be automated. That’s why we rebuilt our Finance Agent Benchmark to push frontier models even further. We designed V2 to better reflect what financial analysts actually do: refined taxonomy reflecting real workflows, an improved harness with more tools, and jury-based evaluation. The result: no model cracks 52%. Would you trust a financial analyst who’s only correct half the time?

11,248 views • 2 months ago
