
Vals AI
@ValsAI • 11,040 subscribers
Public LLM Evaluation // https://t.co/FjWabQY2jk @8vc @BloombergBeta @pearvc
Videos

We are excited to share that Logan Kilpatrick joined us on The Bench to discuss Google's new Gemini 3.5 Flash: why it's deliberately more persistent and capable than previous Flash models, how it hit #1 on our Finance Agent Benchmark taking 82 steps where competitors stopped at 13, and what justifies the price increase. We also get into why AI benchmarks need a paradigm shift, the trade-off of building everything vs staying focused, the Pope, and why Omni might kill the Subway Surfers content era.

0:11:00 – Flash is being rebased for the agent era, not just a cheaper model anymore
0:14:03 – Persistence by design: 82 tool calls vs competitors' 13
0:17:52 – Why pricing went up and how Google thinks about value per token
0:22:55 – Coding performance: from 20th to 10th place in one generation
0:28:28 – Why benchmarks have historically been misleading and what the new era of evaluation looks like
0:29:28 – Logan on why Google has the best researchers in the world
0:36:16 – The cost of being Google
0:39:07 – The Pope’s encyclical on AI and whether most people see frontier intelligence as a good thing
0:51:12 – Why Omni is the thing that recently clicked
Vals AI • 15,581 views • 7 days ago

Finance Agent Benchmark v2 is here. Finance is one of the most lucrative applications of AI, and much of its busy work could be automated. That’s why we rebuilt our Finance Agent Benchmark to push frontier models even further. We designed v2 to better reflect what financial analysts actually do: a refined taxonomy reflecting real workflows, an improved harness with more tools, and jury-based evaluation. The result: no model cracks 52%. Would you trust a financial analyst who’s only correct half the time?
Vals AI • 10,687 views • 24 days ago