
Arsh Shah Dilbagi
@arshdilbagi • 2,584 subscribers
https://t.co/C3UaMh1UYM
Videos

Introducing Adaline 2.0 - The Agent Self-Improvement Layer
Adaline turns Traces into Behaviors, Behaviors surface Issues, and Issues become auto-generated Evals + Data. Adaline then generates new agent candidates and tests them. You review the winners and ship!
Arsh Shah Dilbagi • 917,095 views • 20 days ago

The hard part about LLM failures is that their outputs rarely look like failures. The demo “works.” The output sounds coherent. The user actively uses the product. And your dashboard looks normal. Meanwhile, the system can be wrong, unsafe, or quietly driving up token spend. And you won’t notice until the damage adds up.

Prompts often serve as business logic (policies, safety, and product context). But many teams ship them without the basics, such as versioning, reviewable changes, end-to-end traces, and eval gates. In production, such a system doesn’t crash. It degrades via wrong answers, policy misses, and surprise spending. No crash. No error. No alert.

I cover this exact issue in my Stanford University CS 224G guest lecture on AI Observability and Evaluations. Here are the core ideas:
• If you only log the final output, you’re guessing. Full traces show where it broke.
• Evals are feedback loops. Use clear pass/fail criteria tied to outcomes.
• Run evals continuously on production traces and don’t wait for support tickets.

The moat isn’t prompt cleverness. It’s measured improvement. Full lecture + blog below 👇
Arsh Shah Dilbagi • 20,833 views • 4 months ago
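
The three core ideas in the post above (keep full traces, define pass/fail evals, run them over production traces) can be illustrated with a minimal Python sketch. Everything here is hypothetical and for illustration only: the `Step` and `Trace` types, the `run_evals` and `passes_gate` helpers, the 1,000-token budget, and the "guaranteed refund" policy phrase are assumptions, not part of Adaline or the lecture.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Step:
    """One recorded step of a request (hypothetical shape)."""
    name: str        # e.g. "retrieve" or "generate"
    output: str
    tokens: int = 0


@dataclass
class Trace:
    """A full trace: every step, not only the final output."""
    trace_id: str
    steps: list[Step] = field(default_factory=list)

    @property
    def final_output(self) -> str:
        return self.steps[-1].output if self.steps else ""

    @property
    def total_tokens(self) -> int:
        return sum(step.tokens for step in self.steps)


# Each eval is a named pass/fail check over a whole trace.
Eval = Callable[[Trace], bool]


def within_token_budget(trace: Trace) -> bool:
    # Assumed budget of 1,000 tokens per request.
    return trace.total_tokens <= 1000


def no_refund_promise(trace: Trace) -> bool:
    # Assumed policy: the assistant must never promise a refund.
    return "guaranteed refund" not in trace.final_output.lower()


def costliest_step(trace: Trace) -> str:
    """Use step-level data to see where the tokens went."""
    return max(trace.steps, key=lambda step: step.tokens).name


def run_evals(traces: list[Trace], evals: dict[str, Eval]) -> dict[str, dict]:
    """Return the pass rate and failing trace ids for each eval."""
    results = {}
    for name, check in evals.items():
        failed = [t.trace_id for t in traces if not check(t)]
        rate = 1.0 if not traces else 1 - len(failed) / len(traces)
        results[name] = {"pass_rate": rate, "failed": failed}
    return results


def passes_gate(results: dict[str, dict], min_pass_rate: float) -> bool:
    """Eval gate: every eval must meet the minimum pass rate."""
    return all(r["pass_rate"] >= min_pass_rate for r in results.values())


# Two sample production traces (made-up data).
traces = [
    Trace("t1", [Step("retrieve", "policy doc", 200),
                 Step("generate", "Returns are accepted within 30 days.", 300)]),
    Trace("t2", [Step("retrieve", "ten long documents", 1800),
                 Step("generate", "Returns are accepted within 30 days.", 300)]),
]
evals = {"token_budget": within_token_budget, "refund_policy": no_refund_promise}
results = run_evals(traces, evals)
```

Both sample traces end with the same coherent answer, so a log of final outputs alone would not distinguish them. The step-level data is what lets `costliest_step` attribute the overspend in `t2` to retrieval, and `passes_gate` turns the pass rates into a single ship/no-ship decision that can run on a schedule rather than after a support ticket.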