Arsh Shah Dilbagi

@arshdilbagi • 2,584 subscribers

https://t.co/C3UaMh1UYM

Videos

Introducing Adaline 2.0 - The Agent Self-Improvement Layer

Adaline turns Traces into Behaviors. Behaviors surface Issues, and Issues become auto-generated Evals + Data. Adaline then generates new agent candidates and tests them. You review the winners and ship!

Arsh Shah Dilbagi

917,095 views • 20 days ago

The hard part about LLM failures is that their outputs rarely look like failures. The demo “works.” The output sounds coherent. The user actively uses the product. And your dashboard looks normal.

Meanwhile, the system can be wrong, unsafe, or quietly driving up token spend. And you won’t notice until the damage adds up.

Prompts often serve as business logic (policies, safety, and product context). But many teams ship them without the basics, such as versioning, reviewable changes, end-to-end traces, and eval gates.

In production, it doesn’t crash. It degrades via wrong answers, policy misses, and surprise spending. No crash. No error. No alert.

I cover this exact issue in my Stanford University CS 224G guest lecture on AI Observability and Evaluations. Here are the core ideas:

• If you only log the final output, you’re guessing. Full traces show where it broke.
• Evals are feedback loops. Use clear pass/fail criteria tied to outcomes.
• Run evals continuously on production traces and don’t wait for support tickets.

The moat isn’t prompt cleverness. It’s measured improvement.

Full lecture + blog below 👇

Arsh Shah Dilbagi

20,833 views • 4 months ago
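To make the three core ideas from the post above concrete, here is a minimal Python sketch. It is not taken from the lecture or from Adaline. The `Step` and `Trace` types, the three checks, the 1,000-token budget, and the 0.95 threshold are all hypothetical, chosen only to illustrate full traces, pass/fail criteria, and an eval gate run over a batch of production traces.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Step:
    """One recorded step of a request (hypothetical schema)."""
    name: str        # e.g. "retrieve" or "generate"
    output: str
    tokens: int = 0


@dataclass
class Trace:
    """A full trace: every step, not just the final output."""
    trace_id: str
    steps: list[Step] = field(default_factory=list)

    @property
    def final_output(self) -> str:
        return self.steps[-1].output if self.steps else ""

    @property
    def total_tokens(self) -> int:
        return sum(step.tokens for step in self.steps)


# Each check is a plain pass/fail criterion (all three are illustrative).
def no_refund_promise(trace: Trace) -> bool:
    return "refund guaranteed" not in trace.final_output.lower()


def within_token_budget(trace: Trace, budget: int = 1000) -> bool:
    return trace.total_tokens <= budget


def retrieval_not_empty(trace: Trace) -> bool:
    # Only visible because intermediate steps are logged.
    return all(s.output.strip() for s in trace.steps if s.name == "retrieve")


CHECKS: dict[str, Callable[[Trace], bool]] = {
    "no_refund_promise": no_refund_promise,
    "within_token_budget": within_token_budget,
    "retrieval_not_empty": retrieval_not_empty,
}


def evaluate(trace: Trace) -> dict[str, bool]:
    """Run every check on one trace and report pass/fail per check."""
    return {name: check(trace) for name, check in CHECKS.items()}


def pass_rate(traces: list[Trace]) -> float:
    """Fraction of traces that pass every check (1.0 for an empty batch)."""
    if not traces:
        return 1.0
    passed = sum(all(evaluate(t).values()) for t in traces)
    return passed / len(traces)


def gate(traces: list[Trace], threshold: float = 0.95) -> bool:
    """Eval gate: True only if the batch meets the pass-rate threshold."""
    return pass_rate(traces) >= threshold
```

In this sketch, a trace whose final answer reads fine can still fail `retrieval_not_empty`, which is the kind of failure that stays invisible when only the final output is logged. Running `gate` on a schedule over sampled production traces is one way to surface regressions before support tickets do.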