Hamel Husain

@HamelHusain • 46,642 subscribers

Bringing data science back to AI - https://t.co/Zrmp6LRd9c About Me: https://t.co/P6WyeKkyTa

Shorts

Building AI products without evals

50,032 views

Videos

For over a year, Jeremy Howard has been in stealth mode. In this exclusive talk, he showcases what he's been working on. He & Jonathan Whitaker show us SolveIt, a new dev environment and programming paradigm. 🤯

Imagine this workflow:
- Build a web app & interact with its UI on the same screen as your code. No more flipping to a separate browser.
- Use live variables from your REPL directly in prompts to the AI. The AI knows your current state.
- Turn any Python function into an AI tool instantly. No registering tools or MCP. Just write a function in a cell and tell the AI to use it.

This is a live, malleable environment that fuses the best ideas from Literate Programming (Knuth), the live-object world of Smalltalk, and the interactive cells of Jupyter.

Who is SolveIt for?
Jeremy's take: SolveIt is best for programmers who are either very new (<3 years) or very experienced (>20 years). Why? Because developers in the middle (3-20 years) often have ingrained workflows and can find this different paradigm confronting. New devs are open-minded and build good habits from scratch, while veterans immediately recognize how this approach solves decades-old problems of complexity and state management.

My Thoughts
It's early days, but it is super cool. You get all the fun of trying a new programming language without learning new syntax (it's Python), because it will show you new patterns and ways of doing things. If you are willing to climb the learning curve, it is an extremely powerful tool that you can be productive with on real tasks like writing and coding. I'm personally addicted to it for several workflows and am afraid of losing it tbh.

How Can You Try It?
Just follow Jeremy Howard - he will announce something in the coming weeks or months (I suspect if this post is popular he might do something soon 🤣)

TIMESTAMPS
(00:00:00) - Introduction
(00:00:30) - The SolveIt Method vs. "Vibe Coding"
(00:04:00) - Investing in Yourself: Long-Term Skill Building
(00:07:45) - Software Engineering vs. Short-Term Gains
(00:12:15) - The Problem-Solving Loop: Understand, Plan, Implement, Review
(00:18:50) - Example: Literate Programming with the Claudette Library
(00:24:15) - First Look at the SolveIt Environment
(00:28:34) - Demo Start: Building an Eval for Multimodal Models
(00:31:16) - Iterative Development: Exploring the iNaturalist API
(00:39:15) - Catching Bugs Instantly by Working Step-by-Step
(00:43:37) - Prompting LLMs with Structured Outputs
(00:51:54) - Demo: Building a Live Web App Inside SolveIt with FastHTML
(01:00:24) - Demo: Exploring a Complex API (Cloudflare)
(01:08:45) - Creating Custom AI Agent Tools with Zero Boilerplate
(01:19:00) - SolveIt Ergonomics: Modes, Secrets, and Keyboard Shortcuts
(01:28:30) - The Power of the SolveIt Community
(01:30:45) - Who Should Use SolveIt?
(01:34:30) - This is Just the Tip of the Iceberg

YT Video and links in reply
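To make the "any Python function as an AI tool" idea concrete, here is a minimal sketch of how a tool description could be derived from a function's signature and docstring using only the standard library. This is not SolveIt's or Claudette's actual mechanism; `function_to_tool_schema`, the `_JSON_TYPES` mapping, and the output layout are all assumptions made for illustration.

```python
import inspect
from typing import Any, Callable, get_type_hints

# Illustrative subset: map a few Python annotations to JSON Schema type names.
_JSON_TYPES = {int: "integer", float: "number", str: "string", bool: "boolean"}


def function_to_tool_schema(fn: Callable[..., Any]) -> dict:
    """Build a generic tool description from a function (hypothetical helper)."""
    hints = get_type_hints(fn)
    signature = inspect.signature(fn)
    properties = {}
    required = []
    for name, param in signature.parameters.items():
        # Unannotated or unknown types fall back to "string" in this sketch.
        properties[name] = {"type": _JSON_TYPES.get(hints.get(name), "string")}
        # Parameters without a default must be supplied by the caller.
        if param.default is inspect.Parameter.empty:
            required.append(name)
    return {
        "name": fn.__name__,
        "description": inspect.getdoc(fn) or "",
        "parameters": {
            "type": "object",
            "properties": properties,
            "required": required,
        },
    }


def add(a: int, b: int = 0) -> int:
    """Add two integers."""
    return a + b


schema = function_to_tool_schema(add)
```

The point of the sketch is that the function itself already carries everything a tool description needs (name, docstring, typed parameters), which is why no separate registration step is strictly necessary.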

Hamel Husain

111,611 views • 9 months ago

New video from Shreya Shankar on data processing with LLMs at scale, an underrated topic!

Shreya starts with a real use case: public defenders analyzing case files for racial bias (4:08). Hundreds of pages per defendant. Court transcripts, police reports, news articles. Running GPT-5 on everything costs a fortune.

Her solution: treat LLMs like database operators. Semantic Map, Filter, Reduce (9:18). Databricks, BigQuery, and Snowflake are already shipping this as "AI SQL."

Starting at 12:51, she discusses a query optimizer for LLMs. Traditional databases rewrite queries for efficiency. Shreya does the same for LLM pipelines (semantic versions of split, map, reduce that are LLM specific, along with query decomposition). For example, trivial LLM calls are replaced with Python functions. These "rewrite directives" improve both cost AND accuracy.

She also talks about a cost optimization technique: Task Cascades (30:00). Instead of running GPT-5 on every document, first ask cheap questions. "Is there any lower court mentioned?" If no, the document clearly doesn't overturn a lower court. There are many other routing questions you can ask to reduce the amount of text sent to the LLM. This requires careful optimization and tuning to get right. She explains how to do this in the video. She runs through a production example that achieved 86% cost reduction while retaining 90% accuracy.

---

At 41:26, Shreya shifts to HCI. She built DocWrangler, an IDE for LLM pipelines. The design is based on "Three Gulfs" (44:35):
1. Comprehension: You don't know what's in your data
2. Specification: "Only prescription meds" is hard to operationalize
3. Generalization: A prompt that works on 10 examples fails at 10,000

Users invented "throwaway pipelines" just to explore their data before doing real analysis. Pipelines with no analytical purpose: "summarize these documents," "extract key ideas." Just ways to learn what's in their data before doing the real work. DocWrangler makes this a first-class feature.

---

In the last bit of the video, Shreya discusses why you can't know what "good" means until you see examples. In one study, a medical analyst extracted medications from doctor-patient transcripts. As they inspected outputs, they noticed every medication appeared with a dosage. They hadn't anticipated this. Now they wanted dosages too. They also saw Tylenol and ibuprofen appearing and realized: "Actually, I only want prescription medications."

Shreya calls this "criteria drift." Your evaluation criteria evolve as you see more outputs. This matters because standard ML assumes fixed metrics: define them upfront, collect labels, measure. But with LLMs on fuzzy tasks, that assumption breaks. You discover what you actually want through the process of evaluating. If you don't account for criteria drift, you end up optimizing for a stale rubric. DocWrangler and EvalGen accommodate this by placing the human in the loop thoughtfully.

Chapter timestamps:
(4:08) - The problem: unstructured data at scale
(9:18) - Semantic operators (Map, Filter, Reduce)
(12:51) - Query optimization for LLMs
(18:15) - Data decomposition (chunking)
(30:00) - Task Cascades (86% cost reduction)
(41:26) - DocWrangler IDE
(44:35) - Three Gulfs framework
(51:50) - Evaluation criteria drift

More links in reply
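The Task Cascades idea described above can be sketched roughly as follows. This is a toy illustration, not Shreya's implementation: `cheap_check` and `expensive_model` are hypothetical keyword stubs standing in for a cheap routing question and a costly LLM call, so the example runs offline.

```python
from typing import Callable, List, Tuple


def cheap_check(doc: str) -> bool:
    """Hypothetical cheap routing question: is a lower court mentioned at all?"""
    return "lower court" in doc.lower()


def expensive_model(doc: str) -> bool:
    """Stand-in for the costly LLM call (a keyword stub in this sketch)."""
    return "overturn" in doc.lower()


def cascade(
    docs: List[str],
    cheap: Callable[[str], bool],
    expensive: Callable[[str], bool],
) -> Tuple[List[bool], int]:
    """Label each document, calling the expensive step only when the cheap one passes."""
    labels: List[bool] = []
    expensive_calls = 0
    for doc in docs:
        if not cheap(doc):
            # No lower court mentioned, so it cannot overturn one: skip the big model.
            labels.append(False)
            continue
        expensive_calls += 1
        labels.append(expensive(doc))
    return labels, expensive_calls


docs = [
    "The appeals panel overturned the lower court ruling.",
    "Police report: traffic stop on Main Street.",
    "The lower court decision was affirmed.",
]
labels, calls = cascade(docs, cheap_check, expensive_model)
print(labels, calls)  # [True, False, False] 2
```

Here only two of the three documents reach the expensive step. In practice the savings depend on how selective the routing questions are, and, as the post notes, choosing them well takes careful tuning.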

Hamel Husain

35,391 views • 5 months ago

I'm often asked for the best public example of AI evals done right for a real, production product. I finally have an answer. Teresa Torres shares how she shipped an AI interview coach, and used evals to rapidly squash bugs and improve the product.

Teresa shows how she:
1. did error analysis FIRST to find real issues (instead of using generic metrics) 😍
2. used Jupyter notebooks to analyze errors
3. built custom annotation tools + custom widgets in notebooks
4. built an LLM judge and assertions to test for specific errors
5. iterated through this feedback loop until it worked
6. kept things simple the whole time

It's also probably the best commercial for Jupyter notebooks you can imagine. 🥰

Chapter summary below. Link to YT in next thread

00:00:00 - Intro
00:01:45 - The Product: Building an AI Interview Coach
00:06:34 - The Problem: How Do I Know if My AI Coach is Any Good?
00:10:15 - Using Airtable for Traces and Annotation
00:12:15 - Discovering Jupyter Notebooks and Designing the First Evals
00:15:15 - Example Evals: LLM-as-Judge vs. Code-Based Assertions
00:21:00 - Learning Python with ChatGPT to Analyze Eval Results
00:31:00 - VS Code, Custom Tools, and an Eval Investigation Notebook
00:39:45 - Building a Custom Annotation Tool with Claude
00:41:00 - From Personal Project to Production App
00:46:02 - How Should PMs and Engineers Collaborate on AI Products?
00:55:45 - Q&A: Capturing Feedback and Annotations from End Users
00:58:11 - Q&A: Is a Technical Background Necessary to Build AI?
01:02:28 - Q&A: What's Next for Teresa?
01:03:13 - Q&A: Unpacking the Micro-Decisions of Building an AI App
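As a rough illustration of the two eval styles named in the chapter list (code-based assertions and an LLM judge), here is a minimal sketch that runs both over a few traces and reports which traces fail each check. The trace fields, the specific failure modes, and `stub_judge` are all hypothetical and are not taken from Teresa's product; the judge is a keyword stub so the example runs without any model.

```python
from typing import Callable, Dict, List

Trace = Dict[str, str]


def no_template_leak(trace: Trace) -> bool:
    """Code-based assertion: the output must not contain unfilled placeholders."""
    return "{{" not in trace["output"]


def stub_judge(trace: Trace) -> bool:
    """Stand-in for an LLM judge. Hypothetical criterion: no directive 'you should' tone."""
    return "you should" not in trace["output"].lower()


def run_evals(
    traces: List[Trace], checks: Dict[str, Callable[[Trace], bool]]
) -> Dict[str, List[str]]:
    """Return, for each named check, the ids of the traces that failed it."""
    return {
        name: [t["id"] for t in traces if not check(t)]
        for name, check in checks.items()
    }


traces = [
    {"id": "t1", "output": "Nice open question. Try asking about a specific past event."},
    {"id": "t2", "output": "Hello {{name}}, great interview!"},
    {"id": "t3", "output": "You should ask better questions."},
]
results = run_evals(
    traces, {"no_template_leak": no_template_leak, "judge_tone": stub_judge}
)
print(results)  # {'no_template_leak': ['t2'], 'judge_tone': ['t3']}
```

Each check targets one specific error found during error analysis, which keeps the failure lists easy to inspect in a notebook.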

Hamel Husain

51,376 views • 9 months ago
