Hamel Husain

@HamelHusain • 46,642 subscribers

Bringing data science back to AI - https://t.co/Zrmp6LRd9c About Me: https://t.co/P6WyeKkyTa

Shorts

Building AI products without evals

50,032 views

Videos

"Open AI is not going to make it" Jeremy Howard called it yesterday 🤯 Really prescient. h/t Vanishing Gradients podcast

1,140,696 views • 2 years ago

Made this video to explain evals

85,431 views • 7 months ago

For over a year, Jeremy Howard has been in stealth mode. In this exclusive talk, he showcases what he's been working on. He & Jonathan Whitaker show us SolveIt, a new dev environment and programming paradigm. 🤯
Imagine this workflow:
- Build a web app & interact with its UI on the same screen as your code. No more flipping to a separate browser.
- Use live variables from your REPL directly in prompts to the AI. The AI knows your current state.
- Turn any Python function into an AI tool instantly. No registering tools or MCP. Just write a function in a cell and tell the AI to use it.
This is a live, malleable environment that fuses the best ideas from Literate Programming (Knuth), the live-object world of Smalltalk, and the interactive cells of Jupyter.
Who is SolveIt for?
Jeremy's take: SolveIt is best for programmers who are either very new (under 3 years) or very experienced (over 20 years). Why? Because developers in the middle (3-20 years) often have ingrained workflows and can find this different paradigm confronting. New devs are open-minded and build good habits from scratch, while veterans immediately recognize how this approach solves decades-old problems of complexity and state management.
My Thoughts
It's early days, but it is super cool. You get all the fun of trying a new programming language without learning new syntax (it's Python), because it will show you new patterns and ways of doing things. If you are willing to climb the learning curve, it is an extremely powerful tool that you can be productive with on real tasks like writing and coding. I'm personally addicted to it for several workflows and am afraid of losing it tbh.
How Can You Try It?
Just follow Jeremy Howard - he will announce something in the coming weeks or months (I suspect if this post is popular he might do something soon 🤣)
TIMESTAMPS
(00:00:00) - Introduction
(00:00:30) - The SolveIt Method vs. "Vibe Coding"
(00:04:00) - Investing in Yourself: Long-Term Skill Building
(00:07:45) - Software Engineering vs. Short-Term Gains
(00:12:15) - The Problem-Solving Loop: Understand, Plan, Implement, Review
(00:18:50) - Example: Literate Programming with the Claudette Library
(00:24:15) - First Look at the SolveIt Environment
(00:28:34) - Demo Start: Building an Eval for Multimodal Models
(00:31:16) - Iterative Development: Exploring the iNaturalist API
(00:39:15) - Catching Bugs Instantly by Working Step-by-Step
(00:43:37) - Prompting LLMs with Structured Outputs
(00:51:54) - Demo: Building a Live Web App Inside SolveIt with FastHTML
(01:00:24) - Demo: Exploring a Complex API (Cloudflare)
(01:08:45) - Creating Custom AI Agent Tools with Zero Boilerplate
(01:19:00) - SolveIt Ergonomics: Modes, Secrets, and Keyboard Shortcuts
(01:28:30) - The Power of the SolveIt Community
(01:30:45) - Who Should Use SolveIt?
(01:34:30) - This is Just the Tip of the Iceberg
YT Video and links in reply

111,916 views • 11 months ago
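
The "turn any Python function into an AI tool" idea in the SolveIt post above can be illustrated with a minimal sketch. This is not SolveIt's implementation, which the post does not describe; it only shows how a tool description could be derived from a plain function's signature and docstring. The names `function_to_tool`, `call_tool`, and the type mapping are hypothetical.

```python
import inspect
from typing import Any, Callable, get_type_hints

# Hypothetical mapping from Python annotations to JSON Schema type names.
_JSON_TYPES = {int: "integer", float: "number", str: "string", bool: "boolean"}


def function_to_tool(fn: Callable[..., Any]) -> dict:
    """Build a generic tool description from a function's signature and docstring."""
    hints = get_type_hints(fn)
    properties = {}
    required = []
    for name, param in inspect.signature(fn).parameters.items():
        # Unannotated or unmapped types fall back to "string" in this sketch.
        properties[name] = {"type": _JSON_TYPES.get(hints.get(name), "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)
    return {
        "name": fn.__name__,
        "description": inspect.getdoc(fn) or "",
        "parameters": {"type": "object", "properties": properties, "required": required},
    }


def call_tool(fn: Callable[..., Any], arguments: dict) -> Any:
    """Dispatch a model-proposed call (a dict of argument values) to the real function."""
    return fn(**arguments)


def add(a: int, b: int = 0) -> int:
    """Add two integers."""
    return a + b


tool = function_to_tool(add)          # description an AI client could be given
result = call_tool(add, {"a": 2, "b": 3})
```

The point of the design is that the function itself is the single source of truth: its name, type hints, and docstring are enough to describe it, so no separate registration step is needed.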

This talk by Ben Clavié is the highest value-per-second talk I have ever watched on RAG. Chapter summaries and additional links in next tweet.

173,399 views • 2 years ago

Jonathan Whitaker's talk, Napkin Math For Fine Tuning, was so popular that we ended up doing an encore! He answers q's like:
- When should I use LoRA? Quantization? GC?
- What's the cheapest option? Most accurate?
- What hardware?
- What batch size / context length, etc.?

109,951 views • 2 years ago
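
As a flavor of the kind of napkin math the post above refers to, here is a small sketch that is not taken from the talk. It uses the standard LoRA parameter count: adapting a `d_out x d_in` weight matrix with rank `r` adds two factors with `r * (d_in + d_out)` trainable parameters in total. The layer size and rank below are illustrative assumptions.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two low-rank factors: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def full_params(d_in: int, d_out: int) -> int:
    """Parameters in the full weight matrix that LoRA leaves frozen."""
    return d_in * d_out


# Illustrative numbers only: one square 4096 x 4096 projection, rank 8.
d, r = 4096, 8
lora = lora_trainable_params(d, d, r)
full = full_params(d, d)
fraction = lora / full

print(f"LoRA params: {lora:,}  full params: {full:,}  fraction: {fraction:.4%}")
```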

I'm often asked for the best public example of AI evals done right for a real, production product. I finally have an answer. Teresa Torres shares how she shipped an AI interview coach, and used evals to rapidly squash bugs and improve the product. Teresa shows how she:
1. did error analysis FIRST to find real issues (instead of using generic metrics) 😍
2. used Jupyter notebooks to analyze errors
3. built custom annotation tools + custom widgets in notebooks
4. built an LLM judge and assertions to test for specific errors
5. iterated through this feedback loop until it worked
6. kept things simple the whole time
It's also probably the best commercial for Jupyter notebooks you can imagine. 🥰 Chapter summary below. Link to YT in next thread.
00:00:00 - Intro
00:01:45 - The Product: Building an AI Interview Coach
00:06:34 - The Problem: How Do I Know if My AI Coach is Any Good?
00:10:15 - Using Airtable for Traces and Annotation
00:12:15 - Discovering Jupyter Notebooks and Designing the First Evals
00:15:15 - Example Evals: LLM-as-Judge vs. Code-Based Assertions
00:21:00 - Learning Python with ChatGPT to Analyze Eval Results
00:31:00 - VS Code, Custom Tools, and an Eval Investigation Notebook
00:39:45 - Building a Custom Annotation Tool with Claude
00:41:00 - From Personal Project to Production App
00:46:02 - How Should PMs and Engineers Collaborate on AI Products?
00:55:45 - Q&A: Capturing Feedback and Annotations from End Users
00:58:11 - Q&A: Is a Technical Background Necessary to Build AI?
01:02:28 - Q&A: What's Next for Teresa?
01:03:13 - Q&A: Unpacking the Micro-Decisions of Building an AI App

51,376 views • 11 months ago
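
The distinction between code-based assertions and an LLM judge mentioned in the post above can be sketched roughly as follows. This is an illustration under stated assumptions, not Teresa's code: the failure modes, the trace format (`id` and `response` keys), and `stub_judge` are hypothetical, and the judge here is a plain function standing in for a real model call.

```python
from typing import Callable, Dict, List

Check = Callable[[str], bool]  # returns True when the response passes


def asks_at_most_one_question(response: str) -> bool:
    """Code-based assertion for a hypothetical failure mode: stacking several questions."""
    return response.count("?") <= 1


def stub_judge(response: str) -> bool:
    """Stand-in for an LLM-as-judge call. A real judge would prompt a model
    with a rubric for one specific error and parse a pass/fail verdict."""
    return "in general" not in response.lower()


def run_checks(traces: List[dict], checks: Dict[str, Check]) -> Dict[str, List[str]]:
    """Return, for each check, the ids of the traces that failed it."""
    failures: Dict[str, List[str]] = {name: [] for name in checks}
    for trace in traces:
        for name, check in checks.items():
            if not check(trace["response"]):
                failures[name].append(trace["id"])
    return failures


traces = [
    {"id": "t1", "response": "What happened next?"},
    {"id": "t2", "response": "Why did you do that? And how did it feel? What then?"},
    {"id": "t3", "response": "In general, interviews should be open-ended."},
]
failures = run_checks(
    traces,
    {"single_question": asks_at_most_one_question, "judge": stub_judge},
)
print(failures)
```

Deterministic checks like the first one are cheap and exact, so they are a natural fit when an error can be detected with code; a judge is reserved for errors that need interpretation.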

New video from Shreya Shankar on data processing with LLMs at scale, an underrated topic! Shreya starts with a real use case: public defenders analyzing case files for racial bias (4:08). Hundreds of pages per defendant. Court transcripts, police reports, news articles. Running GPT-5 on everything costs a fortune.
Her solution: treat LLMs like database operators. Semantic Map, Filter, Reduce (9:18). Databricks, BigQuery, and Snowflake are already shipping this as "AI SQL."
Starting at 12:51, she discusses a query optimizer for LLMs. Traditional databases rewrite queries for efficiency. Shreya does the same for LLM pipelines (LLM-specific semantic versions of split, map, and reduce, along with query decomposition). For example, trivial LLM calls are replaced with Python functions. These "rewrite directives" improve both cost AND accuracy.
She also talks about a cost optimization technique: Task Cascades (30:00). Instead of running GPT-5 on every document, first ask cheap questions. "Is there any lower court mentioned?" If no, the document clearly doesn't overturn a lower court. There are many other routing questions you can ask to reduce the amount of text sent to the LLM. This requires careful optimization and tuning to get right. She explains how to do this in the video. She runs through a production example that achieved 86% cost reduction while retaining 90% accuracy.
---
At 41:26, Shreya shifts to HCI. She built DocWrangler, an IDE for LLM pipelines. The design is based on "Three Gulfs" (44:35):
1. Comprehension: You don't know what's in your data
2. Specification: "Only prescription meds" is hard to operationalize
3. Generalization: A prompt that works on 10 examples fails at 10,000
Users invented "throwaway pipelines" just to explore their data before doing real analysis. These pipelines have no analytical purpose ("summarize these documents," "extract key ideas"); they are just ways to learn what's in the data before doing the real work. DocWrangler makes this a first-class feature.
---
In the last bit of the video, Shreya discusses why you can't know what "good" means until you see examples. In one study, a medical analyst extracted medications from doctor-patient transcripts. As they inspected outputs, they noticed every medication appeared with a dosage. They hadn't anticipated this. Now they wanted dosages too. They also saw Tylenol and ibuprofen appearing and realized: "Actually, I only want prescription medications." Shreya calls this "criteria drift." Your evaluation criteria evolve as you see more outputs.
This matters because standard ML assumes fixed metrics: define them upfront, collect labels, measure. But with LLMs on fuzzy tasks, that assumption breaks. You discover what you actually want through the process of evaluating. If you don't account for criteria drift, you end up optimizing for a stale rubric. DocWrangler and EvalGen accommodate this by placing the human in the loop thoughtfully.
Chapter timestamps:
(4:08) - The problem: unstructured data at scale
(9:18) - Semantic operators (Map, Filter, Reduce)
(12:51) - Query optimization for LLMs
(18:15) - Data decomposition (chunking)
(30:00) - Task Cascades (86% cost reduction)
(41:26) - DocWrangler IDE
(44:35) - Three Gulfs framework
(51:50) - Evaluation criteria drift
More links in reply

35,510 views • 7 months ago
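
The Task Cascades idea from the post above (ask a cheap routing question first, and only send the document to the expensive model if it passes) can be sketched as below. This is a simplified illustration, not Shreya's system: the cheap step here is a regular expression standing in for whatever cheap question is used in practice, and `expensive_model` is a stub where a real pipeline would call a large LLM. All names are hypothetical.

```python
import re
from typing import Callable, List, Tuple

# Cheap routing question: "Is there any lower court mentioned?"
# The pattern is an illustrative assumption, not a vetted legal heuristic.
LOWER_COURT = re.compile(r"\b(lower|district|trial|appellate) court\b", re.IGNORECASE)


def mentions_lower_court(doc: str) -> bool:
    return bool(LOWER_COURT.search(doc))


def expensive_model(doc: str) -> bool:
    """Stub for the costly call that decides whether a lower court was overturned."""
    return "reversed" in doc.lower() or "overturned" in doc.lower()


def cascade(
    docs: List[str],
    cheap_filter: Callable[[str], bool],
    model: Callable[[str], bool],
) -> Tuple[List[bool], int]:
    """Run the cheap filter first; only documents that pass reach the model.

    Returns the per-document answers and the number of expensive calls made.
    """
    answers: List[bool] = []
    expensive_calls = 0
    for doc in docs:
        if not cheap_filter(doc):
            answers.append(False)  # no lower court mentioned, so nothing to overturn
            continue
        expensive_calls += 1
        answers.append(model(doc))
    return answers, expensive_calls


docs = [
    "The appellate court reversed the lower court ruling.",
    "The defendant was arrested on Main Street.",
    "The trial court decision was affirmed.",
]
answers, expensive_calls = cascade(docs, mentions_lower_court, expensive_model)
print(answers, f"expensive calls: {expensive_calls} of {len(docs)}")
```

The savings depend entirely on how many documents the cheap step can rule out safely, which is why the post stresses that the cascade needs careful tuning.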

The three gulfs of LLM pipeline development is a powerful mental model to keep in mind while creating AI evals. Eugene Yan, Shreya Shankar, and I discuss it here (links to more resources in the replies).

51,548 views • 1 year ago

It is very easy to make mistakes when creating evals for your AI product. Shreya Shankar and I run through the most common mistakes in this talk (with memes 🌶️!). Chapter summaries below:
00:51 Foundation model benchmarks are not the same as your application evals
03:00 Generic Evals Are Useless
04:00 Do not outsource labeling & prompting to non domain experts
09:28 You should make your own data annotation app
12:40 Your LLM prompts should be specific and grounded in error analysis
15:25 Use binary labels
18:57 Look at your data
23:41 Be careful of overfitting to test data
25:40 Do online tests
Links to more resources in the reply

46,085 views • 1 year ago
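
One practical consequence of the "Use binary labels" advice above is that pass/fail labels make it simple to check an automated judge against human labels. The sketch below is an illustration, not material from the talk; the function name and the sample labels are made up. It reports how often the judge agrees with humans on passing examples and on failing examples separately.

```python
from typing import Dict, List, Optional


def judge_alignment(human: List[bool], judge: List[bool]) -> Dict[str, Optional[float]]:
    """Compare binary judge verdicts with binary human labels (True = pass).

    Returns the true positive rate (agreement on human-passed items) and the
    true negative rate (agreement on human-failed items). A rate is None when
    there are no items of that kind.
    """
    if len(human) != len(judge):
        raise ValueError("human and judge label lists must have the same length")
    positives = sum(human)
    negatives = len(human) - positives
    true_pos = sum(1 for h, j in zip(human, judge) if h and j)
    true_neg = sum(1 for h, j in zip(human, judge) if not h and not j)
    return {
        "tpr": true_pos / positives if positives else None,
        "tnr": true_neg / negatives if negatives else None,
    }


# Made-up labels for five traces.
human_labels = [True, True, True, False, False]
judge_labels = [True, True, False, False, True]
alignment = judge_alignment(human_labels, judge_labels)
print(alignment)
```

Reporting the two rates separately avoids a judge looking good on overall accuracy simply because most examples pass.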
