Run evals—directly from the OpenAI dashboard. Use your test data to compare model performance, iterate on prompts, and improve outputs. Here's a quick walkthrough:
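
As a rough illustration of the compare-and-iterate loop described in the post, here is a minimal Python sketch. It does not call the OpenAI dashboard or any OpenAI API. `TEST_DATA`, `exact_match`, `run_eval`, `variant_a`, and `variant_b` are all hypothetical names, and the two variants are plain functions standing in for different prompts or models.

```python
from typing import Callable, Dict, List

# Hypothetical test data: each row pairs an input with its ground-truth label.
TEST_DATA: List[Dict[str, str]] = [
    {"input": "The battery died after two days.", "expected": "negative"},
    {"input": "Setup took thirty seconds. Love it.", "expected": "positive"},
    {"input": "Works fine, nothing special.", "expected": "neutral"},
]


def exact_match(output: str, expected: str) -> bool:
    """Grader: pass when the output equals the ground truth, ignoring case and padding."""
    return output.strip().lower() == expected.strip().lower()


def run_eval(
    generate: Callable[[str], str],
    rows: List[Dict[str, str]],
    grader: Callable[[str, str], bool] = exact_match,
) -> float:
    """Run one variant over every row and return the fraction of rows that pass."""
    passed = sum(grader(generate(row["input"]), row["expected"]) for row in rows)
    return passed / len(rows)


def variant_a(text: str) -> str:
    """Stand-in for a first prompt or model: only knows two labels."""
    return "negative" if "died" in text.lower() else "positive"


def variant_b(text: str) -> str:
    """Stand-in for a revised prompt or model: adds a neutral fallback."""
    lowered = text.lower()
    if "died" in lowered:
        return "negative"
    if "love" in lowered:
        return "positive"
    return "neutral"


# Same inputs and same grader for both variants, so the scores are comparable.
scores = {
    "variant_a": run_eval(variant_a, TEST_DATA),
    "variant_b": run_eval(variant_b, TEST_DATA),
}

for name, score in scores.items():
    print(f"{name}: {score:.0%} passed")
```

In a real setup the two variant functions would wrap model calls, and the grader could be swapped for a model-based check; the fixed dataset and shared grader are what make the comparison meaningful.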

OpenAI Developers

363,875 subscribers

86,738 views • 1 year ago • via X (Twitter)

Science & Technology Education

10 Comments

alex fazio • 1 year ago

just a chill guy k-lling thousands of startups with a single post

Markus Odenthal • 1 year ago

This is pretty cool! Most people miss the evaluation steps. But it's actually crucial.

Quintus 🏛️ • 1 year ago

Can you let us export the eval to a .csv?

Prompt Perfect • 1 year ago

this is a healthy post.

william • 1 year ago

eval api wen?

Diego | AI 🚀 - e/acc • 1 year ago

🧪 What Are OpenAI Evals? OpenAI Evals is a tool to test and measure how well AI models perform specific tasks. Think of it like a report card for AI—researchers and developers use it to ensure the model gives accurate and helpful answers before it’s used in the real world. 📊🤖

TestingCatalog News 🗞 • 1 year ago

I think now we will need a way to run these on CI 😋

Jim Hull • 1 year ago

This is same as it has been for two months right? Took me a bit to wrap my head around it, but now i want to eval EVERYTHING! Great feature.

Chase Brower • 1 year ago

The factuality pass/fail grading seems to be mixed up somewhere in the pipeline: it gives a result (A) matching what I selected, but then it says fail. I found that if I check 'Response disagrees with the ground truth' (and only that), then it says success.

Soumyajit • 1 year ago

This is a great feature.

Related Videos

here's a quick look at an internal tool we made to build agents faster. we iterate on our agent's internal structure, run through dynamically generated prompts, and test new samples of outputs more in-distribution of how our agents act. let us know below if you'd like access

nico

15,554 views • 1 year ago

The Chat Playground is now the Prompts Playground—we've redesigned it to help you better test, compare, and iterate on prompts, including with tools like web search and file search. 🛝

OpenAI Developers

94,686 views • 1 year ago

Respan Launch Week, Day 2: Evals Evals are one of the hardest parts of building AI applications. It is not because teams cannot run them. It is because they are hard to structure, hard to compare, and even harder to improve over time. So teams end up guessing. Did this prompt actually get better? Is this model really an improvement? We built Evals in Respan to make this systematic. You define: - what you want to test, such as prompts, models, or configs - the dataset, from production logs or test cases - the evaluators, whether LLM, code, human, or a mix Then you run experiments and compare results side by side. Same inputs. Same evaluation. Clear answers. No more guessing. Start running your first eval.

Respan

127,656 views • 3 months ago

Wow OpenAI just launched ChatGPT Tasks and it's basically AI Agents for everyone Now AI Agents can do work for you 24/7 Here's your full walkthrough, how to use ChatGPT Tasks best, and 6 prompts that will IMMEDIATELY improve your life (trust me you want to bookmark this one)

Alex Finn

565,848 views • 1 year ago

It is very easy to make mistakes when creating evals for your AI product. Shreya Shankar and I run through the most common mistakes in this talk (with memes 🌶️!) . Chapter summaries below: 00:51 Foundation model benchmarks are not the same as your application evals 03:00 Generic Evals Are Useless 04:00 Do not outsource labeling & prompting to non domain experts 09:28 You should make your own data annotation app 12:40 Your LLM prompts should be specific and grounded in error analysis 15:25 Use binary labels 18:57 Look at your data 23:41 Be careful of overfitting to test data 25:40 Do online tests Links more resources in the reply

Hamel Husain

46,085 views • 1 year ago

how you can use openAI codex & gpt 5.5 completely FREE (the full guide) 100% legit. no subscription, zero API cost. up to 1M+ token/day. you need just an openAI account and here's how to set it up in 5mins. openAI has a program that gives eligible developers free API usage every day in exchange for sharing API data that helps improve future models. it's not a one-time credit, your allowance refreshes daily. depending on your usage tier, you can get access to hundreds of thousands, or even millions, of free tokens every single day on supported models. here's how to activate it: 1️⃣open your API dashboard: 2️⃣go to settings → data controls 3️⃣enable data sharing for your organization or project 4️⃣make sure your account has a positive API balance 5️⃣save the settings if your account is eligible, you'll see a message confirming access to complimentary daily usage. before you turn it on, know the tradeoff: • prompts and outputs from shared projects can be used to improve openai's models • don't use it for confidential information, client work, or sensitive data • eligibility depends on your account type and settings for everyone else, it's an incredible deal. use it to: • learn AI development • build side projects • experiment with codex • test agents and automations • prototype ideas without worrying about API costs most developers burn money testing ideas. this lets you experiment at scale while spending little to nothing.

m0h

68,972 views • 1 month ago

I added Avalanche data to BundleBear, my ERC4337 dashboard 🔺✨ Here's a quick rundown of ERC4337 activity on Avalanche

Kofi

18,972 views • 2 years ago

You can also use gpt-image-1 in the Playground to quickly iterate on prompts and images:

OpenAI Developers

69,089 views • 1 year ago

Create datasets, run evals, and even train models directly in Cursor with the Hugging Face plugin. Here's Ben Burtenshaw to show you how:

edwin

17,119 views • 4 months ago

Model, generate, refine and iterate - that's the loop! Then export your model to use anywhere you like. #madeWithUnbound

Unbound

11,446 views • 1 month ago

Today we're launching the PlanetScale hosted MCP server. Use your favorite AI agent to explore your data, identify slow queries, and improve database performance.

PlanetScale

29,388 views • 6 months ago

In pick the best model > improve your prompts > catch bugs with Braintrust and Cerebras inference. Avoid the AI deleting your entire codebase this halloween and remember, our free tier gets you 1M+ free toks/day per model.

Cerebras

299,029 views • 9 months ago

As promised, here's a quick clip of the OG Xbox dashboard running natively on a Nintendo Switch thanks to the amazing work from Milenko and TeamUIX

Generalkidd

55,381 views • 3 months ago

you might've heard about context engineering, but nobody has explained what it really is... that's why I made a full practical walkthrough on how to use it to make your AI outputs personalized and consistent in this walkthrough, I go over: – what context engineering actually is (without the jargon) – how to train AI on your own data and preferences – how to get outputs that sound like you wrote them (not generic AI) reply "context" and I'll DM it to you (must be following)

Tyler

20,012 views • 11 months ago

Generate images and video directly from your terminal with Claude + our new genmedia CLI. • Search and run fal models from CLI • Generate structured commands with Claude • Run async media workflows • Download outputs locally • Use installable agent skills Watch the full tutorial:

fal

11,549 views • 2 months ago

👨‍💻Have you tested models in the new Code Arena yet? In this thread, we’re showcasing real Gemini 3 Pro by Google DeepMind creations and the exact prompts used, all of them built inside the Code Arena. You can directly compare Gemini’s outputs against other top frontier models on real web development tasks. Build, compare, vote, and share your own creations directly from the chat. See examples in the thread. 🧵

Arena.ai

25,393 views • 8 months ago

SN35 (Cartha) Testnet is live and here is a miner setup walkthrough. If you want to get hands-on before mainnet, now’s the time. Test, break, iterate. Mainnet is coming.

0xMarkets

11,286 views • 6 months ago

Websites now self-improve! Our AI agent (trained on thousands of tests) comes up with website variations and then launches A/B tests. When tests reach stat sig, we use the data collected to improve your model and repeat the process. Try it:

Fab

6,793,216 views • 1 year ago

How to train a reasoning model, specifically Magistral from Mistral AI? Here is a quick video walkthrough of the paper.

Sophia Yang, Ph.D.

31,752 views • 1 year ago

Evals now supports tool use. 🛠️ You can now use tools and Structured Outputs when completing eval runs, and evaluate tool calls based on the arguments passed and responses returned. This supports tools that are OpenAI-hosted, MCP, and non-hosted. Read more in our guides below.

OpenAI Developers

65,080 views • 1 year ago