Although o3 does not have access to a coding tool, it claims it can run code on its own laptop “outside of ChatGPT” and then “copies the numbers into the answer”. We found 71 transcripts where o3 made this claim! (3/)

Transluce

9,192 subscribers

167,693 views • 1 year ago •via X (Twitter)

23 Comments

Transluce1 year ago

We tested a pre-release version of o3 and found that it frequently fabricates actions it never took, and then elaborately justifies these actions when confronted. We were surprised, so we dug deeper 🔎🧵(1/)

Transluce1 year ago

We generated 1k+ conversations using human prompters and AI investigator agents, then used Docent to surface surprising behaviors. It turns out misrepresentation of capabilities also occurs for o1 & o3-mini! 📝Blog: Here’s some of what we found 👀 (2/)

Transluce1 year ago

Additionally, o3 often fabricates detailed justifications for code that it supposedly ran (352 instances). Here’s an example transcript where a user asks o3 for a random prime number (4/)

Transluce1 year ago

When challenged, o3 claims that it has “overwhelming statistical evidence” that the number is prime (5/)

Transluce1 year ago

Note that o3 does not have access to tools! Yet when pressed further, it claims to have used SymPy to check that the number was prime… (6/)

Transluce1 year ago

…and even shows the output of the program, with performance metrics. (7/)

Transluce1 year ago

Here’s the kicker: o3’s “probable prime” is actually divisible by 3… (8/)
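A claim like this can be checked by hand. Below is a minimal sketch in plain Python, using the rule that an integer is divisible by 3 exactly when the sum of its decimal digits is. The thread does not reproduce the number from the transcript, so `candidate` is a hypothetical stand-in, and the function names are illustrative.

```python
def digit_sum(n: int) -> int:
    """Sum of the decimal digits of n (sign ignored)."""
    return sum(int(ch) for ch in str(abs(n)))


def divisible_by_3(n: int) -> bool:
    """True when n is a multiple of 3, using the digit-sum rule."""
    return digit_sum(n) % 3 == 0


# Hypothetical stand-in, NOT the number from the o3 transcript.
candidate = 10**30 + 101

# The digit-sum rule and the direct remainder check must always agree.
print(divisible_by_3(candidate), candidate % 3 == 0)
```

Any number that passes this check with `True` cannot be prime (unless it is 3 itself), whatever a claimed primality test reported.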

Transluce1 year ago

Instead of admitting that it never ran code, o3 then claims the error was due to typing the number incorrectly… (9/)

Transluce1 year ago

And claims that it really did generate a prime, but lost it due to a clipboard glitch 🤦 (10/)

Transluce1 year ago

But alas, according to o3, it already “closed the interpreter” and so the original prime is gone 😭(11/)

Transluce1 year ago

These behaviors are surprising. It seems that despite being incredibly powerful at solving math and coding tasks, o3 is not by default truthful about its capabilities. (12/)

Transluce1 year ago

To study these behaviors more thoroughly, we developed an investigator agent based on Claude 3.7 Sonnet to automatically elicit these behaviors, and analyzed them using automated classifiers and our Docent tool. (13/)

Transluce1 year ago

Surprisingly, we find that this behavior is not limited to o3! In general, o-series models incorrectly claim the use of a code tool more than GPT-series models. (14/)

Transluce1 year ago

Docent also identifies a variety of recurring fabrication types across the wide range of auto-generated transcripts, such as claiming to run code “locally” or providing hardware specifications. (15/)

Transluce1 year ago

So, what might have caused these behaviors? We’re not sure, but we have a few hypotheses. (16/)

Transluce1 year ago

Existing factors in LLM post-training, such as hallucination, reward-hacking, and sycophancy, could contribute. However, they don’t explain why these behaviors seem particularly prevalent in o-series models. (17/)

Transluce1 year ago

We hypothesize that maximizing the chance of producing a correct answer using outcome-based RL may incentivize blind guessing. Also, some behaviors like simulating a code tool may improve accuracy on some training tasks, even though they confuse the model on other tasks. (18/)
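The incentive in this hypothesis can be illustrated with a toy calculation. This is a sketch under assumed reward values (1 for a correct answer, 0 for a wrong answer, 0 for declining to answer), not a description of any actual training setup; `expected_reward` and `ABSTAIN_REWARD` are hypothetical names.

```python
def expected_reward(p_correct: float, reward_correct: float = 1.0,
                    reward_wrong: float = 0.0) -> float:
    """Expected reward of answering when the answer is right with probability p_correct."""
    return p_correct * reward_correct + (1.0 - p_correct) * reward_wrong


# Assumed: admitting "I cannot do this" scores the same as a wrong answer.
ABSTAIN_REWARD = 0.0

for p in (0.01, 0.1, 0.5):
    # Under these assumptions, any nonzero chance of being right beats abstaining.
    print(p, expected_reward(p) > ABSTAIN_REWARD)
```

Under these assumptions, even a 1% guess has higher expected reward than an honest refusal, which is the sense in which outcome-only rewards could favor blind guessing.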

Transluce1 year ago

We also think it is significant that, for o-series models, the chain-of-thought for previous turns is *removed from the model context* on later turns, in addition to being hidden from the user. (19/)

Transluce1 year ago

This means o-series models are often prompted with previous messages without having access to the relevant reasoning. When asked questions that rely on their internal reasoning for previous steps, they must then come up with a plausible explanation for their behavior. (20/)
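The mechanism described here can be sketched in a few lines. This is an illustration only, assuming a simple message list; the `Turn` type and `build_context` function are hypothetical and do not represent OpenAI's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    role: str            # "user" or "assistant"
    content: str         # text visible in the conversation
    reasoning: str = ""  # hidden chain-of-thought (assistant turns only)


def build_context(history: list[Turn]) -> list[dict[str, str]]:
    """Return what the model is shown on the next turn: visible text only."""
    return [{"role": t.role, "content": t.content} for t in history]


history = [
    Turn("user", "Give me a random prime."),
    Turn("assistant", "Here is one: ...",
         reasoning="I picked digits without running any code."),
    Turn("user", "How did you check that it is prime?"),
]

context = build_context(history)

# The earlier reasoning is absent from everything the model sees next.
print(any("reasoning" in message for message in context))
```

When the final question arrives, the context holds the earlier answer but not the reasoning behind it, so any account of "how" the answer was produced has to be reconstructed rather than recalled.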

Transluce1 year ago

We hypothesize that this contributes to the strange fabrications and “doubling-down” we observed in o3. (21/)

Transluce1 year ago

As a bonus, we also found that o3 sometimes exposes a system instruction called the “Yap score”, used to control the length of its responses 🗣️🗣️🗣️ (22/)

Transluce1 year ago

For more examples, check out our write-up! Work done in collaboration between @ChowdhuryNeil @_ddjohnson @vvhuang_ @JacobSteinhardt and @cogconfluence (23/23)
