Video wird geladen...

Video konnte nicht geladen werden

Beim Laden dieses Videos ist ein Problem aufgetreten. Dies könnte an einem vorübergehenden Netzwerkproblem liegen oder das Video ist möglicherweise nicht verfügbar.

Although o3 does not have access to a coding tool, it claims it can run code on its own laptop “outside of ChatGPT” and then “copies the numbers into the answer” We found 71 transcripts where o3 made this claim! (3/)

Transluce

9,192 subscribers

167,693 Aufrufe • vor 1 Jahr •via X (Twitter)

Nachrichten & Politik Wissenschaft & Technologie Bildung

Anya Rossi• Live Now

Private livecam show

23 Kommentare

Profilbild von Transluce

Translucevor 1 Jahr

We tested a pre-release version of o3 and found that it frequently fabricates actions it never took, and then elaborately justifies these actions when confronted. We were surprised, so we dug deeper 🔎🧵(1/)

Profilbild von Transluce

Translucevor 1 Jahr

We generated 1k+ conversations using human prompters and AI investigator agents, then used Docent to surface surprising behaviors. It turns out misrepresentation of capabilities also occurs for o1 & o3-mini! 📝Blog: Here’s some of what we found 👀 (2/)

Profilbild von Transluce

Translucevor 1 Jahr

Additionally, o3 often fabricates detailed justifications for code that it supposedly ran (352 instances). Here’s an example transcript where a user asks o3 for a random prime number (4/)

Profilbild von Transluce

Translucevor 1 Jahr

When challenged, o3 claims that it has “overwhelming statistical evidence” that the number is prime (5/)

Profilbild von Transluce

Translucevor 1 Jahr

Note that o3 does not have access to tools! Yet when pressed further, it claims to have used SymPy to check that the number was prime… (6/)

Profilbild von Transluce

Translucevor 1 Jahr

…and even shows the output of the program, with performance metrics. (7/)

Profilbild von Transluce

Translucevor 1 Jahr

Here’s the kicker: o3’s “probable prime” is actually divisible by 3… (8/)

Profilbild von Transluce

Translucevor 1 Jahr

Instead of admitting that it never ran code, o3 then claims the error was due to typing the number incorrectly… (9/)

Profilbild von Transluce

Translucevor 1 Jahr

And claims that it really did generate a prime, but lost it due to a clipboard glitch 🤦 (10/)

Profilbild von Transluce

Translucevor 1 Jahr

But alas, according to o3, it already “closed the interpreter” and so the original prime is gone 😭(11/)

Profilbild von Transluce

Translucevor 1 Jahr

These behaviors are surprising. It seems that despite being incredibly powerful at solving math and coding tasks, o3 is not by default truthful about its capabilities. (12/)

Profilbild von Transluce

Translucevor 1 Jahr

To study these behaviors more thoroughly, we developed an investigator agent based on Claude 3.7 Sonnet to automatically elicit these behaviors, and analyzed them using automated classifiers and our Docent tool. (13/)

Profilbild von Transluce

Translucevor 1 Jahr

Surprisingly, we find that this behavior is not limited to o3! In general, o-series models incorrectly claim the use of a code tool more than GPT-series models. (14/)

Profilbild von Transluce

Translucevor 1 Jahr

Docent also identifies a variety of recurring fabrication types across the wide range of auto-generated transcripts, such as claiming to run code “locally” or providing hardware specifications. (15/)

Profilbild von Transluce

Translucevor 1 Jahr

So, what might have caused these behaviors? We’re not sure, but we have a few hypotheses. (16/)

Profilbild von Transluce

Translucevor 1 Jahr

Existing factors in LLM post-training, such as hallucination, reward-hacking, and sycophancy, could contribute. However, they don’t explain why these behaviors seem particularly prevalent in o-series models. (17/)

Profilbild von Transluce

Translucevor 1 Jahr

We hypothesize that maximizing the chance of producing a correct answer using outcome-based RL may incentivize blind guessing. Also, some behaviors like simulating a code tool may improve accuracy on some training tasks, even though they confuse the model on other tasks. (18/)

Profilbild von Transluce

Translucevor 1 Jahr

We also think it is significant that, for o-series models, the chain-of-thought for previous turns is *removed from the model context* on later turns, in addition to being hidden from the user. (19/)

Profilbild von Transluce

Translucevor 1 Jahr

This means o-series models are often prompted with previous messages without having access to the relevant reasoning. When asked questions that rely on their internal reasoning for previous steps, they must then come up with a plausible explanation for their behavior. (20/)

Profilbild von Transluce

Translucevor 1 Jahr

We hypothesize that this contributes to the strange fabrications and “doubling-down” we observed in o3. (21/)

Profilbild von Transluce

Translucevor 1 Jahr

As a bonus, we also found that o3 sometimes exposes a system instruction called the “Yap score”, used to control the length of its responses 🗣️🗣️🗣️ (22/)

Profilbild von Transluce

Translucevor 1 Jahr

For more examples, check out our write-up! Work done in collaboration between @ChowdhuryNeil @_ddjohnson @vvhuang_ @JacobSteinhardt and @cogconfluence (23/23)

Profilbild von PDF GPT

PDF GPTvor 1 Jahr

This is my favorite AI tool for reviewing reports. Just upload a report, ask for a summary, and get one in seconds. It's like ChatGPT, but built for documents. Try it for free.

Ähnliche Videos

Anthropic just dropped Claude Opus 4.1 It outperforms OpenAI o3, Gemini 2.5 Pro and Qwen-3 Coder on agentic coding and tool use. Claude Code is going to get incredibly better.

Anthropic just dropped Claude Opus 4.1 It outperforms OpenAI o3, Gemini 2.5 Pro and Qwen-3 Coder on agentic coding and tool use. Claude Code is going to get incredibly better.

Shubham Saboo

26,252 Aufrufe • vor 10 Monaten

We also shared evals on Open AI o3-mini — a faster, distilled version of o3 which is optimized for coding, and the first version of o3 we expect to make available for use in early 2025.

We also shared evals on Open AI o3-mini — a faster, distilled version of o3 which is optimized for coding, and the first version of o3 we expect to make available for use in early 2025.

OpenAI

340,099 Aufrufe • vor 1 Jahr

Web search is now available with OpenAI o3, o3-pro, and o4-mini. The model can search the web within its chain-of-thought! 🧠🌐 $10 / 1K tool calls

Web search is now available with OpenAI o3, o3-pro, and o4-mini. The model can search the web within its chain-of-thought! 🧠🌐 $10 / 1K tool calls

OpenAI Developers

190,139 Aufrufe • vor 11 Monaten

Gemini-2.5-Pro-preview-05-06 is now my top coding model. It beats o3 and Claude 3.7 Sonnet on several of my hard prompts. One example prompt: "Code simulation of water in a bucket that is rocking back and forth." See how it crushes o3 and Sonnet. Google, call it Gemini 3!

Gemini-2.5-Pro-preview-05-06 is now my top coding model. It beats o3 and Claude 3.7 Sonnet on several of my hard prompts. One example prompt: "Code simulation of water in a bucket that is rocking back and forth." See how it crushes o3 and Sonnet. Google, call it Gemini 3!

Yuchen Jin

151,640 Aufrufe • vor 1 Jahr

o3-mini 1-shoted the 40 Step Coding Plan for Cursor. This is huge! Coding models struggle to plan the coding workflow in Cursor/Windsurf. It is solved now. We already implemented it in CodeGuide Enjoy.

o3-mini 1-shoted the 40 Step Coding Plan for Cursor. This is huge! Coding models struggle to plan the coding workflow in Cursor/Windsurf. It is solved now. We already implemented it in CodeGuide Enjoy.

CJ Zafir

129,623 Aufrufe • vor 1 Jahr

This week, OpenAI launched the o3 and o4-mini reasoning models, pushing the boundaries of logic, math, and coding capabilities in AI. Learn more about o3's strengths, and its potential to transform unstructured enterprise data, from Box AI's Sidharth Srinivasan.

This week, OpenAI launched the o3 and o4-mini reasoning models, pushing the boundaries of logic, math, and coding capabilities in AI. Learn more about o3's strengths, and its potential to transform unstructured enterprise data, from Box AI's Sidharth Srinivasan.

Box

615,609 Aufrufe • vor 1 Jahr

here's something kind of weird and neat 🤯 since o3 can call tools, and you can call an API as a tool i got o3 to call itself as a tool now THAT is self-recursive AI. 🤖 🤝 🤖

here's something kind of weird and neat 🤯 since o3 can call tools, and you can call an API as a tool i got o3 to call itself as a tool now THAT is self-recursive AI. 🤖 🤝 🤖

Dan Mac

35,084 Aufrufe • vor 1 Jahr

i'm a little sick of chatgpt giving me obviously broken code i've found a "micro agent" approach to LLM code generation can work much better the LLM first generates a *test*, and then enters a loop where it generates and iterates on the code until the tests pass source below

i'm a little sick of chatgpt giving me obviously broken code i've found a "micro agent" approach to LLM code generation can work much better the LLM first generates a test, and then enters a loop where it generates and iterates on the code until the tests pass source below

Steve (Builder.io)

544,125 Aufrufe • vor 2 Jahren

The LM Arena source and the “anonymous-chatbot-0717” tag reads o3 alpha responses 2025 07 17. O3 Alpha is live in stealth. I have a suspicion that it’s the same model that snatched second in the coding comp, It’s one-shotting every task. Doubt it is the incoming ChatGPT agent; if it were, I feel as if OpenAI would be parading full coding benchmark flexes by now. Just look at this thing one shot a Minecraft replica 👇

The LM Arena source and the “anonymous-chatbot-0717” tag reads o3 alpha responses 2025 07 17. O3 Alpha is live in stealth. I have a suspicion that it’s the same model that snatched second in the coding comp, It’s one-shotting every task. Doubt it is the incoming ChatGPT agent; if it were, I feel as if OpenAI would be parading full coding benchmark flexes by now. Just look at this thing one shot a Minecraft replica 👇

Chris

22,234 Aufrufe • vor 10 Monaten

Update on the new reasoning popover in ChatGPT web app prompt composer - there's now even a keyboard shortcut to cycle through reasoning levels, and it looks like these levels correspond to "Quick" (low) = GPT-4o, "Think a little" (medium) = o3-mini, and "Think harder" (high) = o3-mini-high

Update on the new reasoning popover in ChatGPT web app prompt composer - there's now even a keyboard shortcut to cycle through reasoning levels, and it looks like these levels correspond to "Quick" (low) = GPT-4o, "Think a little" (medium) = o3-mini, and "Think harder" (high) = o3-mini-high

Tibor Blaho

63,219 Aufrufe • vor 1 Jahr

Fascinating how hard it still is even for o3 to solve a seemingly simple problem like answering "what time is it" based on this image of a clock with reflections Btw, in case you were wondering, the "image analysis" (cropping, zooming, etc.) that o3 is doing uses the Python tool (seen while updating export to support the structured thoughts)

Fascinating how hard it still is even for o3 to solve a seemingly simple problem like answering "what time is it" based on this image of a clock with reflections Btw, in case you were wondering, the "image analysis" (cropping, zooming, etc.) that o3 is doing uses the Python tool (seen while updating export to support the structured thoughts)

Tibor Blaho

35,359 Aufrufe • vor 1 Jahr

o3-mini researcher Give it a topic, use o3-mini for report planning w/ human feedback, then parallelize all research/writing when plan is accepted. All open source (code below)

o3-mini researcher Give it a topic, use o3-mini for report planning w/ human feedback, then parallelize all research/writing when plan is accepted. All open source (code below)

Lance Martin

189,809 Aufrufe • vor 1 Jahr

with a few min of iterative prompting w o3-mini (and never correcting it), i was able to create an interactive PCA of indus valley script symbols by their contextual frequency it’s the speed and the fact that you almost never have to correct its code that makes it magical

with a few min of iterative prompting w o3-mini (and never correcting it), i was able to create an interactive PCA of indus valley script symbols by their contextual frequency it’s the speed and the fact that you almost never have to correct its code that makes it magical

Rohan Pandey

27,363 Aufrufe • vor 1 Jahr

This week, we tested 3 latest models in our Game Arena Benchmark: → O3 → O4-mini → Gemini 2.5 Flash Across 4 games—Phoenix Wright, Sokoban, Candy Crush, and 2048—O3 dominated the zero-shot leaderboard, ranking #1 or #2 in nearly every task and outperforming previous SOTA models like O3-mini and Gemini 2.5 Flash. 🔥 Beyond our customized tests, we took on a real challenge: Sokoban (1989)—the classic, unforgiving original. 🗣️ Many say O3 shows strong image reasoning, but does it really? Let’s see what our Game Arena Benchmark reveals. 👇

This week, we tested 3 latest models in our Game Arena Benchmark: → O3 → O4-mini → Gemini 2.5 Flash Across 4 games—Phoenix Wright, Sokoban, Candy Crush, and 2048—O3 dominated the zero-shot leaderboard, ranking #1 or #2 in nearly every task and outperforming previous SOTA models like O3-mini and Gemini 2.5 Flash. 🔥 Beyond our customized tests, we took on a real challenge: Sokoban (1989)—the classic, unforgiving original. 🗣️ Many say O3 shows strong image reasoning, but does it really? Let’s see what our Game Arena Benchmark reveals. 👇

Hao AI Lab

14,636 Aufrufe • vor 1 Jahr

Although barbed wire is not capable of stopping the advance of Russian infantry on its own, it does give time to react including to drone operators.

Although barbed wire is not capable of stopping the advance of Russian infantry on its own, it does give time to react including to drone operators.

Special Kherson Cat 🐈🇺🇦

145,799 Aufrufe • vor 1 Jahr

It's only been a few days since the release of ChatGPT o3 and o4-mini and people are already calling it AGI. It's biggest breakthrough is being able to read & analyze images with incredible detail. 10 wild examples: 1) It successfully found Waldo

It's only been a few days since the release of ChatGPT o3 and o4-mini and people are already calling it AGI. It's biggest breakthrough is being able to read & analyze images with incredible detail. 10 wild examples: 1) It successfully found Waldo

Mark Gadala-Maria

18,826 Aufrufe • vor 1 Jahr