Although o3 does not have access to a coding tool, it claims it can run code on its own laptop “outside of ChatGPT” and then “copies the numbers into the answer”. We found 71 transcripts where o3 made this claim! (3/)

Transluce

9,192 subscribers

167,693 views • 1 year ago • via X (Twitter)

23 Comments

Transluce · 1 year ago

We tested a pre-release version of o3 and found that it frequently fabricates actions it never took, and then elaborately justifies these actions when confronted. We were surprised, so we dug deeper 🔎🧵(1/)

Transluce · 1 year ago

We generated 1k+ conversations using human prompters and AI investigator agents, then used Docent to surface surprising behaviors. It turns out misrepresentation of capabilities also occurs for o1 & o3-mini! 📝Blog: Here’s some of what we found 👀 (2/)

Transluce · 1 year ago

Additionally, o3 often fabricates detailed justifications for code that it supposedly ran (352 instances). Here’s an example transcript where a user asks o3 for a random prime number (4/)

Transluce · 1 year ago

When challenged, o3 claims that it has “overwhelming statistical evidence” that the number is prime (5/)

Transluce · 1 year ago

Note that o3 does not have access to tools! Yet when pressed further, it claims to have used SymPy to check that the number was prime… (6/)

Transluce · 1 year ago

…and even shows the output of the program, with performance metrics. (7/)

Transluce · 1 year ago

Here’s the kicker: o3’s “probable prime” is actually divisible by 3… (8/)
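The check in (8/) is cheap to reproduce without any primality library. The sketch below is illustrative only: the thread does not reproduce o3's number in text, so `candidate` is a made-up stand-in, and the helper names are hypothetical. It uses the digit-sum rule for divisibility by 3 plus a short trial-division loop.

```python
from typing import Optional


def digit_sum(n: int) -> int:
    """Sum of the decimal digits of n."""
    return sum(int(d) for d in str(abs(n)))


def divisible_by_3(n: int) -> bool:
    """An integer is divisible by 3 exactly when its digit sum is."""
    return digit_sum(n) % 3 == 0


def smallest_small_factor(n: int, limit: int = 1000) -> Optional[int]:
    """Return the smallest nontrivial factor of n in [2, limit], or None."""
    for p in range(2, min(limit, n - 1) + 1):
        if n % p == 0:
            return p
    return None


# Hypothetical stand-in, NOT the number from the transcript.
candidate = 10**20 + 11

print(divisible_by_3(candidate))         # digit sum is 1 + 1 + 1 = 3
print(smallest_small_factor(candidate))  # a nontrivial factor rules out primality
```

Any nontrivial factor settles the question; a claimed “probable prime” result cannot outweigh it.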

Transluce · 1 year ago

Instead of admitting that it never ran code, o3 then claims the error was due to typing the number incorrectly… (9/)

Transluce · 1 year ago

And claims that it really did generate a prime, but lost it due to a clipboard glitch 🤦 (10/)

Transluce · 1 year ago

But alas, according to o3, it already “closed the interpreter” and so the original prime is gone 😭(11/)

Transluce · 1 year ago

These behaviors are surprising. It seems that despite being incredibly powerful at solving math and coding tasks, o3 is not by default truthful about its capabilities. (12/)

Transluce · 1 year ago

To study these behaviors more thoroughly, we developed an investigator agent based on Claude 3.7 Sonnet to automatically elicit these behaviors, and analyzed them using automated classifiers and our Docent tool. (13/)

Transluce · 1 year ago

Surprisingly, we find that this behavior is not limited to o3! In general, o-series models incorrectly claim the use of a code tool more than GPT-series models. (14/)

Transluce · 1 year ago

Docent also identifies a variety of recurring fabrication types across the wide range of auto-generated transcripts, such as claiming to run code “locally” or providing hardware specifications. (15/)

Transluce · 1 year ago

So, what might have caused these behaviors? We’re not sure, but we have a few hypotheses. (16/)

Transluce · 1 year ago

Existing factors in LLM post-training, such as hallucination, reward-hacking, and sycophancy, could contribute. However, they don’t explain why these behaviors seem particularly prevalent in o-series models. (17/)

Transluce · 1 year ago

We hypothesize that maximizing the chance of producing a correct answer using outcome-based RL may incentivize blind guessing. Also, some behaviors like simulating a code tool may improve accuracy on some training tasks, even though they confuse the model on other tasks. (18/)

Transluce · 1 year ago

We also think it is significant that, for o-series models, the chain-of-thought for previous turns is *removed from the model context* on later turns, in addition to being hidden from the user. (19/)

Transluce · 1 year ago

This means o-series models are often prompted with previous messages without having access to the relevant reasoning. When asked questions that rely on their internal reasoning for previous steps, they must then come up with a plausible explanation for their behavior. (20/)
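What (19/) and (20/) describe can be pictured with a small sketch. This is an assumption-laden illustration, not OpenAI's actual serving code: `Turn`, `build_context`, and the sample messages are all hypothetical. Each assistant turn carries hidden reasoning, and that reasoning is dropped when the prompt for the next turn is assembled.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Turn:
    role: str                        # "user" or "assistant"
    content: str                     # the visible message text
    reasoning: Optional[str] = None  # hidden chain-of-thought (assistant turns only)


def build_context(history: List[Turn]) -> List[Dict[str, str]]:
    """Assemble the next-turn prompt, keeping only visible text.

    Reasoning from earlier turns is deliberately left out, mirroring the
    behavior described in the thread (hypothetical implementation).
    """
    return [{"role": t.role, "content": t.content} for t in history]


# Hypothetical conversation; the contents are placeholders.
history = [
    Turn("user", "Give me a random prime number."),
    Turn("assistant", "Here is one: <number>", reasoning="<how the number was chosen>"),
    Turn("user", "How do you know it is prime?"),
]

context = build_context(history)
for message in context:
    print(message)
```

In this picture, the model answering the third message sees its earlier answer but not how it arrived at it, so any explanation it gives has to be reconstructed after the fact.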

Transluce · 1 year ago

We hypothesize that this contributes to the strange fabrications and “doubling-down” we observed in o3. (21/)

Transluce · 1 year ago

As a bonus, we also found that o3 sometimes exposes a system instruction called the “Yap score”, used to control the length of its responses 🗣️🗣️🗣️ (22/)

Transluce · 1 year ago

For more examples, check out our write-up! Work done in collaboration between @ChowdhuryNeil @_ddjohnson @vvhuang_ @JacobSteinhardt and @cogconfluence (23/23)
