Video yükleniyor...

Video Yüklenemedi

Ana Sayfaya Dön

Although o3 does not have access to a coding tool, it claims it can run code on its own laptop “outside of ChatGPT” and then “copies the numbers into the answer” We found 71 transcripts where o3 made this claim! (3/)

167,693 görüntüleme • 1 yıl önce •via X (Twitter)

23 Yorum

Transluce profil fotoğrafı
Transluce1 yıl önce

We tested a pre-release version of o3 and found that it frequently fabricates actions it never took, and then elaborately justifies these actions when confronted. We were surprised, so we dug deeper 🔎🧵(1/)

Transluce profil fotoğrafı
Transluce1 yıl önce

We generated 1k+ conversations using human prompters and AI investigator agents, then used Docent to surface surprising behaviors. It turns out misrepresentation of capabilities also occurs for o1 & o3-mini! 📝Blog: Here’s some of what we found 👀 (2/)

Transluce profil fotoğrafı
Transluce1 yıl önce

Additionally, o3 often fabricates detailed justifications for code that it supposedly ran (352 instances). Here’s an example transcript where a user asks o3 for a random prime number (4/)

Transluce profil fotoğrafı
Transluce1 yıl önce

When challenged, o3 claims that it has “overwhelming statistical evidence” that the number is prime (5/)

Transluce profil fotoğrafı
Transluce1 yıl önce

Note that o3 does not have access to tools! Yet when pressed further, it claims to have used SymPy to check that the number was prime… (6/)

Transluce profil fotoğrafı
Transluce1 yıl önce

…and even shows the output of the program, with performance metrics. (7/)

Transluce profil fotoğrafı
Transluce1 yıl önce

Here’s the kicker: o3’s “probable prime” is actually divisible by 3… (8/)

Transluce profil fotoğrafı
Transluce1 yıl önce

Instead of admitting that it never ran code, o3 then claims the error was due to typing the number incorrectly… (9/)

Transluce profil fotoğrafı
Transluce1 yıl önce

And claims that it really did generate a prime, but lost it due to a clipboard glitch 🤦 (10/)

Transluce profil fotoğrafı
Transluce1 yıl önce

But alas, according to o3, it already “closed the interpreter” and so the original prime is gone 😭(11/)

Transluce profil fotoğrafı
Transluce1 yıl önce

These behaviors are surprising. It seems that despite being incredibly powerful at solving math and coding tasks, o3 is not by default truthful about its capabilities. (12/)

Transluce profil fotoğrafı
Transluce1 yıl önce

To study these behaviors more thoroughly, we developed an investigator agent based on Claude 3.7 Sonnet to automatically elicit these behaviors, and analyzed them using automated classifiers and our Docent tool. (13/)

Transluce profil fotoğrafı
Transluce1 yıl önce

Surprisingly, we find that this behavior is not limited to o3! In general, o-series models incorrectly claim the use of a code tool more than GPT-series models. (14/)

Transluce profil fotoğrafı
Transluce1 yıl önce

Docent also identifies a variety of recurring fabrication types across the wide range of auto-generated transcripts, such as claiming to run code “locally” or providing hardware specifications. (15/)

Transluce profil fotoğrafı
Transluce1 yıl önce

So, what might have caused these behaviors? We’re not sure, but we have a few hypotheses. (16/)

Transluce profil fotoğrafı
Transluce1 yıl önce

Existing factors in LLM post-training, such as hallucination, reward-hacking, and sycophancy, could contribute. However, they don’t explain why these behaviors seem particularly prevalent in o-series models. (17/)

Transluce profil fotoğrafı
Transluce1 yıl önce

We hypothesize that maximizing the chance of producing a correct answer using outcome-based RL may incentivize blind guessing. Also, some behaviors like simulating a code tool may improve accuracy on some training tasks, even though they confuse the model on other tasks. (18/)

Transluce profil fotoğrafı
Transluce1 yıl önce

We also think it is significant that, for o-series models, the chain-of-thought for previous turns is *removed from the model context* on later turns, in addition to being hidden from the user. (19/)

Transluce profil fotoğrafı
Transluce1 yıl önce

This means o-series models are often prompted with previous messages without having access to the relevant reasoning. When asked questions that rely on their internal reasoning for previous steps, they must then come up with a plausible explanation for their behavior. (20/)

Transluce profil fotoğrafı
Transluce1 yıl önce

We hypothesize that this contributes to the strange fabrications and “doubling-down” we observed in o3. (21/)

Transluce profil fotoğrafı
Transluce1 yıl önce

As a bonus, we also found that o3 sometimes exposes a system instruction called the “Yap score”, used to control the length of its responses 🗣️🗣️🗣️ (22/)

Transluce profil fotoğrafı
Transluce1 yıl önce

For more examples, check out our write-up! Work done in collaboration between @ChowdhuryNeil @_ddjohnson @vvhuang_ @JacobSteinhardt and @cogconfluence (23/23)

PDF GPT profil fotoğrafı
PDF GPT1 yıl önce

This is my favorite AI tool for reviewing reports. Just upload a report, ask for a summary, and get one in seconds. It's like ChatGPT, but built for documents. Try it for free.

Benzer Videolar