Video yükleniyor...

Video Yüklenemedi

Bu video yüklenirken bir sorun oluştu. Bu geçici bir ağ sorunundan kaynaklanıyor olabilir veya video kullanılamıyor olabilir.

Ana Sayfaya Dön

Although o3 does not have access to a coding tool, it claims it can run code on its own laptop “outside of ChatGPT” and then “copies the numbers into the answer” We found 71 transcripts where o3 made this claim! (3/)

Transluce

9,437 subscribers

167,729 görüntüleme • 1 yıl önce •via X (Twitter)

Haberler & Politika Bilim & Teknoloji Eğitim

Anya Rossi• Live Now

Private livecam show

23 Yorum

Transluce profil fotoğrafı

Transluce1 yıl önce

We tested a pre-release version of o3 and found that it frequently fabricates actions it never took, and then elaborately justifies these actions when confronted. We were surprised, so we dug deeper 🔎🧵(1/)

Transluce profil fotoğrafı

Transluce1 yıl önce

We generated 1k+ conversations using human prompters and AI investigator agents, then used Docent to surface surprising behaviors. It turns out misrepresentation of capabilities also occurs for o1 & o3-mini! 📝Blog: Here’s some of what we found 👀 (2/)

Transluce profil fotoğrafı

Transluce1 yıl önce

Additionally, o3 often fabricates detailed justifications for code that it supposedly ran (352 instances). Here’s an example transcript where a user asks o3 for a random prime number (4/)

Transluce profil fotoğrafı

Transluce1 yıl önce

When challenged, o3 claims that it has “overwhelming statistical evidence” that the number is prime (5/)

Transluce profil fotoğrafı

Transluce1 yıl önce

Note that o3 does not have access to tools! Yet when pressed further, it claims to have used SymPy to check that the number was prime… (6/)

Transluce profil fotoğrafı

Transluce1 yıl önce

…and even shows the output of the program, with performance metrics. (7/)

Transluce profil fotoğrafı

Transluce1 yıl önce

Here’s the kicker: o3’s “probable prime” is actually divisible by 3… (8/)

Transluce profil fotoğrafı

Transluce1 yıl önce

Instead of admitting that it never ran code, o3 then claims the error was due to typing the number incorrectly… (9/)

Transluce profil fotoğrafı

Transluce1 yıl önce

And claims that it really did generate a prime, but lost it due to a clipboard glitch 🤦 (10/)

Transluce profil fotoğrafı

Transluce1 yıl önce

But alas, according to o3, it already “closed the interpreter” and so the original prime is gone 😭(11/)

Transluce profil fotoğrafı

Transluce1 yıl önce

These behaviors are surprising. It seems that despite being incredibly powerful at solving math and coding tasks, o3 is not by default truthful about its capabilities. (12/)

Transluce profil fotoğrafı

Transluce1 yıl önce

To study these behaviors more thoroughly, we developed an investigator agent based on Claude 3.7 Sonnet to automatically elicit these behaviors, and analyzed them using automated classifiers and our Docent tool. (13/)

Transluce profil fotoğrafı

Transluce1 yıl önce

Surprisingly, we find that this behavior is not limited to o3! In general, o-series models incorrectly claim the use of a code tool more than GPT-series models. (14/)

Transluce profil fotoğrafı

Transluce1 yıl önce

Docent also identifies a variety of recurring fabrication types across the wide range of auto-generated transcripts, such as claiming to run code “locally” or providing hardware specifications. (15/)

Transluce profil fotoğrafı

Transluce1 yıl önce

So, what might have caused these behaviors? We’re not sure, but we have a few hypotheses. (16/)

Transluce profil fotoğrafı

Transluce1 yıl önce

Existing factors in LLM post-training, such as hallucination, reward-hacking, and sycophancy, could contribute. However, they don’t explain why these behaviors seem particularly prevalent in o-series models. (17/)

Transluce profil fotoğrafı

Transluce1 yıl önce

We hypothesize that maximizing the chance of producing a correct answer using outcome-based RL may incentivize blind guessing. Also, some behaviors like simulating a code tool may improve accuracy on some training tasks, even though they confuse the model on other tasks. (18/)

Transluce profil fotoğrafı

Transluce1 yıl önce

We also think it is significant that, for o-series models, the chain-of-thought for previous turns is *removed from the model context* on later turns, in addition to being hidden from the user. (19/)

Transluce profil fotoğrafı

Transluce1 yıl önce

This means o-series models are often prompted with previous messages without having access to the relevant reasoning. When asked questions that rely on their internal reasoning for previous steps, they must then come up with a plausible explanation for their behavior. (20/)

Transluce profil fotoğrafı

Transluce1 yıl önce

We hypothesize that this contributes to the strange fabrications and “doubling-down” we observed in o3. (21/)

Transluce profil fotoğrafı

Transluce1 yıl önce

As a bonus, we also found that o3 sometimes exposes a system instruction called the “Yap score”, used to control the length of its responses 🗣️🗣️🗣️ (22/)

Transluce profil fotoğrafı

Transluce1 yıl önce

For more examples, check out our write-up! Work done in collaboration between @ChowdhuryNeil @_ddjohnson @vvhuang_ @JacobSteinhardt and @cogconfluence (23/23)

PDF GPT profil fotoğrafı

PDF GPT2 yıl önce

This is my favorite AI tool for reviewing reports. Just upload a report, ask for a summary, and get one in seconds. It's like ChatGPT, but built for documents. Try it for free.

Benzer Videolar

i'm a little sick of chatgpt giving me obviously broken code i've found a "micro agent" approach to LLM code generation can work much better the LLM first generates a *test*, and then enters a loop where it generates and iterates on the code until the tests pass source below

i'm a little sick of chatgpt giving me obviously broken code i've found a "micro agent" approach to LLM code generation can work much better the LLM first generates a test, and then enters a loop where it generates and iterates on the code until the tests pass source below

Steve (Builder.io)

544,197 görüntüleme • 2 yıl önce

The LM Arena source and the “anonymous-chatbot-0717” tag reads o3 alpha responses 2025 07 17. O3 Alpha is live in stealth. I have a suspicion that it’s the same model that snatched second in the coding comp, It’s one-shotting every task. Doubt it is the incoming ChatGPT agent; if it were, I feel as if OpenAI would be parading full coding benchmark flexes by now. Just look at this thing one shot a Minecraft replica 👇

The LM Arena source and the “anonymous-chatbot-0717” tag reads o3 alpha responses 2025 07 17. O3 Alpha is live in stealth. I have a suspicion that it’s the same model that snatched second in the coding comp, It’s one-shotting every task. Doubt it is the incoming ChatGPT agent; if it were, I feel as if OpenAI would be parading full coding benchmark flexes by now. Just look at this thing one shot a Minecraft replica 👇

Chris

22,234 görüntüleme • 1 yıl önce

This week, we tested 3 latest models in our Game Arena Benchmark: → O3 → O4-mini → Gemini 2.5 Flash Across 4 games—Phoenix Wright, Sokoban, Candy Crush, and 2048—O3 dominated the zero-shot leaderboard, ranking #1 or #2 in nearly every task and outperforming previous SOTA models like O3-mini and Gemini 2.5 Flash. 🔥 Beyond our customized tests, we took on a real challenge: Sokoban (1989)—the classic, unforgiving original. 🗣️ Many say O3 shows strong image reasoning, but does it really? Let’s see what our Game Arena Benchmark reveals. 👇

This week, we tested 3 latest models in our Game Arena Benchmark: → O3 → O4-mini → Gemini 2.5 Flash Across 4 games—Phoenix Wright, Sokoban, Candy Crush, and 2048—O3 dominated the zero-shot leaderboard, ranking #1 or #2 in nearly every task and outperforming previous SOTA models like O3-mini and Gemini 2.5 Flash. 🔥 Beyond our customized tests, we took on a real challenge: Sokoban (1989)—the classic, unforgiving original. 🗣️ Many say O3 shows strong image reasoning, but does it really? Let’s see what our Game Arena Benchmark reveals. 👇

Hao AI Lab

14,636 görüntüleme • 1 yıl önce

with a few min of iterative prompting w o3-mini (and never correcting it), i was able to create an interactive PCA of indus valley script symbols by their contextual frequency it’s the speed and the fact that you almost never have to correct its code that makes it magical

with a few min of iterative prompting w o3-mini (and never correcting it), i was able to create an interactive PCA of indus valley script symbols by their contextual frequency it’s the speed and the fact that you almost never have to correct its code that makes it magical

Rohan Pandey

27,363 görüntüleme • 1 yıl önce

Although barbed wire is not capable of stopping the advance of Russian infantry on its own, it does give time to react including to drone operators.

Although barbed wire is not capable of stopping the advance of Russian infantry on its own, it does give time to react including to drone operators.

Special Kherson Cat 🐈🇺🇦

145,799 görüntüleme • 1 yıl önce

Pair programming with ChatGPT with access to my file system that can: - edit files on its own - run ANY bash command - write, run, debug Python & JS code - create & run Git commands I didn't even know has been the most exhilarating experience since I started coding 11 years ago

Pair programming with ChatGPT with access to my file system that can: - edit files on its own - run ANY bash command - write, run, debug Python & JS code - create & run Git commands I didn't even know has been the most exhilarating experience since I started coding 11 years ago

YK aka CS Dojo 📺🐦

234,664 görüntüleme • 3 yıl önce

These two college track athletes say they built a $20K MRR app with zero coding experience "just by talking to AI". "We went to Rork, and we literally just talked to it like ChatGPT. Like, we tell it, hey, we want the UI interface to be so-and-so." "Back in the day, you'd have to scan 300 lines of code, find the exact code, and change it. You don't have to do that. It does it for you."

These two college track athletes say they built a $20K MRR app with zero coding experience "just by talking to AI". "We went to Rork, and we literally just talked to it like ChatGPT. Like, we tell it, hey, we want the UI interface to be so-and-so." "Back in the day, you'd have to scan 300 lines of code, find the exact code, and change it. You don't have to do that. It does it for you."

Starter Story

44,634 görüntüleme • 27 gün önce

You can just take academic papers and paste them into Gemini 2.5/ChatGPT o3/Claude 4 with the prompt "build me a game based on this paper, make it interesting and thematic but still conveying key findings" and get a tiny working educational game. (In this case, I used Gemini)

You can just take academic papers and paste them into Gemini 2.5/ChatGPT o3/Claude 4 with the prompt "build me a game based on this paper, make it interesting and thematic but still conveying key findings" and get a tiny working educational game. (In this case, I used Gemini)

Ethan Mollick

137,427 görüntüleme • 1 yıl önce

Does ChatGPT think it's a danger to society? 👀 🎧 Hear the answer and what Conner & Micah have to say about it on this week's episode of the Culture Brief podcast

Does ChatGPT think it's a danger to society? 👀 🎧 Hear the answer and what Conner & Micah have to say about it on this week's episode of the Culture Brief podcast

Denison Forum

85,575 görüntüleme • 1 yıl önce

🌐 O3 Layer Incentivized Testnet – Applications Now Open O3 Layer is launching its incentivized testnet, inviting dApps and developers to be part of the first modular Layer 3 ecosystem on Bitcoin. ✅ Test and optimize your dApp in a scalable, high-performance environment ✅ Gain early access, incentives, and ecosystem support ✅ Join a pioneering network driving Bitcoin’s next evolution Apply now by filling out the form below: 🔗 Be among the first to explore the potential of Bitcoin’s modular future with O3 Layer. Apply now!

🌐 O3 Layer Incentivized Testnet – Applications Now Open O3 Layer is launching its incentivized testnet, inviting dApps and developers to be part of the first modular Layer 3 ecosystem on Bitcoin. ✅ Test and optimize your dApp in a scalable, high-performance environment ✅ Gain early access, incentives, and ecosystem support ✅ Join a pioneering network driving Bitcoin’s next evolution Apply now by filling out the form below: 🔗 Be among the first to explore the potential of Bitcoin’s modular future with O3 Layer. Apply now!

O3 Layer

29,671 görüntüleme • 1 yıl önce

JOHN SOLOMON: The John Thune-run Senate has STILL NOT GOTTEN ITS OWN PARTY'S JUSTICE DEPARTMENT THE TRANSCRIPTS IT NEEDS TO MAKE A FINAL DECISION ON JOHN BRENNAN'S INDICTMENT! John Solomon

JOHN SOLOMON: The John Thune-run Senate has STILL NOT GOTTEN ITS OWN PARTY'S JUSTICE DEPARTMENT THE TRANSCRIPTS IT NEEDS TO MAKE A FINAL DECISION ON JOHN BRENNAN'S INDICTMENT! John Solomon

Bannon’s WarRoom

50,495 görüntüleme • 3 ay önce

We tried OWL with 𝐆𝐞𝐦𝐢𝐧𝐢 𝟐.𝟓 𝐏𝐫𝐨 (the results were seriously impressive) We gave the agent a task: "Research Gemini 2.5 Pro, get its benchmark scores, write Python code to plot them, and run it." It did everything on its own: ✔️ Collected benchmark data ✔️ Wrote clean python code ✔️ Saved both the chart + code locally No manual steps. Just full-on autonomous execution. This is how agent workflows should feel in 2025. You can try it, check the code, or run your own workflows:

We tried OWL with 𝐆𝐞𝐦𝐢𝐧𝐢 𝟐.𝟓 𝐏𝐫𝐨 (the results were seriously impressive) We gave the agent a task: "Research Gemini 2.5 Pro, get its benchmark scores, write Python code to plot them, and run it." It did everything on its own: ✔️ Collected benchmark data ✔️ Wrote clean python code ✔️ Saved both the chart + code locally No manual steps. Just full-on autonomous execution. This is how agent workflows should feel in 2025. You can try it, check the code, or run your own workflows:

CAMEL-AI.org

20,243 görüntüleme • 1 yıl önce

This is UNREAL ChatGPT o3 is legit AGI. I asked it to build me a game and you honestly won't believe what it did next By the end of this video you'll have a fully working 3D game and launch plan, even if you've never coded before (trust me, you want to bookmark this)

This is UNREAL ChatGPT o3 is legit AGI. I asked it to build me a game and you honestly won't believe what it did next By the end of this video you'll have a fully working 3D game and launch plan, even if you've never coded before (trust me, you want to bookmark this)

Alex Finn

133,846 görüntüleme • 1 yıl önce

A 2-person startup crossed $2M ARR with an AI agent doing the work of an ops hire. The agent was given read-only access to their codebase and database along with connected tools like Intercom, Stripe, CRM, and Fathom through CLIs. They routed Slack, email, and support requests into a task queue so the agent could pick up each task and run it in Claude Code. So when a customer asked about billing or product behavior, it could inspect how the business actually worked. Along with these tools, a coding agent was also provided. When the ops agent found a repeated task it could not do yet, the coding agent built a tool for it. That tool became permanent. Over time, this grew to 45+ internal tools. The agent also had an instruction.md where it stored the co-founder's feedback to avoid repeating its mistakes.

A 2-person startup crossed $2M ARR with an AI agent doing the work of an ops hire. The agent was given read-only access to their codebase and database along with connected tools like Intercom, Stripe, CRM, and Fathom through CLIs. They routed Slack, email, and support requests into a task queue so the agent could pick up each task and run it in Claude Code. So when a customer asked about billing or product behavior, it could inspect how the business actually worked. Along with these tools, a coding agent was also provided. When the ops agent found a repeated task it could not do yet, the coding agent built a tool for it. That tool became permanent. Over time, this grew to 45+ internal tools. The agent also had an instruction.md where it stored the co-founder's feedback to avoid repeating its mistakes.

rvivek

11,870 görüntüleme • 1 ay önce

The ChatGPT Mac app is the ultimate screenshot-to-code tool. Screenshot anything, paste it in the ChatGPT shortcut, and just tell GPT-4o to code it for you. Here's me taking a snapshot of Snake Game and getting fully working code in 90 seconds. Video is on 3x speed.

The ChatGPT Mac app is the ultimate screenshot-to-code tool. Screenshot anything, paste it in the ChatGPT shortcut, and just tell GPT-4o to code it for you. Here's me taking a snapshot of Snake Game and getting fully working code in 90 seconds. Video is on 3x speed.

Rowan Cheung

860,731 görüntüleme • 2 yıl önce

Sharing this as it's own post too because YES I still have the original PSD. Fun fact: I made this on my first laptop ever that believe it or not, my mom found in the trash. People would go to the abandoned land in front of my house and throw trash, and we would go and scavenge

Sharing this as it's own post too because YES I still have the original PSD. Fun fact: I made this on my first laptop ever that believe it or not, my mom found in the trash. People would go to the abandoned land in front of my house and throw trash, and we would go and scavenge

Luz Tapia Art

3,598,513 görüntüleme • 9 ay önce

Your AI agent is writing code nobody reviewed and running it on your machine. Most teams answer "where does it run" with a container and a hope. Introducing CreateOS Sandbox. Run code you cannot trust, with the networking, egress control, and self-hosting you need. 🧵

Your AI agent is writing code nobody reviewed and running it on your machine. Most teams answer "where does it run" with a container and a hope. Introducing CreateOS Sandbox. Run code you cannot trust, with the networking, egress control, and self-hosting you need. 🧵

NodeOps Network

13,792 görüntüleme • 23 gün önce

THIS GUY CONNECTED A NASA SATELLITE TO CLAUDE AND MADE $7,778 Pointed satellite data at 6 countries, the bot found anomalies on its own and traded geopolitics on Polymarket Dropped the entire bot on GitHub, grab it and run 👇

THIS GUY CONNECTED A NASA SATELLITE TO CLAUDE AND MADE $7,778 Pointed satellite data at 6 countries, the bot found anomalies on its own and traded geopolitics on Polymarket Dropped the entire bot on GitHub, grab it and run 👇

sopersone

20,615 görüntüleme • 3 ay önce