Peter Gostev

@petergostev • 11,966 subscribers

London 🇬🇧 AI Capability @ https://t.co/Y4VEGWofQ8 https://t.co/bkfw1nxdmJ

Shorts

That's quite cute in codex

237,841 views

GPT-5.5 by Reasoning Effort: I asked it in Codex to create a physics-based visualisation of RL cycles for different-sized models (70b, 1t, 10t), to demonstrate how the amount of RL you can do differs by model size.

My assessment of each:
- Low: weird slop
- Medium: kinda cooked
- High: sort of tried but ultimately incoherent
- Extra High: elite - really nice idea and well executed

Obviously this is just one shot, but it's worth trying different reasoning levels for the new models; medium seems to be pretty good for GPT-5.5 and it was really bad for many previous GPT models.

208,732 views

Got my dog as a Codex pet, but more interestingly got Codex to add rings that show my Codex limits. The outer ring is the 5-hour limit, the inner one the weekly limit.

164,795 views

GPT-5.4-Pro (Extended) This took 87m 90 seconds (I apologise Sam Altman), I'll pull together some very impressive results soon

232,032 views

GPT-5.5 is MUCH more reliable on longer running tasks - for the first time with any model. As we speak I have a migration that has been running for 7+ hours - this literally never happened before; the models would maybe run for 30 mins or, if you really shouted at them, for 2-3 hours. Last night I went to sleep, set a long running task, then queued up 10 prompts to 'keep it going'. It did not stop after the first prompt and kept going for 8+ hours, and I woke up to all the same prompts still queued up. The ability to run for a long time, in combination with the ability to validate with computer use & other tools, makes it much more useful for building real applications.

105,594 views

I think this is my favourite Sora 2 video I generated - Cleopatra visiting modern Egypt

371,270 views

I'm sure you've all noticed the 'AI is slowing down' news stories that have appeared every few weeks for multiple years now - so I've pulled a tracker together to see who wrote these stories and when. There is quite a range: some are just outright wrong, others point to a reasonable limitation at the time but miss the bigger arc of progress. All of these stories were appearing as we were getting reasoning models, open source models, increasing competition from more players and skyrocketing revenue for the labs. Link to the tracker in the comments.

37,443 views

New models on the lmarena.ai WebDev arena:
- Lobster
- Nectarine
- Starfish (not in this video)

In the video they are compared to the 'Anonymous Chatbot' (aka o3-Alpha) from 17th July.

Observations:
- Lobster is closest to o3-Alpha, but nowhere near as good
- Nectarine was not very impressive, below models like Kimi-K2
- Starfish: I didn't manage to capture it, but the one time I got it, it wasn't very good

20,961 views

Videos

Opus 4.7 - 400k vs 1m context - is there a difference?

I've heard Theo - t3.gg argue that Anthropic is unlikely to have offered a model with 1m context at the same cost unless it was a different (i.e. cheaper to serve) model. I did a test where I toggled the 1m default model on & off in Claude Code (otherwise default settings, xHigh reasoning) and compared the outputs across 3 generations each, with the same prompts etc.

My observations:
- Models feel DIFFERENT - often when you ask a model for the same generation, you get a somewhat different answer, but it feels & smells the same. Here 400k and 1m are very different every time
- 400k model seems better - not that 1m is trash and 400k is amazing, but there are definitely issues with the level of ambition and accuracy that the 1m model seems to have

Examples of 1m failing:
- Voxel Rome: the colosseum is nowhere near as impressive
- Golden Gate: cars go sideways, waves not very high, bridge goes into land; though the structure of the bridge is a bit better
- Stonehenge: structure is more 'wrong'; lighting, shadows & textures are flatter and not as rich

This isn't conclusive evidence of course, but at least to me the two models do not behave the same way. Anecdotally as well, when building, 1m felt like it was doing more weird validation (e.g. going around in circles) and 400k was more straightforward. These sorts of things are harder to capture in tests, but you'd notice them in Claude Code.

You can review the hosted generations and see the code & prompts in the links below.

Peter Gostev

29,009 views • 1 month ago