This paper from Google DeepMind is a landmark one. 📚 "Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters". It may have contributed to the o1 model from OpenAI, or the principle may have long been known to OpenAI. The paper basically says - Searching...

48,980 views • 1 year ago • via X (Twitter)

6 Comments

Rohan Paul • 1 year ago

📚

Jonas Vetterle • 1 year ago

@GoogleDeepMind these podcasts generated by are so good btw, great way to stay up to date if you don't have much time 😄

Rohan Paul • 1 year ago

@GoogleDeepMind Thanks Jonas. Yes great for a quick understanding within 5 minutes. In a single office commute, I can cover 4-5 papers.

GPT.Biz • 1 year ago

@GoogleDeepMind This sounds fascinating! Definitely worth a read if you're interested in how compute strategies can boost LLM performance. Thanks for sharing!

Uncle J • 1 year ago

@GoogleDeepMind Absolutely, it’s fascinating to see how these insights can reshape our understanding of model efficiency. The interplay between compute and parameters is such an important topic right now. Looking forward to diving deeper into the paper!

Mark G • 1 year ago

@GoogleDeepMind I don’t know who came first, Google or OpenAI, but rebalancing the compute load from training to inference is a great idea. (Just gotta run it on Groq or a new chip from @sama).

Related Videos

The creator of High Bandwidth Memory said something that reframes the entire AI investment thesis: AI equals memory. (Save this.)

Most people still think about AI hardware through a training lens. During training, the bottleneck is raw compute: GPUs stay near 100% utilization, crunching through billions of gradient updates. Inference is a completely different problem. When a model generates a response, it produces tokens one at a time, and at every single step the entire model has to be loaded from memory into the processor to generate just one token. The GPU cores sit there, waiting for data to arrive. This is what engineers mean when they say inference is memory bound: the bottleneck is not how many calculations you can do per second but how fast you can move data from memory to the chip. Adding more GPUs does not fix a memory bandwidth problem; it just gives you more processors starving for the same data.

Modern LLMs use a KV cache, a data structure that stores the attention keys and values for the tokens seen so far, so the model does not have to recompute them from scratch on each step. The KV cache is what gives a model its memory of the conversation. It grows with every token, and for long documents or deep reasoning chains it can dwarf the model weights themselves in memory consumption. This means memory directly determines how long a context the model can hold, how many users you can serve simultaneously, how fast it responds, and how cheaply you can run it. A memory-constrained model is not just slower but qualitatively worse: it forgets earlier parts of the conversation, truncates context, and hallucinates more because it literally cannot hold the relevant information long enough to use it.

The world now spends more on inference than training, and every ChatGPT query, every Claude document analysis, every API call is an inference workload. Inference economics (cost per token, latency, context length, concurrent users) are memory problems first and compute problems second.
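
To put rough numbers on the KV cache and memory-bound points above, here is a minimal back-of-the-envelope sketch in Python. Nothing in it comes from the post: the function names, the model shape (32 layers, 32 KV heads, head dimension 128, fp16), the 7-billion-parameter weight size, and the 1 TB/s bandwidth figure are all hypothetical values chosen for illustration. The speed estimate assumes every decoded token reads all weights plus the whole KV cache from memory once, and it ignores compute time, batching across users, and on-chip caching, so treat it as an optimistic upper bound rather than a benchmark.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Approximate bytes needed to cache attention keys and values.

    The leading factor of 2 covers one key tensor and one value tensor
    per layer. bytes_per_elem=2 corresponds to fp16/bf16 storage.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


def max_decode_tokens_per_sec(weight_bytes, kv_bytes, bandwidth_bytes_per_sec):
    """Bandwidth-only upper bound on decode speed for one stream.

    Assumes each generated token requires reading all weights and the
    full KV cache from memory exactly once (a simplification).
    """
    return bandwidth_bytes_per_sec / (weight_bytes + kv_bytes)


if __name__ == "__main__":
    GIB = 1024 ** 3

    # Hypothetical model shape and hardware, for illustration only.
    layers, kv_heads, head_dim = 32, 32, 128
    weight_bytes = 7e9 * 2   # 7 billion parameters stored in fp16
    bandwidth = 1e12         # 1 TB/s of memory bandwidth

    # Short chat (4,096 tokens, 1 user) vs. long context (32,768 tokens, 8 users).
    for seq_len, batch in [(4096, 1), (32768, 8)]:
        kv = kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch)
        rate = max_decode_tokens_per_sec(weight_bytes, kv, bandwidth)
        print(f"seq_len={seq_len}, batch={batch}: "
              f"KV cache = {kv / GIB:.0f} GiB, "
              f"decode steps/s <= {rate:.1f}")
```

Under these assumed numbers, the short chat needs a 2 GiB cache next to about 14 GB of weights, while the long-context, eight-user case needs 128 GiB, which is the "cache dwarfs the weights" situation the post describes. Real models that share key/value heads across query heads or quantize the cache would come out smaller.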
The companies that control memory bandwidth and supply are not suppliers to the AI trade; they are the AI trade. Long Micron! Follow me, Melvin, for more on AI, semis, and the next big market themes.

Melvin

47,148 views • 2 days ago