This paper from Google DeepMind is a landmark one. 📚 "Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters". It may have contributed to the o1 model from OpenAI, or the principle may have long been known to OpenAI. The paper basically says - Searching...

Rohan Paul

63,549 subscribers

48,980 views • 1 year ago • via X (Twitter)

Science & Technology Education

6 Comments

Rohan Paul • 1 year ago

📚

Jonas Vetterle • 1 year ago

@GoogleDeepMind these podcasts generated by are so good btw, great way to stay up to date if you don't have much time 😄

Rohan Paul • 1 year ago

@GoogleDeepMind Thanks Jonas. Yes great for a quick understanding within 5 minutes. In a single office commute, I can cover 4-5 papers.

GPT.Biz • 1 year ago

@GoogleDeepMind This sounds fascinating! Definitely worth a read if you're interested in how compute strategies can boost LLM performance. Thanks for sharing!

Uncle J • 1 year ago

@GoogleDeepMind Absolutely, it’s fascinating to see how these insights can reshape our understanding of model efficiency. The interplay between compute and parameters is such an important topic right now. Looking forward to diving deeper into the paper!

Mark G • 1 year ago

@GoogleDeepMind I don’t know who came first, Google or OpenAI, but rebalancing the compute load from training to inference is a great idea. (Just gotta run it on Groq or a new chip from @sama.)

Related Videos

Paper - "LLMs Will Always Hallucinate, and We Need to Live With This" Podcast format generated with Google's new illuminate tool (illuminate is trained to produce short podcast from research papers)

Rohan Paul

36,537 views • 1 year ago

A crypto project actually trained a 72B parameter AI model from scratch using decentralized GPU compute. Not fine-tuned, not a wrapper: trained from zero. The model benchmarks competitively against Meta's LLaMA 3 on reasoning tasks, and the entire training run cost a fraction of what centralized labs spend. If decentralized compute can produce frontier-class models, the moat around OpenAI and Anthropic is thinner than people think.

VirtualBacon

40,543 views • 2 months ago

#M5StackNew 🎊 The LLM630 Compute Kit is an #AI large language model (#LLM) inference development kit, powered by the #Axera #AX630C SoC with a 3.2 TOPs NPU, it delivers efficient AI inference for tasks like computer vision (CV) and LLM processing.

M5Stack

16,383 views • 1 year ago

OpenAI's Noam Brown says that while AI model performance scales roughly equivalently with more training or inference compute, the cost of inference is on the order of 100 billion times cheaper

Tsarathustra

250,828 views • 1 year ago

OpenAI's Noam Brown says the hardest research problem on the road to superintelligence has been solved in the form of scaling inference-time compute

Tsarathustra

84,310 views • 1 year ago

ELON MUSK: "The most obvious one is actually solar powered AI satellites, sort of to move the AI to orbit and essentially deep space over time, because there you can actually access over a billion times more energy from the sun in deep space than you can on Earth. The scaling to Kardashev, which is using some non trivial amount of energy from the sun. You kind of have to do space solar power. Like to even use a millionth of a percent of the sun's energy you you really have to be have your solar power in deep space."

DogeDesigner

266,777 views • 7 months ago

The creator of High Bandwidth Memory said something that reframes the entire AI investment thesis: AI equals memory (save this). Most people still think about AI hardware through a training lens. During training, the bottleneck is raw compute: GPUs stay near 100% utilization crunching through billions of gradient updates. Inference is a completely different problem. When a model generates a response, it produces tokens one at a time, and at every single step the entire model has to be loaded from memory into the processor to generate just one token. The GPU cores sit there, waiting for data to arrive. This is what engineers mean when they say inference is memory bound: the bottleneck is not how many calculations you can do per second but how fast you can move data from memory to the chip. Adding more GPUs does not fix a memory bandwidth problem; it just gives you more processors starving for the same data. Modern LLMs use a KV cache, a data structure that stores the conversation's context so the model does not have to recompute it from scratch on each step. The KV cache is what gives a model its memory of the conversation. It grows with every token, and for long documents or deep reasoning chains it can dwarf the model weights themselves in memory consumption. This means memory directly determines how long a context the model can hold, how many users you can serve simultaneously, how fast it responds, and how cheaply you can run it. A memory-constrained model is not just slower but qualitatively worse: it forgets earlier parts of the conversation, truncates context, and hallucinates more because it literally cannot hold the relevant information long enough to use it. The world now spends more on inference than training, and every ChatGPT query, every Claude document analysis, every API call is an inference workload. Inference economics (cost per token, latency, context length, concurrent users) are memory problems first and compute problems second. The companies that control memory bandwidth and supply are not suppliers to the AI trade but rather are the AI trade. Long Micron! Follow me Melvin for more AI, semis and the next big market themes.

Melvin

47,148 views • 2 days ago
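
The claim in Melvin's post that the KV cache grows with every token and can dwarf the model weights can be checked with simple arithmetic. The sketch below is a rough estimate only. The model shape (32 layers, 32 key/value heads, head dimension 128, 7 billion parameters, 2-byte values) is an assumed, hypothetical configuration and does not come from the post, and the function names are illustrative.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Estimate KV cache size: one key and one value vector per layer,
    per KV head, per token, per sequence in the batch."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


def weight_bytes(n_params, bytes_per_elem=2):
    """Estimate the memory taken by the model weights alone."""
    return n_params * bytes_per_elem


def breakeven_tokens(n_params, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Smallest total token count at which the KV cache is at least as
    large as the weights (ceiling division)."""
    per_token = kv_cache_bytes(n_layers, n_kv_heads, head_dim, 1,
                               bytes_per_elem=bytes_per_elem)
    return -(-weight_bytes(n_params, bytes_per_elem) // per_token)


if __name__ == "__main__":
    # Hypothetical 7B-parameter model shape, assumed for illustration.
    shape = dict(n_layers=32, n_kv_heads=32, head_dim=128)
    gib = 1024 ** 3
    print(kv_cache_bytes(seq_len=4096, **shape) / gib, "GiB of KV cache at 4096 tokens")
    print(weight_bytes(7_000_000_000) / gib, "GiB of weights")
    print(breakeven_tokens(7_000_000_000, **shape), "tokens until the cache matches the weights")
```

Because the estimate is linear in `seq_len * batch_size`, the same token budget can be spent on one long context or on many short concurrent requests, which is the trade-off between context length and number of users that the post describes.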

Understanding OpenAI o1: Noam Brown on integrating reasoning into the model. Takeaways: - Avoid MCTS and current paradigm of using processes outside of the model during inference - Think about how to directly integrate reasoning into the model architecture

Casper Hansen

313,291 views • 1 year ago

How to train a reasoning model, specifically Magistral from Mistral AI? Here is a quick video walkthrough of the paper.

Sophia Yang, Ph.D.

31,752 views • 1 year ago

1. Sam Altman on Will All Startups Be Killed by OpenAI. There are 2 strategies to build on AI right now: (1) assume the model won’t improve; (2) assume OpenAI will stay on the same growth trajectory. 95% should bet on #2. We have a mission to improve the model, so we will steamroll you.

Harry Stebbings

212,271 views • 2 years ago

OpenAI CPO Kevin Weil says "the AI model you're using today is the worst AI model you will ever use for the rest of your life."

vitrupo

31,118 views • 1 year ago

AI inference is already live on UOMI Router. Access frontier open-source models through a decentralized inference network with: • Low-cost inference • OpenAI-compatible APIs • Verifiable compute • On-chain settlement Start building: And if you have idle GPUs, there's now a second way to participate. The UOMI Provider whitelist is officially open. Instead of letting your hardware sit unused, you can contribute GPU power to the UOMI Inference Network and earn from real AI demand. Join the Provider whitelist:

Uomi

18,754 views • 10 days ago

🆕Scaling Test Time Compute to Multi-Agent Civilizations, with Noam Brown We're excited to publish our full conversation with Noam Brown on the frontiers of the new reasoning paradigm at OpenAI! - first principles for starting the "Multi-Agents" team - what's not captured by the "System 1/System 2" analogy for inference time compute - how Ilya Sutskever convinced him that reasoning was closer than he thought - Deep Research is existence proof that RL generalizes beyond verifiable rewards - the relationship between AI for imperfect information games (like Poker, Stratego, Diplomacy) and reasoning Enjoy! on youtube, or wherever fine podcasts are sold.

Latent.Space

105,905 views • 1 year ago

Leaders of all top AI model makers from OpenAI, Anthropic, Google DeepMind, including Indian models like Sarvam, signing voluntary declarations on responsible AI development and use. At the historic #IndiaAIImpactSummit2026. Pardon the chatter behind

Mausam (IITD)

13,277 views • 4 months ago

Greg Brockman says OpenAI started with a belief that AGI would come from ideas, not compute But by 2017, scaling became the clearest path: 2x compute made agents 2x better 10 GW is a drop in the bucket compared to the compute needed for the AGI vision

Haider.

89,012 views • 8 months ago

OpenAI CFO Sarah Friar: The business case for AI is simple: more compute, more revenue OpenAI scaled from 200 megawatts to 2 gigawatts, and ARR climbed from $2 billion to over $20 billion, yet we remain absolutely constrained on compute "the signal from everywhere is real... we are in a paradigm shift"

Haider.

18,109 views • 5 months ago

GPT-5 Pro is genuinely a top tier model, it is the best. I know some GPT-5 reactions have been mixed (including from me), but OpenAI has made great strides with Pro. I have made 12 simulations that no other model would match, not from OpenAI, Google, xAI or Anthropic. All of them were 1-3 shots (mostly to fix small issues) and the output is really outstanding. OpenAI Developers

Peter Gostev

58,631 views • 10 months ago

Ever suspected a paper you’re reading is AI slop? You can now turn on AI detection mode on alphaXiv to visualize what is written by an AI and what is not. Now available for every research paper indexed on arXiv. Integrated with the latest detection model from 🚀

alphaXiv

97,491 views • 2 months ago