
Dwarkesh Patel
@dwarkesh_sp • 235,884 subscribers
Host of @dwarkeshpodcast https://t.co/3SXlu7fy6N https://t.co/4DPAxODFYi https://t.co/hQfIWdM1Un

Recently met Sasha Rush and he started giving me an impromptu lecture on how targeted on-policy self-distillation works. I asked him if I could record it on my iPhone. The basic idea is this: if the model made a mistake at some point in the rollout (for example, calling a tool that doesn't exist), we want to discourage this specific error, but we don't want to just learn from the final reward, because it's a very noisy signal spread out over the whole trajectory. So we have another model read this trajectory and figure out where the error was made. It simply inserts some hint tokens into the trajectory right above where the mistake was made. Now, with these injected hint tokens, have the model run a forward pass. You don't have to generate a new rollout - aka no new decode required. The hint causes the model to assign lower probabilities to the error tokens. You then train the original model to match these new probabilities, teaching it to downweight that specific mistake.
Dwarkesh Patel • 279,217 views • 1 day ago
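A rough sketch of the training signal described in the post above, on toy numbers rather than a real model. Everything here is an assumption made for illustration: the two-token vocabulary, the hand-written logits, the helper names, and the choice of KL(teacher || student) as the matching loss (the post only says the original model is trained to match the hinted probabilities). The "teacher" is the same model scored with the hint in context; the "student" is the model without the hint.

```python
import math

def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, error_positions):
    """Average KL(teacher || student) over the flagged token positions only."""
    losses = []
    for t in error_positions:
        teacher = softmax(teacher_logits[t])  # forward pass WITH the hint, used as a fixed target
        student = softmax(student_logits[t])  # original model, no hint in context
        losses.append(kl_divergence(teacher, student))
    return sum(losses) / len(losses)

# Hypothetical vocabulary: index 0 = call a real tool, index 1 = call a nonexistent tool.
# Logits per position for the SAME trajectory; no new tokens are decoded.
student_logits = [[0.0, 0.0], [0.0, 2.0]]  # at position 1 the model prefers the fake tool
teacher_logits = [[0.0, 0.0], [2.0, 0.0]]  # with the hint, it prefers the real tool

# Only position 1 was flagged as the mistake, so only that position is trained on.
loss = distillation_loss(student_logits, teacher_logits, error_positions=[1])
print(f"distillation loss at the flagged position: {loss:.4f}")
```

In a real setup the loss would be backpropagated through the student only, and the flagged positions would come from the model that reads the trajectory. Restricting the loss to those positions is what makes the signal targeted, as opposed to a single reward spread over the whole rollout.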

Mathematicians and scientists often peak in their 20s. Why? Maybe older scientists become stuck in their ways. Or maybe younger researchers feel free to be more creative. But Jacob Kimmel's hypothesis is that this isn't because of social factors at all - it's evolution:
Dwarkesh Patel • 127,504 views • 1 day ago

The Jensen Huang episode. 0:00:00 – Is Nvidia’s biggest moat its grip on scarce supply chains? 0:16:25 – Will TPUs break Nvidia’s hold on AI compute? 0:41:06 – Why doesn’t Nvidia become a hyperscaler? 0:57:36 – Should we be selling AI chips to China? 1:35:06 – Why doesn’t Nvidia make multiple different chip architectures? Look up Dwarkesh Podcast on YouTube, Apple Podcasts, Spotify, etc. Enjoy!
Dwarkesh Patel • 6,143,176 views • 1 month ago

New blackboard lecture w Reiner Pope. How do chips actually work – starting with basic logic gates, and working up to why GPUs, TPUs, FPGAs, and the human brain each look the way they do. 0:00:00 – Building a multiply-accumulate from logic gates 0:16:20 – Muxes and the cost of data movement 0:25:59 – How systolic arrays work 0:39:00 – Clock cycles and pipeline registers 0:51:40 – FPGAs vs ASICs 1:03:14 – Cache vs scratchpad 1:07:16 – Why CPU cores are much bigger than GPU cores 1:11:49 – Brains vs chips 1:15:22 – A GPU is just a bunch of tiny TPUs. Look up Dwarkesh Podcast on YouTube/Spotify/etc to watch. Enjoy!
Dwarkesh Patel • 919,006 views • 13 days ago

Over the last 200 years, we've automated away a lot of hard physical labour. But people still go to the gym. Indeed, many people today are more physically capable than people in the past. We can train systematically for whatever physical goal we want, and it's more fun than hard labour on a pre-modern farm. Andrej Karpathy's hope is that, in the future, the same will be true of learning. AI tutoring that's tailored to each person will make learning easy, and more people will want to do it. We will be able to go much further than our ancestors.
Dwarkesh Patel • 162,325 views • 3 days ago

If animals live longer, they can have more kids, and pass on more of their genes. So why hasn't evolution solved aging? Jacob Kimmel explained the three evolutionary reasons why we still grow old: (1) Longevity doesn't actually help that much. Most of our primate ancestors didn't die of natural causes: if you are probably going to be eaten by a tiger, a naturally long lifespan is pointless. (2) Some genes might benefit from aging! If an older animal dies, often the resources they would have consumed are instead consumed by younger kin, who are genetically similar to them but more likely to reproduce. (3) Even if the two other arguments are wrong and there's selection against aging, there might be even stronger selection on other traits. Evolution just might not prioritise longevity.
Dwarkesh Patel • 116,769 views • 2 days ago

Stalin was actually a smart, well-read guy. His library contained loads of books from literature, history, political theory, all marked with his own personal annotations. Yet despite all this reading, he seems never to have once even doubted the truth of his Marxist dogma.
Dwarkesh Patel • 96,405 views • 2 days ago

New blackboard lecture w Eric Jang. He walks through how to build AlphaGo from scratch, but with modern AI tools. Sometimes you understand the future better by stepping backward. AlphaGo is still the cleanest worked example of the primitives of intelligence: search, learning from experience, and self-play. You have to go back to 2017 to get insight into how the more general AIs of the future might learn. Once he explained how AlphaGo works, it gave us the context to have a discussion about how RL works in LLMs and how it could work better – naive policy gradient RL has to figure out which of the 100k+ tokens in your trajectory actually got you the right answer, while AlphaGo’s MCTS suggests a strictly better action every single move, giving you a training target that sidesteps the credit assignment problem. The way humans learn is surely closer to the second. Eric also kickstarted an Autoresearch loop on his project. And it was very interesting to discuss which parts of AI research LLMs can already automate pretty well (implementing and running experiments, optimizing hyperparameters) and which they still struggle with (choosing the right question to investigate next, escaping research dead ends). Informative for all the recent discussion about when we should expect an intelligence explosion, and what it would look like from the inside. Timestamps: 0:00:00 – Basics of Go 0:08:06 – Monte Carlo Tree Search 0:31:53 – What the neural network does 1:00:22 – Self-play 1:25:27 – Alternative RL approaches 1:45:36 – Why doesn’t MCTS work for LLMs 2:00:58 – Off-policy training 2:11:51 – RL is even more information inefficient than you thought 2:22:05 – Automated AI researchers
Dwarkesh Patel • 681,666 views • 20 days ago
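A small sketch of the contrast drawn in the post above, on made-up numbers. It assumes the AlphaGo Zero style recipe, in which the policy network is trained toward the normalized MCTS visit counts at each move; the visit counts, network probabilities, and function names are hypothetical. For comparison, the last helper shows the naive policy-gradient situation, where one scalar reward is the only weight available for every token in the trajectory.

```python
import math

def visit_count_policy(visit_counts, temperature=1.0):
    """Turn MCTS visit counts for one position into a target distribution over moves."""
    scaled = [n ** (1.0 / temperature) for n in visit_counts]
    total = sum(scaled)
    return [s / total for s in scaled]

def cross_entropy(target, predicted):
    """Cross-entropy between a target distribution and the network's policy."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

def broadcast_reward(final_reward, num_tokens):
    """Naive policy gradient: every token gets the same scalar weight."""
    return [final_reward] * num_tokens

# Hypothetical search result for one board position with three legal moves.
visit_counts = [10, 70, 20]
target = visit_count_policy(visit_counts)   # per-move target produced by search
network_policy = [0.5, 0.3, 0.2]            # what the raw network currently predicts

# The search supplies a dense, per-move training signal.
loss = cross_entropy(target, network_policy)
print("search target:", target)
print(f"policy loss for this move: {loss:.4f}")

# A single end-of-trajectory reward says nothing about which token mattered.
print("per-token weights from one reward:", broadcast_reward(1.0, 5))
```

The point of the comparison is that the search-derived target differs from move to move, while the broadcast reward is identical for every token, which is the credit assignment problem the post refers to.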

Did a very different format with Reiner Pope – a blackboard lecture where he walks through how frontier LLMs are trained and served. It's shocking how much you can deduce about what the labs are doing from a handful of equations, public API prices, and some chalk. It’s a bit technical, but I encourage you to hang in there - it’s really worth it. There are fewer than a handful of people who understand the full stack of AI, from chip design to model architecture, as well as Reiner. It was a real delight to learn from him. Recommend watching this one on YouTube so you can see the chalkboard. 0:00:00 – How batch size affects token cost and speed 0:31:59 – How MoE models are laid out across GPU racks 0:47:02 – How pipeline parallelism spreads model layers across racks 1:03:27 – Why Ilya said, “As we now know, pipelining is not wise.” 1:18:49 – Because of RL, models may be 100x over-trained beyond Chinchilla-optimal 1:32:52 – Deducing long context memory costs from API pricing 2:03:52 – Convergent evolution between neural nets and cryptography
Dwarkesh Patel • 1,278,400 views • 1 month ago

We pre-train LLMs on the whole of the internet. You might think this explains how they learn so many emergent capabilities: the knowledge is implicit in the training data. But in fact models can do things that were never demonstrated anywhere in training! Sergey Levine argues that the real source of emergent capabilities is compositionality:
Dwarkesh Patel • 150,324 views • 5 days ago

The Andrej Karpathy interview. 0:00:00 – AGI is still a decade away 0:30:33 – LLM cognitive deficits 0:40:53 – RL is terrible 0:50:26 – How do humans learn? 1:07:13 – AGI will blend into 2% GDP growth 1:18:24 – ASI 1:33:38 – Evolution of intelligence & culture 1:43:43 – Why self driving took so long 1:57:08 – Future of education. Look up Dwarkesh Podcast on YouTube, Apple Podcasts, Spotify, etc. Enjoy!
Dwarkesh Patel • 10,743,891 views • 7 months ago

The day we discovered dark energy was "possibly the worst day in human history", says physicist Adam Brown. This discovery inevitably consigns human civilization to heat death, unless we can change the way physics works. And Adam's hope is that we can do exactly that.
Dwarkesh Patel • 195,155 views • 7 days ago

Jensen on the famous story about Larry Ellison and Elon Musk begging him for GPUs over dinner: "That never happened. We absolutely had dinner, and it was a wonderful dinner. At no time did they beg for GPUs. They just had to place an order." Jensen says Nvidia's allocation system is simple: you forecast, you place a purchase order, you get in the queue. First in, first out. And if your data center isn't ready, they might serve someone else first to maximize throughput. That's it. Nvidia doesn’t do highest-bidder pricing: "You set your price, and then people decide to buy it or not. I understand that others in the chip industry change their prices when demand is higher, but we just don't. That's just never been a practice of ours." "I prefer to be dependable, to be the foundation of the industry. You don't need to second-guess. If I quoted you a price, we quoted you a price. And if demand goes through the roof, so be it."
Dwarkesh Patel • 1,311,291 views • 1 month ago

There's a quadrillion-dollar question at the heart of AI: Why are humans so much more sample efficient than LLMs? There are three possible answers:
1. Architecture and hyperparameters (aka transformer vs whatever ‘algo’ cortical columns are implementing)
2. Learning rule (backprop vs whatever the brain is doing)
3. Reward function
Adam Marblestone believes the answer is the reward function. ML likes to use pretty simple loss functions, like cross-entropy. These are easy to work with. But they might be too simple for sample-efficient learning. Adam thinks that, in humans, the large number of highly specialised cells in the ‘lizard brain’ might actually be encoding information for sophisticated loss functions, used for ‘training’ in the more sophisticated areas like the cortex and amygdala. Like: the human genome is barely 3 gigabytes (compare that to the TBs of parameters that encode frontier LLM weights). So how can it include all the information necessary to build highly intelligent learners? Well, if the key to sample-efficient learning resides in the loss function, even very complicated loss functions can still be expressed in a couple hundred lines of Python code.
Dwarkesh Patel • 946,066 views • 1 month ago

John Collison and I interviewed Elon Musk. 0:00:00 - Orbital data centers 0:36:46 - Grok and alignment 0:59:56 - xAI’s business plan 1:17:21 - Optimus and humanoid manufacturing 1:30:22 - Does China win by default? 1:44:16 - Lessons from running SpaceX 2:20:08 - DOGE 2:38:28 - TeraFab
Dwarkesh Patel • 3,529,368 views • 3 months ago

Distilled recap of the back-and-forth with Jensen on export controls:
Dwarkesh: Wouldn’t selling Nvidia chips to China enable them to train models like Claude Mythos with cyber offensive capabilities that would be threats to American companies and national security?
Jensen: First of all, Mythos was trained on fairly mundane capacity and a fairly mundane amount of it by an extraordinary company. The amount of capacity and the type of compute it was trained on is abundantly available in China.
Dwarkesh: With that, could they eventually train a model like Mythos? Yes. But the question is, because we have more FLOPs, American labs are able to get to this level of capabilities first. Furthermore, even if they trained a model like this, the ability to deploy it at scale matters. If you had a cyber hacker, it's much more dangerous if they have a million of them versus a thousand of them.
Jensen: Your premise is just wrong. The fact of the matter is their AI development is going just fine. The best AI researchers in the world, because they are limited in compute, also come up with extremely smart algorithms. DeepSeek is not an inconsequential advance. The day that DeepSeek comes out on Huawei first, that is a horrible outcome for our nation.
Dwarkesh: Currently, you can have a model like DeepSeek that can run on any accelerator if it's open source. Why would that stop being the case in the future?
Jensen: Suppose it optimizes for Huawei. Suppose it optimizes for their architecture. It would put others at a disadvantage. As AI diffuses out into the rest of the world, their standards and their tech stack will become superior to ours because their models are open.
Dwarkesh: Tesla sold extremely good electric vehicles to China for a long time. iPhones are sold in China. They didn't cause some lock-in. China will still make their version of EVs, and they're dominating, or smartphones, they're dominating.
Jensen: We are not a car. The fact that I can buy this car brand one day and use another car brand another day is easy. Computing is not like that. There's a reason why x86 still exists. There's a reason why Arm is so sticky. These ecosystems are hard to replace.
Dwarkesh: It's just hard to imagine that there's a long-term lock-in to the Chinese ecosystem, even if they have this slightly better open-source model for a while. American labs port across accelerators constantly. Anthropic's models are run on GPUs, they're run on Trainium, they're run on TPUs. There are so many things you can do, from distilling to a model that's well fit for your chips.
Jensen: China is the largest contributor to open source software in the world. China's the largest contributor to open models in the world. Today it's built on the American tech stack, Nvidia’s. Fact. All five layers of the tech stack for AI are important. The United States ought to go win all five of them. In a few years' time, I'm making you the prediction that when we want American technology to be diffused around the world—out to India, out to the Middle East, out to Africa, out to Southeast Asia—on that day, I will tell you exactly about today's conversation, about how your policy ... caused the United States to concede the second largest market in the world for no good reason at all.
Dwarkesh Patel • 1,241,996 views • 1 month ago

The CCP is more like a VC fund than a traditional central planner. Arthur Kroeber argues this is how China has succeeded, gaining massive dominance in industrial manufacturing, and sidestepping the traditional failure modes of centrally planned economies. The CCP supports broad sectors rather than single nationalised firms, and encourages ruthless competition in those sectors. Even though the CCP knows that competition will cause state-supported firms to fail, it believes that a few winners will make up for the failures.
Dwarkesh Patel • 133,135 views • 7 days ago





