Loading video...

Video Failed to Load

Go Home

Andrej Karpathy: Internet training data is terrible, so big models end up compressing "memory" instead of doing cognitive work Use intelligent models to filter to the cognitive core With cleaner data, smaller models, likely distilled from a stronger one, are enough

384,224 views • 7 months ago •via X (Twitter)

0 Comments

No comments available

Comments from the original post will appear here

Related Videos

Andrej Karpathy just made one of the most interesting arguments about AI model design that most people are completely missing. His take is that frontier AI models are not too big because the technology is complex and too big because the training data is garbage. When you or I think of the internet, we picture Wall Street Journal articles, Wikipedia entries, serious writing. That is not what a pretraining dataset looks like. When researchers at frontier labs look at random documents from the actual training corpus, it is stock ticker symbols, broken HTML, spam, gibberish. One estimate puts Llama 3's information compression at just 0.07 bits per token meaning the model has only a hazy recollection of most of what it trained on. So we build trillion parameter models not because we need a trillion parameter brain but because we need a trillion-parameter compression engine to squeeze some intelligence out of a firehose of noise. Most of those parameters are doing memory work, not cognitive work. Karpathy's prediction is separate the two entirely. Build a cognitive core, a model that contains only the algorithms for reasoning and problem-solving, stripped of encyclopedic memorization and pair it with external memory that it can query when it needs facts. He thinks a cognitive core trained on high-quality data could hit genuine intelligence at around one billion parameters. For reference, today's flagship models run between 200 billion and 1.8 trillion parameters with most of that weight dedicated to remembering the internet's slop. The trend is already moving his direction. GPT-4o operates at roughly 200 billion parameters and outperforms the original 1.8 trillion-parameter GPT-4. Inference costs for GPT-3.5-level performance dropped 280-fold between 2022 and 2024 driven almost entirely by smaller, cleaner, better-architected models. The real bottleneck in AI right now is not compute but rather data quality.

Milk Road AI

199,951 views • 1 month ago