Loading video...

Video Failed to Load

There was a problem loading this video. This could be due to a temporary network issue or the video might be unavailable.

Andrej Karpathy: Internet training data is terrible, so big models end up compressing "memory" instead of doing cognitive work Use intelligent models to filter to the cognitive core With cleaner data, smaller models, likely distilled from a stronger one, are enough

Haider.

65,955 subscribers

384,224 views • 7 months ago •via X (Twitter)

Science & Technology News & Politics Education

Anya Rossi• Live Now

Private livecam show

0 Comments

No comments available

Comments from the original post will appear here

Related Videos

Andrej Karpathy on the importance of extremely smaller-sized distilled models (even 1Bn param model should be good enough) Video Credit - Original video from "No Priors: AI, Machine Learning, Tech, & Startups" YouTube Channel (Link in comment)

Andrej Karpathy on the importance of extremely smaller-sized distilled models (even 1Bn param model should be good enough) Video Credit - Original video from "No Priors: AI, Machine Learning, Tech, & Startups" YouTube Channel (Link in comment)

Rohan Paul

119,437 views • 1 year ago

Andrej Karpathy just made one of the most interesting arguments about AI model design that most people are completely missing. His take is that frontier AI models are not too big because the technology is complex and too big because the training data is garbage. When you or I think of the internet, we picture Wall Street Journal articles, Wikipedia entries, serious writing. That is not what a pretraining dataset looks like. When researchers at frontier labs look at random documents from the actual training corpus, it is stock ticker symbols, broken HTML, spam, gibberish. One estimate puts Llama 3's information compression at just 0.07 bits per token meaning the model has only a hazy recollection of most of what it trained on. So we build trillion parameter models not because we need a trillion parameter brain but because we need a trillion-parameter compression engine to squeeze some intelligence out of a firehose of noise. Most of those parameters are doing memory work, not cognitive work. Karpathy's prediction is separate the two entirely. Build a cognitive core, a model that contains only the algorithms for reasoning and problem-solving, stripped of encyclopedic memorization and pair it with external memory that it can query when it needs facts. He thinks a cognitive core trained on high-quality data could hit genuine intelligence at around one billion parameters. For reference, today's flagship models run between 200 billion and 1.8 trillion parameters with most of that weight dedicated to remembering the internet's slop. The trend is already moving his direction. GPT-4o operates at roughly 200 billion parameters and outperforms the original 1.8 trillion-parameter GPT-4. Inference costs for GPT-3.5-level performance dropped 280-fold between 2022 and 2024 driven almost entirely by smaller, cleaner, better-architected models. The real bottleneck in AI right now is not compute but rather data quality.

Andrej Karpathy just made one of the most interesting arguments about AI model design that most people are completely missing. His take is that frontier AI models are not too big because the technology is complex and too big because the training data is garbage. When you or I think of the internet, we picture Wall Street Journal articles, Wikipedia entries, serious writing. That is not what a pretraining dataset looks like. When researchers at frontier labs look at random documents from the actual training corpus, it is stock ticker symbols, broken HTML, spam, gibberish. One estimate puts Llama 3's information compression at just 0.07 bits per token meaning the model has only a hazy recollection of most of what it trained on. So we build trillion parameter models not because we need a trillion parameter brain but because we need a trillion-parameter compression engine to squeeze some intelligence out of a firehose of noise. Most of those parameters are doing memory work, not cognitive work. Karpathy's prediction is separate the two entirely. Build a cognitive core, a model that contains only the algorithms for reasoning and problem-solving, stripped of encyclopedic memorization and pair it with external memory that it can query when it needs facts. He thinks a cognitive core trained on high-quality data could hit genuine intelligence at around one billion parameters. For reference, today's flagship models run between 200 billion and 1.8 trillion parameters with most of that weight dedicated to remembering the internet's slop. The trend is already moving his direction. GPT-4o operates at roughly 200 billion parameters and outperforms the original 1.8 trillion-parameter GPT-4. Inference costs for GPT-3.5-level performance dropped 280-fold between 2022 and 2024 driven almost entirely by smaller, cleaner, better-architected models. The real bottleneck in AI right now is not compute but rather data quality.

Milk Road AI

199,951 views • 1 month ago

💡Why Vana? “Ultimately, AI models are only as good as their training data. So if you want to build the best AI, you need the best data.” shares the vision behind Vana’s mission, centered on DataDAOs that aggregate specific datasets, reward data owners for their contributions and level up AI models with high-quality data🏆

💡Why Vana? “Ultimately, AI models are only as good as their training data. So if you want to build the best AI, you need the best data.” shares the vision behind Vana’s mission, centered on DataDAOs that aggregate specific datasets, reward data owners for their contributions and level up AI models with high-quality data🏆

vana

108,354 views • 1 year ago

"World models or cognitive models is absolutely fundamental. Having the ability to generalize abstract knowledge is fundamental." - Gary Marcus, neuro-symbolic AI expert & cognitive scientist, at the DKGcon 2024, at the DKGcon 2024

"World models or cognitive models is absolutely fundamental. Having the ability to generalize abstract knowledge is fundamental." - Gary Marcus, neuro-symbolic AI expert & cognitive scientist, at the DKGcon 2024, at the DKGcon 2024

OriginTrail

26,811 views • 11 months ago

Tether Data, AI model training platform preview. This PaaS will be available to any company interested in (pre-)training own models. Bonus, at the core of this platform we're leveraging Holepunch's tech for all data-structures to make training and models highly-resilient and unstoppable. Soon available via Northern Data Group , leveraging 24k+ H100 GPUs.

Tether Data, AI model training platform preview. This PaaS will be available to any company interested in (pre-)training own models. Bonus, at the core of this platform we're leveraging Holepunch's tech for all data-structures to make training and models highly-resilient and unstoppable. Soon available via Northern Data Group , leveraging 24k+ H100 GPUs.

Paolo Ardoino 🤖

28,092 views • 1 year ago

Yoshua Bengio says AI does not create enough new jobs to balance the ones it replaces A handful of engineers earn huge salaries, while a vast number of workers face displacement as models master cognitive tasks "if we automate most of the cognitive work, what's gonna be left?"

Yoshua Bengio says AI does not create enough new jobs to balance the ones it replaces A handful of engineers earn huge salaries, while a vast number of workers face displacement as models master cognitive tasks "if we automate most of the cognitive work, what's gonna be left?"

Haider.

150,984 views • 5 months ago

This is the best way to understand how ML models actually work! Use Drawdata to draw a 2D dataset in Jupyter. Use it to actively pick data from the widget and update the model as the data is being drawn! Fully interactive, real-time, and open-source!

This is the best way to understand how ML models actually work! Use Drawdata to draw a 2D dataset in Jupyter. Use it to actively pick data from the widget and update the model as the data is being drawn! Fully interactive, real-time, and open-source!

Daily Dose of Data Science

52,070 views • 7 months ago

Hot tip for anyone doing AI dev: Use Ollama to easily run models like Deepseek-r1 or Gemma locally on your machine. It downloads them and spins up a server with an OpenAI SDK compatible API The smaller models are fast and good enough to work on new features or debug streaming without having to pay for API requests

Hot tip for anyone doing AI dev: Use Ollama to easily run models like Deepseek-r1 or Gemma locally on your machine. It downloads them and spins up a server with an OpenAI SDK compatible API The smaller models are fast and good enough to work on new features or debug streaming without having to pay for API requests

Wes Bos

153,425 views • 11 months ago

Data ownership & privacy are key to building an internet that benefits & protects its end-users AI is no different — Users’ data must be protected while interacting with AI Oasis' privacy-preserving smart contracts enable users to contribute data to AI models on their own terms

Data ownership & privacy are key to building an internet that benefits & protects its end-users AI is no different — Users’ data must be protected while interacting with AI Oasis' privacy-preserving smart contracts enable users to contribute data to AI models on their own terms

Oasis Labs

32,839 views • 3 years ago

Curious whether video generation models (like #SORA) qualify as world models? We conduct a systematic study to answer this question by investigating whether a video gen model is able to learn physical laws. Three are three key messages to take home: 1⃣The model generalises perfectly for in-distribution data, but fails to do out-of-distribution generalization. For combinatorial scenarios, scaling law is observed. 2⃣The models fail to abstract general rules and instead tries to mimic the closest training example. 3⃣The model prioritizes different attributes when referencing training data: color > size > velocity > shape. This work is a joint effort with our outstanding intern Yang Yue. Paper: Webpage:

Curious whether video generation models (like #SORA) qualify as world models? We conduct a systematic study to answer this question by investigating whether a video gen model is able to learn physical laws. Three are three key messages to take home: 1⃣The model generalises perfectly for in-distribution data, but fails to do out-of-distribution generalization. For combinatorial scenarios, scaling law is observed. 2⃣The models fail to abstract general rules and instead tries to mimic the closest training example. 3⃣The model prioritizes different attributes when referencing training data: color > size > velocity > shape. This work is a joint effort with our outstanding intern Yang Yue. Paper: Webpage:

Bingyi Kang

606,519 views • 1 year ago

Over the last few months, we’ve been thinking about how to learn from “off-domain” data - data from non-robot sources like video or simulation. These data sources are not quite good enough to learn policies (even monolithic VLA models) directly, but they still contain lots of information that can be useful for generalizable robot control. How can we develop robot learning models that are able to make use of this type of data for generalizable control? In new work, that we call HAMSTER, we show that VLMs can be useful for enabling robotic learning from off-domain data, but specifically when used through hierarchical VLA architectures. We show that this class of models can learn generalizable robot policies for the real world from large-scale, off-domain data. A 🧵 (1/10)

Over the last few months, we’ve been thinking about how to learn from “off-domain” data - data from non-robot sources like video or simulation. These data sources are not quite good enough to learn policies (even monolithic VLA models) directly, but they still contain lots of information that can be useful for generalizable robot control. How can we develop robot learning models that are able to make use of this type of data for generalizable control? In new work, that we call HAMSTER, we show that VLMs can be useful for enabling robotic learning from off-domain data, but specifically when used through hierarchical VLA architectures. We show that this class of models can learn generalizable robot policies for the real world from large-scale, off-domain data. A 🧵 (1/10)

Abhishek Gupta

11,994 views • 1 year ago

A large portion of animal intelligence doesn't require any learning, claims Andrej Karpathy: it's baked into DNA. AI models, by contrast, start from random weights. They have to learn their intelligence, mostly by imitating the internet. This is so different that Andrej thinks it's a fundamentally different kind of intelligence: LLMs are more like ghosts than animals.

A large portion of animal intelligence doesn't require any learning, claims Andrej Karpathy: it's baked into DNA. AI models, by contrast, start from random weights. They have to learn their intelligence, mostly by imitating the internet. This is so different that Andrej thinks it's a fundamentally different kind of intelligence: LLMs are more like ghosts than animals.

Dwarkesh Patel

81,596 views • 18 days ago

The AV industry is hitting a “Data Wall.” Brute-force training on petabytes of data is reaching a dead end. As models improve, further gains get harder. We bypassed this wall with Factored Embodied AI—achieving zero-shot autonomous steering with just 1,000 hours of driving data.

The AV industry is hitting a “Data Wall.” Brute-force training on petabytes of data is reaching a dead end. As models improve, further gains get harder. We bypassed this wall with Factored Embodied AI—achieving zero-shot autonomous steering with just 1,000 hours of driving data.

Helm.ai

3,003,098 views • 6 months ago

Geoffrey Hinton says the current path of scaling is hitting a limit Most high-value data is locked inside companies, and the "free internet" is largely exhausted The solution is for models to generate their own training data through reasoning "that's how AlphaGo beat humans"

Geoffrey Hinton says the current path of scaling is hitting a limit Most high-value data is locked inside companies, and the "free internet" is largely exhausted The solution is for models to generate their own training data through reasoning "that's how AlphaGo beat humans"

Haider.

149,374 views • 5 months ago

⚛️ TxGemma is a collection of open models designed to accelerate therapeutic data analysis with the power of Google DeepMind’s Gemma.

⚛️ TxGemma is a collection of open models designed to accelerate therapeutic data analysis with the power of Google DeepMind’s Gemma.

Google AI Developers

127,015 views • 1 year ago

BREAKING: AI has eaten the Internet. Data labeling is so over. & $30 trillion of human work is on the verge of automation. Inside The $2.2B AI Research Accelerator, Turing Founder & CEO, Jonathan Siddharth (), joins Sourcery to break down the severe power shift in AI training: from commodity data labeling → expert research Positioning Turing apart from AI data providers like Scale AI, Mercor, & Surge. (00:00) AI Ate The Internet (00:49) Training Superintelligence: the race to AGI (02:31) Viral tweet (03:24) What Turing actually does (04:43) The internet data is “used up” — where will new data come from? (05:34) Four pillars of superintelligence: multimodality, reasoning, tool use, coding (06:07) Automating $30T of global knowledge work (09:18) The $1B revenue opportunity (10:59) Why Turing is a research-first accelerator, not a data labeler (13:45) Jonathan’s Stanford AI Lab roots & founding DNA (17:57) How models are built: pre-training vs. post-training (20:14) RLHF, reinforcement learning, & “breaking the models” (25:19) GPT-5 and the myth of rapid takeoff (30:46) Safety debates and human-in-the-loop systems (34:53) Closing Enterprise Gap: finance, insurance, & pharma (39:23) Why proprietary enterprise data is the next moat in AI

BREAKING: AI has eaten the Internet. Data labeling is so over. & $30 trillion of human work is on the verge of automation. Inside The $2.2B AI Research Accelerator, Turing Founder & CEO, Jonathan Siddharth (), joins Sourcery to break down the severe power shift in AI training: from commodity data labeling → expert research Positioning Turing apart from AI data providers like Scale AI, Mercor, & Surge. (00:00) AI Ate The Internet (00:49) Training Superintelligence: the race to AGI (02:31) Viral tweet (03:24) What Turing actually does (04:43) The internet data is “used up” — where will new data come from? (05:34) Four pillars of superintelligence: multimodality, reasoning, tool use, coding (06:07) Automating $30T of global knowledge work (09:18) The $1B revenue opportunity (10:59) Why Turing is a research-first accelerator, not a data labeler (13:45) Jonathan’s Stanford AI Lab roots & founding DNA (17:57) How models are built: pre-training vs. post-training (20:14) RLHF, reinforcement learning, & “breaking the models” (25:19) GPT-5 and the myth of rapid takeoff (30:46) Safety debates and human-in-the-loop systems (34:53) Closing Enterprise Gap: finance, insurance, & pharma (39:23) Why proprietary enterprise data is the next moat in AI

Molly O’Shea

69,850 views • 8 months ago

.Trenton Bricken explains how we know LLMs are actually generalizing - aka they're not just stochastic parrots: - Training models on code makes them better at reasoning in language. - Models fine tuned on math problems become better at entity detection. - We can just straightforwardly read the world-models developed by smaller NNs which are easier to interpret (Othello). Transfer learning shows models are developing a deeper understanding of their data. Full episode out Thursday.

.Trenton Bricken explains how we know LLMs are actually generalizing - aka they're not just stochastic parrots: - Training models on code makes them better at reasoning in language. - Models fine tuned on math problems become better at entity detection. - We can just straightforwardly read the world-models developed by smaller NNs which are easier to interpret (Othello). Transfer learning shows models are developing a deeper understanding of their data. Full episode out Thursday.

Dwarkesh Patel

236,977 views • 2 years ago

Larry Ellison basically explained why OpenAI is positioned to win the AI race. Today’s models all trained on the same thing: public data. The internet. Books. Papers. General knowledge. Impressive, but capped. Ellison’s point: peak value only comes when AI is trained on private data. That’s where OpenAI is moving: • partnerships with universities to work with unpublished research • experts teaching models real job tasks, not just theory • deep access to how science, medicine, law, and finance actually work Public data makes AI smart. Private data makes it decisive. The next AI leap won’t come from reading more of the internet. It’ll come from being invited behind the curtain. That’s the real moat.

Larry Ellison basically explained why OpenAI is positioned to win the AI race. Today’s models all trained on the same thing: public data. The internet. Books. Papers. General knowledge. Impressive, but capped. Ellison’s point: peak value only comes when AI is trained on private data. That’s where OpenAI is moving: • partnerships with universities to work with unpublished research • experts teaching models real job tasks, not just theory • deep access to how science, medicine, law, and finance actually work Public data makes AI smart. Private data makes it decisive. The next AI leap won’t come from reading more of the internet. It’ll come from being invited behind the curtain. That’s the real moat.

VraserX e/acc

167,036 views • 5 months ago

Introducing OlmoEarth 🌍, state-of-the-art AI foundation models paired with ready-to-use open infrastructure to turn Earth data into clear, up-to-date insights within hours—not years.

Introducing OlmoEarth 🌍, state-of-the-art AI foundation models paired with ready-to-use open infrastructure to turn Earth data into clear, up-to-date insights within hours—not years.

Ai2

680,357 views • 7 months ago