Introducing SubQ - a major breakthrough in LLM intelligence. It is the...

Uploaded: 2026-05-05T14:00:15.000Z
Duration: PT84.800S
Channel: Alexander Whedon

57:39

Met a guy making $1.6 million/year as an LLM engineer. I asked him how he learned LLMs from scratch. He sent me the exact video that got him in. A 1 hour course on how LLMs actually work. He shows how transformers inside LLMs like ChatGPT & Claude are actually built. I watched it last night. Halfway through, I realized LLM architecture is way simpler than they make it look. Bookmark this and read the article below. • 00:00 - LLM foundations • 04:21 - LLM tokenization • 05:43 - LLMs vector embeddings • 22:16 - attention mechanism of LLM • 43:42 - LLM multi head attention

Roan

73,925 views • 2 days ago

41:53

Andrej Karpathy just exposed how LLMs actually think: "LLMs don't want to succeed. They want to imitate." It has 80 transformer layers and spends the SAME compute on every single token as your brain. In a 45-minute talk at Microsoft Build, Karpathy reveals the full psychology of LLMs. Worth more than any $500 prompting course you've seen on your timeline.

Morty

139,031 views • 20 days ago

3:36

Announcing How Transformer LLMs Work, created with Jay Alammar and Maarten Grootendorst, co-authors of the beautifully illustrated book, “Hands-On Large Language Models.” This course offers a deep dive into the inner workings of the transformer architecture that powers large language models (LLMs). The transformer architecture revolutionized generative AI; in fact, the "GPT" in ChatGPT stands for "Generative Pre-Trained Transformer." Originally introduced in the Google Brain team's groundbreaking 2017 paper "Attention Is All You Need," by Vaswani and others, transformers were a highly scalable model for machine translation tasks. Variants of this architecture now power today’s LLMs such as those from OpenAI, Google, Meta, Cohere, Anthropic and DeepSeek. In this course, you’ll learn in detail how LLMs process text. You'll also work through code examples that illustrate the transformer's individual components. In detail, you’ll learn: - How the representation of language has evolved, from Bag-of-Words to Word2Vec embeddings to the transformer architecture that captures a word's meanings taking into account the context of other words in the input. - How inputs are broken down into tokens before they are sent to the language model. - The details of a transformer's main stages: Tokenization and embedding, the stack of transformer blocks, and the language model head. - The inner workings of the transformer block, including attention, which calculates relevance scores, and the feedforward layer, which incorporates stored information learned in training. - How cached calculations make transformers faster. - Some of the most recent ideas in the latest models such as Mixture-of-Experts (MoE), which uses multiple sub-models and a router on each layer to improve the quality of LLMs. By the end of this course, you’ll have a deep understanding of how LLMs actually process text and be able to read through papers describing the latest models and understand the details.
Gaining this intuition will improve your approach to building LLM applications. Please sign up here:

Andrew Ng

259,421 views • 1 year ago

1:00

New course: Transformers in Practice. You'll get a practical view of how transformer-based LLMs work, so you can reason about their behavior, diagnose problems like slow inference, and make smarter decisions about deployment. This course is built in partnership with AMD and taught by Sharon Zhou. You'll see how transformers generate text one token at a time, how the model decides which earlier words matter most when predicting the next one, and how techniques like quantization speed up inference on GPUs. This is not a video-only course; interactive visualizations throughout let you play with these concepts and build intuition that sticks. Skills you'll gain: - Understand why LLMs hallucinate, and how RAG and chain-of-thought shape what they generate - Look inside the model to see how attention and layers combine to predict the next token - Diagnose inference bottlenecks and learn the techniques that speed up transformers on GPUs Join and understand what's really happening inside your LLMs:

Andrew Ng

119,783 views • 2 months ago

3:01

Introducing Cognee v1.0: a major breakthrough in agentic intelligence. It is 145% better than Opus 4.8 and GPT 5.5 at long context memory retrieval. Cognee allows a 100 BILLION token context window 100,000x more than Claude. It's: - 6.9x cheaper than GPT 5.5 and Opus 4.8 - Cold starts in 350ms & searches in 260ms Why this matters: Today agents forget important context, redo tasks, waste tokens, and slow down as workflows get more complex. Cognee solves this. It’s not a place to build agents. It connects to the agents you’ve already built, across any platform, and makes them significantly cheaper, faster, and more accurate. Here's how it works:

Vasilije

844,442 views • 1 month ago

2:02

Not every AI interaction is best served with a large frontier model. There is a long tail of trillion-token use cases, from tagging to search, best served by a sub-10b-parameter model that runs in milliseconds and costs many orders of magnitude less than the frontier. We built Freesolo Flash to make training loops like SFT and RL easy and end-to-end completable through your coding agent.

Freesolo

26,686 views • 10 days ago

0:23

A tricky LLM interview question: You're serving a reasoning model on vLLM, and it keeps running out of GPU memory on long traces. So you add KV cache compression and evict 90% of the cached tokens. VRAM usage stays as is and the GPU still runs out of memory. Why? (answer below) Evicting 90% of the KV cache can free almost none of the memory it was using. This sounds counterintuitive, but it follows directly from how production servers store the cache today. The KV cache grows with every token a model generates. Each token appends its key and value vectors across every layer, and nothing is freed while generation continues. This is the dominant memory cost for reasoning models. If a 32K-token CoT caches ~32K tokens of KV vectors, a Qwen3-32B with 4-bit weights will run out-of-memory around 24K tokens on a 24GB GPU. One obvious solution is to keep the important tokens and drop the rest, since attention is sparse enough to allow it. But this does not solve the memory problem yet. The reason is paged attention, which is the memory manager behind vLLM and most production servers. Under the hood, it splits GPU memory into fixed physical blocks, each of which holds the KV for about 16 tokens. A block returns to the allocator only when every slot inside it is empty. Since the eviction logic selects tokens by importance, and such tokens are scattered across blocks, most blocks are left with at least some survivor tokens despite eviction. For instance, if the logic evicts 14k of 16k tokens across 1,000 blocks, most blocks will likely still hold at least one token. This means the allocator frees only a small fraction of the memory. Placing the new tokens into those freed slots is not ideal because it breaks the cache's layout. Say token 16,001 arrives, and it's placed in the slot the 40th token used to hold. The cache now reads position 39, then 16,001, then 41, so the cache is no longer in token order.
Attention can still compute the right answer from that, but only if every slot now carries a separate note recording which position it actually holds. This introduces another bookkeeping cost that an in-order layout inherently avoids. So the cache is logically 90% smaller and still physically the same size. Many compression results miss this because they measure on pre-allocated contiguous tensors rather than a paged server. There's another problem. Eviction methods pick which tokens to keep by looking at the attention scores themselves (as expected). But fast attention kernels used in production, like FlashAttention, never save those scores. They compute attention in small pieces and throw the full score grid away as they go, which is also why they're fast. So the exact signal eviction methods need isn't available in memory. The workaround is to fall back to eager attention and build the full matrix, which gives up the speed FlashAttention was there to provide. NVIDIA published a method called TriAttention to solve both these problems. It never needs attention scores. Instead, it scores tokens from the geometry of the model's key and query vectors before RoPE is applied, where those vectors sit in stable clusters. For the memory problem, it runs a compaction pass every 128 decoded tokens. The surviving tokens slide forward to close the holes eviction creates, so whole blocks empty out and return to the allocator while the cache stays in token order. On long reasoning traces, the approach matches full-attention accuracy while decoding 2.5x faster and using 10.7x less KV memory. KV cache compression is a big infrastructure problem. The number that decides whether it works is the count of freed blocks, not the count of evicted tokens. You can find the NVIDIA write-up here: I wrote a first-principles breakdown of how the KV cache works. 
It walks through why the model stores keys and values at all, why the cache grows with every token, and a comparison of LLM generation speed with and without KV caching. Read it below.
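
The gap between evicted tokens and freed blocks can be checked with a small simulation. This is only a sketch under loud assumptions: eviction is modelled as uniformly random (real importance scores are not uniform), and the function names are hypothetical, not part of vLLM or TriAttention. It uses the post's own numbers: 16 tokens per block and 14k of 16k tokens evicted. Under this assumption the share of freed blocks stays far below the share of evicted tokens, while sliding the survivors forward would leave only 125 blocks in use.

```python
import random

BLOCK_SIZE = 16  # tokens per physical block, the figure quoted in the post


def blocks_freed_by_eviction(num_tokens, evict_count, seed=0):
    """Evict `evict_count` tokens at random (an assumption) and count
    how many blocks end up with every slot empty."""
    rng = random.Random(seed)
    num_blocks = (num_tokens + BLOCK_SIZE - 1) // BLOCK_SIZE
    evicted = set(rng.sample(range(num_tokens), evict_count))
    survivors_per_block = [0] * num_blocks
    for pos in range(num_tokens):
        if pos not in evicted:
            survivors_per_block[pos // BLOCK_SIZE] += 1
    # Only blocks with zero survivors can go back to the allocator.
    freed = sum(1 for count in survivors_per_block if count == 0)
    return freed, num_blocks


def blocks_after_compaction(num_tokens, evict_count):
    """Blocks still needed if survivors slide forward into contiguous slots."""
    survivors = num_tokens - evict_count
    return (survivors + BLOCK_SIZE - 1) // BLOCK_SIZE


freed, total = blocks_freed_by_eviction(16_000, 14_000)
print(f"evicted {14_000 / 16_000:.1%} of tokens, freed {freed / total:.1%} of blocks")
print(f"after compaction: {blocks_after_compaction(16_000, 14_000)} of {total} blocks in use")
```

The design point is that the allocator works at block granularity, so the useful metric is the second number printed on each line, not the eviction ratio.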

Avi Chawla

269,149 views • 1 month ago

2:11

New course on serving LLMs efficiently -- how do you serve models to many concurrent users at low latency and reasonable cost? This short course is built with Red Hat and taught by Cedric Clyburn. Efficient LLM serving requires efficient memory management. A 70B-parameter model takes ~140 GB just to load the weights. On top of that, every active request needs its own chunk of GPU memory, the KV cache, to store the token context it has built up so far. In this course, you'll learn to reduce a model's memory footprint with quantization and serve it using vLLM, which handles many concurrent requests efficiently through smart memory management. Skills you'll gain: - Quantize a model and measure the accuracy tradeoff - Serve a model with vLLM and watch it handle concurrent requests efficiently - Benchmark your deployment and make informed tradeoffs between speed, cost, and accuracy Join and learn to serve LLMs efficiently:
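
The ~140 GB figure follows from 70 billion parameters stored at 2 bytes each (16-bit weights). The sketch below reproduces that arithmetic and adds a per-token KV cache estimate. The layer count, KV head count, and head dimension are illustrative assumptions and are not taken from the course; the helper names are hypothetical.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Memory needed just to hold the weights, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value):
    """Bytes one token adds to the KV cache: a key and a value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# 70B parameters at 16-bit precision (2 bytes each) versus 4-bit (0.5 bytes each).
print(weight_memory_gb(70e9, 2))
print(weight_memory_gb(70e9, 0.5))

# Assumed shape for illustration only: 80 layers, 8 KV heads, head_dim 128, 16-bit cache.
per_token = kv_cache_bytes_per_token(80, 8, 128, 2)
print(per_token, "bytes per cached token")
print(per_token * 8192 / 1e9, "GB for one 8,192-token request")
```

Because the KV cache term scales with both context length and the number of concurrent requests, it is the part that a serving engine has to manage dynamically, while the weight term is fixed once the model is loaded.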

Andrew Ng

125,665 views • 1 month ago

0:31

ELON: GROK’S A BEAST - IT CAN'T BE CRAMMED INTO A CAR “Grok is a giant model; you could not possibly squeeze Grok onto a car, that’s for sure. With Grok, it’s trying to solve for artificial intelligence, with a massive amount of AI training, compute, and inference compute. Grok 5 will only run effectively on a GV300 [AI training cluster]; that’s how much of a beast Grok 5 is. Whereas Tesla’s models are less than 10% the size, maybe closer to 5% the size of Grok.” Source: Elon Musk, Tesla Podcast, October 22, 2025

Mario Nawfal

33,523 views • 9 months ago

3:31

Sam Altman says the AI infrastructure race is not about bigger chatbots. It is about shifting the planet's intellectual output from human brains to AI brains. "If people knew what we could do with compute, they would want way, way more." "The thing I am personally most excited about is to use AI and lots of compute to discover new science." "A recent cool example here is we built the Sora Android app using Codex... in less than a month." "You can imagine that way much further, where entire companies can build their products using lots of compute." Then he gives the scale: "Let's say that an AI company today might be generating something on the order of 10 trillion tokens a day out of frontier models." "We're going to have these models at a company be outputting more tokens per day than all of humanity put together, and then 10 times that, and then 100 times that." The bottleneck is moving from software talent to compute supply, inference speed, product integration, and where all that machine intelligence gets pointed.

Karl Mehta

16,606 views • 29 days ago

0:26

Researchers found a way to make LLMs 8.5x faster! (without compromising accuracy) Speculative decoding is quite an effective way to address the single-token bottleneck in traditional LLM inference. A small "draft" model first generates the next several tokens, then the large model verifies all of them at once in a single forward pass. If a token at any position is wrong, you keep everything before it and restart from there. This never does worse than normal decoding. But current drafters in Speculative decoding still guess one token at a time. That makes the drafting step itself a bottleneck, capping real-world speedups at 2-3x. DFlash is a new technique that swaps the autoregressive drafter with a lightweight block diffusion model that guesses all tokens in one parallel shot. Drafting cost stays flat no matter how many tokens you speculate. On top of that, the drafter is conditioned on hidden features pulled from multiple layers of the target model and injected into every draft layer, so it makes significantly better guesses than a drafter working from scratch. In the side-by-side demo below, vanilla decoding runs at 48.5 tokens/sec. DFlash hits 415 tokens/sec on the same model, with zero quality loss. It's already integrated with vLLM, SGLang, and Transformers, with draft models on HuggingFace for several models like Qwen3, Qwen3.5, Llama 3.1, Kimi-K2.5, gpt-oss, and many more. I have shared the GitHub repo in the replies! KV caching is another must-know technique to boost LLM inference. I recently wrote an article about it. Read it below. 👉 Over to you: What use case are you working on that can benefit from this new technique?
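
The draft-then-verify loop described above can be sketched with toy stand-in models. This is the plain greedy form of speculative decoding, not DFlash: the drafter here still guesses one token at a time. `target_next` and `draft_next` are hypothetical functions that return the next token for a context, and the single batched verification pass of a real system is simulated by checking positions one by one while counting it as one target pass.

```python
def greedy_decode(target_next, prompt, num_new_tokens):
    """Baseline: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(num_new_tokens):
        tokens.append(target_next(tokens))
    return tokens


def speculative_decode(target_next, draft_next, prompt, num_new_tokens, draft_len=4):
    """Greedy speculative decoding; returns (tokens, number of target passes)."""
    tokens = list(prompt)
    goal = len(prompt) + num_new_tokens
    target_passes = 0
    while len(tokens) < goal:
        # 1. The cheap drafter proposes draft_len tokens.
        draft = []
        for _ in range(draft_len):
            draft.append(draft_next(tokens + draft))
        # 2. One target pass checks every drafted position.
        target_passes += 1
        accepted = []
        for i, guess in enumerate(draft):
            expected = target_next(tokens + draft[:i])
            if guess == expected:
                accepted.append(guess)
            else:
                # Keep everything before the mismatch, take the target's token, restart.
                accepted.append(expected)
                break
        else:
            # Every guess matched, so the same pass also yields one extra token.
            accepted.append(target_next(tokens + draft))
        tokens.extend(accepted)
    return tokens[:goal], target_passes


# Toy models: the drafter agrees with the target except at every fifth position.
def target_next(ctx):
    return (ctx[-1] + len(ctx)) % 7


def draft_next(ctx):
    return 0 if len(ctx) % 5 == 0 else target_next(ctx)


out, passes = speculative_decode(target_next, draft_next, [1], 20)
print(out == greedy_decode(target_next, [1], 20), passes)
```

Every pass contributes at least one token chosen by the target, which is why the output matches plain greedy decoding and the number of target passes never exceeds the number of generated tokens.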

Avi Chawla

157,390 views • 2 months ago

0:55

⚡️ JUST IN: MUSK LAUNCHES GROK 4.5, CALLING IT "OPUS-CLASS" AI MODEL Musk just released Grok 4.5 on Grok Build, Cursor, and the SpaceXAI Console, calling it an "Opus-class" AI model. He says it is faster, more token-efficient and lower cost, with EU access expected by mid-July. Built on a 1.5 TRILLION parameter foundation model, it drops the same day as OpenAI's GPT-5.6. API pricing is $2 per million input tokens and $6 per million output tokens.

Coin Bureau

50,488 views • 19 days ago

2:41

New short course: Attention in Transformers: Concepts and Code in PyTorch. Last week we released a course on how LLM transformers work. This week, go deeper and learn about the technical ideas behind the attention mechanism, and see how to code it in PyTorch. This course is built with Joshua Starmer, Founder and CEO of StatQuest. The attention mechanism was a breakthrough that led to transformers, the architecture powering large language models like ChatGPT. Transformers, introduced in the 2017 paper "Attention Is All You Need" by Vaswani and others, took off because of their highly scalable design. In this course, you’ll learn how the attention mechanism, a key element of transformer-based LLMs, works and implement it in PyTorch. You'll develop deep intuition about building reliable, functional, and scalable AI applications. What you will do: - Understand the evolution of the attention mechanism, a key breakthrough that led to transformers. - Learn the relationships between word embeddings, positional embeddings, and attention. - Learn about the Query, Key, and Value matrices, and how to produce and use them in attention. - Walk through the math required to calculate self-attention and masked self-attention to learn why and how they work. - Understand the difference between self-attention and masked self-attention and how one is used in the encoder to build context-aware embeddings and the other is used in the decoder for generative outputs. - Learn the details of the encoder-decoder architecture, cross-attention, and multi-head attention and how they are all incorporated into a transformer. - Use PyTorch to code a class that implements self-attention, masked self-attention, and multi-head attention. There are lots of exciting technical details in this course. Please sign up here:

Andrew Ng

132,220 views • 1 year ago

0:21

🎉 Congrats to MiniMax (official) on releasing MiniMax M3! Frontier coding and agentic capabilities, native image and video input, computer use, and a 1M-token context window, all in a single open model. At the heart of M3 is MSA, a new sparse attention architecture: instead of attending densely over the full KV cache, each query scores 128-token KV blocks and runs attention only over the top blocks. That is what makes 1M-token context practical to serve. M3 runs in vLLM with day-0 support, verified on NVIDIA and AMD hardware: ✨ MSA sparse attention with dedicated prefill and decode kernels ✨ 1M-token context serving with prefix caching and chunked prefill ✨ BF16 and MXFP8 checkpoints, with MoE backends for both Hopper and Blackwell ✨ Native multimodal input (image + video) ✨ Tool calling, reasoning parsing, and thinking-mode control for agent workloads Day-0 support like this is a true team effort. Grateful to the teams at MiniMax (official), NVIDIA AI, AI at AMD, and Inferact, and to the vLLM community for making it happen. 🙏 Deep dive into the implementation, kernel work, and deployment recipes: 🔗
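
The block-selection idea (score KV blocks, then attend only over the top ones) can be sketched for a single query in plain Python. This is not MiniMax's MSA implementation: the post does not say how blocks are scored, so scoring a block by the dot product of the query with the block's mean key is an assumption, and the function name and `top_blocks` parameter are hypothetical.

```python
import math


def block_sparse_attention(query, keys, values, block_size=128, top_blocks=2):
    """Attend over only the highest-scoring KV blocks for one query vector."""

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    num_blocks = (len(keys) + block_size - 1) // block_size
    scale = 1.0 / math.sqrt(len(query))

    # Score each block cheaply. ASSUMPTION: query dotted with the block's mean key.
    block_scores = []
    for b in range(num_blocks):
        block = keys[b * block_size:(b + 1) * block_size]
        mean_key = [sum(col) / len(block) for col in zip(*block)]
        block_scores.append((dot(query, mean_key), b))
    chosen = sorted(b for _, b in sorted(block_scores, reverse=True)[:top_blocks])

    # Ordinary softmax attention, restricted to tokens inside the chosen blocks.
    idx = [i for b in chosen
           for i in range(b * block_size, min((b + 1) * block_size, len(keys)))]
    logits = [dot(query, keys[i]) * scale for i in idx]
    peak = max(logits)
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    out = [sum(w * values[i][d] for w, i in zip(weights, idx)) / total
           for d in range(len(values[0]))]
    return out, chosen


# Tiny demo with 2-token blocks so the selection is easy to follow.
keys = [[0, 0], [0, 0], [1, 0], [1, 0], [0, 1], [0, 1]]
values = [[1], [2], [3], [4], [5], [6]]
print(block_sparse_attention([1, 0], keys, values, block_size=2, top_blocks=1))
```

The cost of the dense step depends on `top_blocks * block_size` rather than on the full cache length, which is the property that makes very long contexts cheaper to serve.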

vLLM

40,306 views • 1 month ago

0:56

PowerInfer - a high-speed inference engine for deploying LLMs locally. Just came across this super interesting project on speeding up inference. It's not MoE but it's a simple approach that exploits the high locality in LLM inference to design a GPU-CPU hybrid inference engine. Hot-activated neurons are preloaded onto the GPU for fast access, while cold-activated neurons (the majority) are computed on the CPU. This approach significantly reduces GPU memory demands and CPU-GPU data transfer. It achieves an average token generation rate of 13.20 tokens/s, with a peak of 29.08 tokens/s, across various LLMs on a single NVIDIA RTX 4090 GPU. That is only 18% lower than the rate achieved by a top-tier server-grade A100 GPU. It also significantly outperforms llama.cpp by up to 11.69x while retaining model accuracy. There is a lot more innovation around inference that's coming fast. Really encouraged by the study on sparse computation to enhance the computational efficiency of LLMs. It's now possible to use PowerInfer with Llama 2 and Falcon 40B. Mistral-7B support is coming soon!

elvis

261,622 views • 2 years ago

21:39

Andrej Karpathy just revealed how LLMs actually think: "GPT-4 knows it failed. It just won't tell you unless you ask." >80% of GPT-4 errors are recoverable - the model already knows it screwed up. It has 80 transformer layers and spends the SAME compute on every single token as your brain. In a 20-minute speech at Microsoft Build, Karpathy reveals the full psychology of LLMs. Worth more than any $500 prompting course you've seen on your timeline.

Ricker

161,978 views • 13 days ago

0:53

Jonathan Ross just revealed why AI companies aren’t growing faster. Not demand. Not competition. Physics. Ross: “The demand for compute is insatiable.” There isn’t enough compute in the world. Not a temporary shortage. A fundamental gap between what the market wants and what the infrastructure can deliver. Ross: “Right now, one of the biggest complaints of Anthropic is the rate limits. People can’t get enough tokens.” Rate limits aren’t product decisions. They’re rationing. Companies forced to regulate access because infrastructure cannot meet demand. Slower services. Token caps. The only things standing between these companies and a revenue surge they can’t access. Every token cap is a revenue cap. Every slowdown is a sale that didn’t happen. Ross: “If Anthropic was given twice the inference compute, within one month their revenue would almost double.” Read that again. Double the compute. Double the revenue. Within thirty days. That’s not a growth projection. That’s a measurement of how deep the backlog already is. The demand exists right now. It’s sitting in a queue. The only thing between these companies and that revenue is physical hardware they don’t have. This breaks every assumption about how tech companies scale. Usually you scale by finding customers. AI companies have infinite customers. They scale by finding hardware. The constraint isn’t market fit. It isn’t distribution. It isn’t competition. It’s processing power. This is why Jensen Huang is the most important person in the world right now. NVIDIA doesn’t just make chips. It makes the thing every government, every AI lab, and every company racing for this future needs more of and can’t get enough of. The compute bottleneck isn’t a tech industry problem. It’s a civilizational one. The winner of this era isn’t determined by who builds the smartest model. Every major lab has a frontier model. The winner is whoever secures the most compute fastest while everyone else rations what’s left. 
The race isn’t for intelligence. It’s for infrastructure. And right now there isn’t enough to go around.

Dustin

28,395 views • 5 months ago

0:39

Micron is going to $4,000, and once you understand what inference actually is, the number stops sounding crazy (Save this). Dylan Patel just said that by 2030, OpenAI and Anthropic alone will need over 100 gigawatts of compute combined, and by 2040 we may not even be measuring AI infrastructure in gigawatts anymore. We may be talking about terawatts. Every single one of those gigawatts needs memory to function. Without it, the compute is worthless. Most people heard that and thought about Nvidia, but they should be thinking about Micron. Every AI model generating a response has two phases. The first is prefill: processing your prompt, which is compute-heavy. The second is decode: generating each word one token at a time, and that phase is almost entirely memory-bound, not compute-bound. During decode, the GPU's processing units sit idle more than 95% of the time, waiting for data to arrive from memory. Google confirmed in a research paper that decode-phase bottlenecks are dominated by memory bandwidth and capacity, not raw compute. The GPU is not the bottleneck; the memory feeding the GPU is. This matters because inference is now where all the money lives. Training a model happens once. Inference happens billions of times a day: every ChatGPT response, every Claude output, every agentic workflow running in the background. Every one of those token streams is a billing event tied directly to memory performance. Adding more GPUs does not fix this, because GPUs are already underutilized in inference while they sit idle waiting on memory. Adding more memory bandwidth and capacity is what directly reduces token cost, reduces latency, and allows the same cluster to serve dramatically more users simultaneously. Longer context windows compound the problem further: a model running a 1 million token context window requires dramatically more memory per session than a 10,000 token window, and every new model generation pushes context longer.
The market treats memory as a downstream beneficiary of Nvidia orders. The correct framework is the opposite: Micron is the upstream constraint on how much value every Nvidia GPU can actually generate at inference scale. Micron guided Q4 to $50 billion in revenue, has HBM4 ramping at twice the pace of the prior generation, and CEO Sanjay Mehrotra has said supply will not catch demand before the end of 2027. At 8x forward earnings on $112 projected FY2027 EPS, Micron is the most undervalued infrastructure company in the entire AI stack. Inference is memory. Memory is Micron, and the inference ramp has barely started. Milk Road Pro members are already up massively on this position and we're just getting started. If you want the full breakdown of what we're buying and why, come join us for just a dollar using the link below!

Milk Road AI

128,522 views • 26 days ago

1:07

We are releasing Carbon: a crazy fast DNA model. Carbon is 275x faster than the next best model. So fast you can process the whole human genome on a single GPU in <2 days. Here are the tricks we used: When modelling DNA sequences, a lot of the performance comes down to tokenizing the sequences in a smart way. BPE tokenizers struggle because there are no whitespaces, and character-level tokenizers (a character is called a base in DNA) waste a lot of compute on too many tokens. Carbon is built with a unique tokenizer: we split sequences into chunks of 6 bases, but during both training and inference we can work with single-base resolution. That's similar to having word tokens but resolving them at the character level. All possible thanks to the DNA tokens' unique structure. The architecture combined with the tokenizer makes the model 275x faster than the previous SoTA (Evo2) at this size. We built an interactive demo so you can explore how the model can generate DNA sequences, investigate the structure of genes, predict the effect of mutations, generate and fold proteins and even reconstruct parts of the tree of life.
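
One way to picture a 6-base token that can still be resolved at single-base level is to treat each chunk as a 6-digit base-4 number. The sketch below is not Carbon's tokenizer; the id scheme, the `encode`/`decode` names, and the requirement that the sequence length be a multiple of 6 are all assumptions made for illustration.

```python
BASES = "ACGT"
K = 6  # bases per token, as described in the post


def encode(seq):
    """Map a DNA string to one integer id per non-overlapping 6-base chunk."""
    if len(seq) % K != 0:
        raise ValueError("this sketch only handles lengths that are a multiple of 6")
    ids = []
    for start in range(0, len(seq), K):
        token_id = 0
        for base in seq[start:start + K]:
            # Each base is one base-4 digit, so the id keeps per-base structure.
            token_id = token_id * 4 + BASES.index(base)
        ids.append(token_id)
    return ids


def decode(ids):
    """Recover every individual base from the token ids."""
    chunks = []
    for token_id in ids:
        digits = []
        for _ in range(K):
            token_id, digit = divmod(token_id, 4)
            digits.append(BASES[digit])
        chunks.append("".join(reversed(digits)))
    return "".join(chunks)


ids = encode("ACGTACGGTTCA")
print(ids, decode(ids))
```

With this scheme the vocabulary has 4^6 = 4,096 entries and the sequence is six times shorter than with one token per base, while each base stays recoverable from the token id.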

Leandro von Werra

404,474 views • 2 months ago

0:59

"100 million words context window is already possible, which is roughly what a human hears in a lifetime. Inference support is the only bottleneck to achieve it. And AI Models actually do learn during the context window, without changing the weights." ~ Anthropic CEO Dario Amodei (On the 2nd point, there was this brilliant Google Paper published last week that says, LLMs can learn in context from examples in the prompt, can pick up new patterns while answering, yet their stored weights never change.) --- From 'Alex Kantrowitz' YT Channel (Full Video link in comment)

Rohan Paul

631,729 views • 9 months ago
