Transformer: Multi-Head Attention ~ Math vs Code 🔢💻 ~ I made this visualization to show you how to implement the multi-head attention math in PyTorch within 50 LoC. Multi-Head Attention is what makes the Transformer's performance outstanding. It captures and represents more diverse linguistic relationships and patterns, and attends...

Yan Chen

1,812 subscribers

33,326 views • 1 year ago • via X (Twitter)
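
The post describes implementing the multi-head attention math in under 50 lines of code. The video's own code is not available here, so the following is only a minimal sketch of the standard computation, written in NumPy rather than PyTorch so that it runs without a deep-learning framework. The function name, the weight names (`w_q`, `w_k`, `w_v`, `w_o`), and the sizes are all assumptions chosen for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Scaled dot-product self-attention with `num_heads` heads.

    x: (seq_len, d_model); every weight matrix: (d_model, d_model).
    Returns the output (seq_len, d_model) and the per-head attention
    weights (num_heads, seq_len, seq_len).
    """
    seq_len, d_model = x.shape
    head_dim = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, head_dim)
        return t.reshape(seq_len, num_heads, head_dim).transpose(1, 0, 2)

    q = split_heads(x @ w_q)
    k = split_heads(x @ w_k)
    v = split_heads(x @ w_v)

    # One score matrix per head, scaled by sqrt(head_dim).
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)
    weights = softmax(scores, axis=-1)
    heads = weights @ v  # (num_heads, seq_len, head_dim)

    # Concatenate the heads back to (seq_len, d_model), then mix them.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, weights

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 12, 3
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out, weights = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape, weights.shape)
```

Each head gets its own softmax over the keys, so every row of every head's weight matrix sums to 1.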

Related Videos

AlphaFold by hand✍️ Excel ~ I designed this exercise to show (1) MSA multi-head attention, (2) Pair triangular update, two key components of the EvoFormer architecture.👇Join the AI Math community. Download xlsx.

Tom Yeh

104,990 views • 1 year ago

Transformer by hand ✍️ Excel ~ I designed this exercise to show that the core math of a Transformer model is to combine columns (attention), combine rows (feed forward), and repeat. 👇 Join the 'AI Math' community. 👇 Download xlsx.

Tom Yeh

66,770 views • 1 year ago

THE BEST visual explainer of how information propagates through a transformer. If you want more than intuition about how the Transformer architecture is ruling the LLM world, this open-source project explains everything about LLM Transformer models. A great resource for anyone looking to gain a deeper understanding of how Transformer-based AI models like GPT work, including:
- Self-attention mechanisms
- Encoder-decoder architecture
- Positional encoding
- Multi-head attention

Rohan Paul

106,897 views • 1 year ago

Autoencoder by hand ✍️ Excel ~ I designed this exercise to show how an Encoder-Decoder network converts input to code and reconstructs input from code. It is annotated with equations, PyTorch, and graphs. I also made a medium version. 👇 Join the 'AI Math' community. Download xlsx.

Tom Yeh

54,482 views • 1 year ago

Single vs Multi-head Attention by hand ✍️ Resize matrices yourself 👉

The most important fact about multi-head attention: it has the same parameter count as single-head attention. The difference is purely structural: the same total Wqkv weights, partitioned into smaller q-k-v triples.

Look at the two diagrams below. Both Wqkv matrices have the same height: same number of weight rows, same number of parameters. What changes is how that single tall block is sliced.
• Left. One head. The full Wqkv produces one big QKV: a tall Q (36 rows), a tall K, a tall V. One scoring computation runs over those full-width tensors.
• Right. 3 heads. The same-height Wqkv is sliced into 3 smaller q-k-v triples, each 12 rows tall. 3 scoring computations run in parallel, each a thinner version of the left.

The compute trade-off, kind of. Same Wqkv weights. Multi-head runs the attention scoring S = Kᵀ × Q once per head, so the dot-product count multiplies by H.
• Single-head: seq × seq = 40² = 1600 dot products
• Multi-head: seq × seq × H = 40² × 3 = 4800 dot products (3×)

But each multi-head dot product is narrower: its inner dimension is head_dim instead of H × head_dim. So when you count actual scalar multiplications, the totals are equal:
• Single-head: seq² × (H × head_dim) = 40² × 36 = 57600
• Multi-head: seq² × H × head_dim = 40² × 3 × 12 = 57600

Same FLOPs. Multi-head buys you H independent attention patterns at no extra weight cost and no extra arithmetic cost. It's the same total compute, sliced into H finer-grained heads.

Tom Yeh

35,448 views • 3 months ago
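
The counting argument in the single- vs multi-head post above can be checked with a few lines of arithmetic. This is only a sketch of the scoring step S = Kᵀ × Q using the post's numbers (40 tokens, 3 heads, head_dim 12); the function names `scoring_cost` and `qkv_weight_rows` are made up for illustration.

```python
def scoring_cost(seq_len, num_heads, head_dim):
    """Count the work in the scoring step S = K^T x Q.

    Returns (dot_products, scalar_multiplications).
    """
    # One dot product per (key, query) pair, repeated for each head.
    dot_products = seq_len * seq_len * num_heads
    # Each dot product multiplies head_dim pairs of numbers.
    scalar_mults = dot_products * head_dim
    return dot_products, scalar_mults

def qkv_weight_rows(num_heads, head_dim):
    # Rows of the stacked Wqkv block: one Q, one K, one V slice per head.
    return 3 * num_heads * head_dim

seq_len, num_heads, head_dim = 40, 3, 12

# Single head: one head whose width is the full H * head_dim = 36.
single = scoring_cost(seq_len, 1, num_heads * head_dim)
# Multi head: three heads, each 12 wide.
multi = scoring_cost(seq_len, num_heads, head_dim)

print(single)  # (1600, 57600)
print(multi)   # (4800, 57600)
print(qkv_weight_rows(1, 36), qkv_weight_rows(3, 12))  # 108 108
```

The dot-product count triples, but the scalar-multiplication count and the Wqkv height stay the same, which is the point the post makes.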

Autoencoder by hand ✍️ Excel ~ I designed this exercise to show how an Encoder-Decoder network converts input to code and reconstructs input from code. It is annotated with equations, PyTorch, and graphs. 👇 Join the 'AI Math' community. Download xlsx.

Tom Yeh

101,555 views • 1 year ago

kling 3.0 is crazy... this model works very differently than any other model. this is from giving it just 2 images and a multi prompt, then it handled all the scenes by itself... now imagine what it could do if you gave it more references and an even more detailed multi prompt. going to test this a lot more but kling definitely cooked on this one

Miko

95,650 views • 5 months ago

QXR Head Multi Bug. The head multi is higher than it is supposed to be. Do not abuse the bug in ranked, just in case. We are not responsible for any results, and hopefully Call of Duty: Mobile will fix it soon. Also, it seems the QXR bug sometimes works and sometimes doesn't. (It's also linked to whether thermite/explosive/dragon breath rounds work as intended.) #CODMobile #callofdutymobile #codm #codmnews #codmleaks

Leakers On Duty

59,411 views • 3 months ago

Transformer by hand ✍️ ~ 6 steps walkthrough below

Open the hood of a transformer and the parts list is overwhelming: embeddings, positional encoding, attention weighting, self-attention, cross-attention, multi-head attention, layer norm, skip connections, softmax, linear, Nx, shifted right, query, key, value, masking. Which of those actually make the car run? Two of them. Attention weighting and the feed-forward network. Everything else is an enhancement to make it run faster and longer, which is how we got from a car to a truck, and to the word "large" in large language model. So I drew and calculated those two parts entirely by hand.

Goal: push five features through one transformer block, filling in every cell yourself.

1. Given. Five positions of input features, arriving from the previous block.
2. Attention matrix. Let us feed all five features to a query-key module (QK) and read back an attention weight matrix, A. The details of that module are a post of their own.
3. Attention weighting. We multiply the input features by A to get the attention weighted features, Z. Still five positions. The effect is to combine features *across positions*, horizontally: X1 becomes X1 + X2, X2 becomes X2 + X3, and so on.
4. First layer. Let us feed all five weighted features into the first layer of the FFN. Multiply by the weights and biases. This time the combining happens *across feature dimensions*, vertically, and each feature grows from 3 numbers to 4. Note that every position goes through the same weight matrix. That is what "position-wise" means.
5. ReLU. We cross out the negatives. They become zeros.
6. Second layer. Let us bring it back down: 4 dimensions to 3. The output feeds the next block, which has a completely separate set of parameters, and the whole thing runs again.

You have just calculated a transformer block by hand. ✍️

The takeaway: the two parts are doing two different jobs, and neither one alone is enough. Attention mixes *across positions*, so a feature can see its neighbours. The FFN mixes *across feature dimensions*, so each position can think about itself. Horizontal, then vertical. Then that pattern repeats N times, each block with its own separate set of weights. That is the Nx from the list up top, and that is what makes the transformer run.

💾 Save this post! #AIbyHand #Transformers #DeepLearning

Tom Yeh

25,559 views • 10 days ago
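
The six steps in the "Transformer by hand" post above can be sketched in NumPy. The post's actual numbers are not in the text, so every value below (the inputs `X`, the attention matrix `A`, and the weights `W1`, `b1`, `W2`, `b2`) is a made-up placeholder. The layout follows the post: columns are positions, rows are feature dimensions. The last position has no right neighbour, so this sketch assumes it keeps only itself.

```python
import numpy as np

# Step 1. Given: 3 feature dimensions (rows) x 5 positions (columns).
X = np.array([[1, 0, 2, 1, 0],
              [0, 1, 1, 0, 2],
              [2, 1, 0, 1, 1]])

# Step 2. Attention matrix A (5 x 5). Hypothetical weights chosen so that
# column j of X @ A equals X_j + X_{j+1}, matching "X1 becomes X1 + X2".
A = np.eye(5, dtype=int) + np.eye(5, k=-1, dtype=int)

# Step 3. Attention weighting: mixes across positions (horizontally).
Z = X @ A

# Step 4. First FFN layer: 3 -> 4 dimensions, same weights at every position.
W1 = np.array([[ 1, 0, -1],
               [ 0, 1,  0],
               [-1, 1,  0],
               [ 1, 1,  1]])
b1 = np.array([0, -1, 0, -2])
pre_activation = W1 @ Z + b1[:, None]

# Step 5. ReLU: negatives become zeros.
H = np.maximum(pre_activation, 0)

# Step 6. Second FFN layer: 4 -> 3 dimensions.
W2 = np.array([[1,  0, 1, 0],
               [0,  1, 0, 1],
               [1, -1, 0, 0]])
b2 = np.array([0, 1, 0])
Y = W2 @ H + b2[:, None]

print(Z)
print(Y)
```

Because the FFN is position-wise, running a single column of `Z` through the two layers gives the same result as the matching column of `Y`.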

Self Attention vs Cross Attention by hand ✍️ Resize the matrices yourself 👉

Two attention mechanisms, side by side. Both project X into queries; both compute attention via S = Kᵀ × Q and F = V × A. The only difference is the source of K and V.

Self attention uses X for everything. Q, K, and V all come from projecting X. Each X token attends to every other X token. The score matrix S is square: 128 × 128.

Cross attention uses X for queries and a second sequence E for keys and values. Each X token attends to every E token instead. The score matrix S is rectangular: 64 × 128.

Notice what's shared and what's not: X is the same in both, the same 36 × 128 input. Q and K share the 16 dimension; that's what makes the dot product Kᵀ × Q valid in either case. V dimensions are independent: self-attention uses 12, cross-attention uses 12. The choice doesn't depend on which mechanism you're using; it depends on what output dimension your downstream layer expects.

Tom Yeh

61,300 views • 3 months ago
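
The shapes in the self- vs cross-attention post above can be reproduced with a short NumPy sketch. It follows the post's column convention (each column is a token) and its formulas S = Kᵀ × Q and F = V × A, without the usual 1/√d scaling, which the post does not mention. The feature size of E is not given in the post, so 36 is an assumption, as are the random weights and the helper names. For brevity the same weights are reused in both calls; a real model would learn separate ones.

```python
import numpy as np

def softmax_columns(s):
    # Normalise each column, so every query's weights over the keys sum to 1.
    s = s - s.max(axis=0, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=0, keepdims=True)

def attention(x_query, x_source, w_q, w_k, w_v):
    """Columns are tokens. Queries come from x_query; keys and values
    come from x_source (the same array for self-attention)."""
    q = w_q @ x_query    # (16, n_query)
    k = w_k @ x_source   # (16, n_source)
    v = w_v @ x_source   # (12, n_source)
    s = k.T @ q          # (n_source, n_query)
    a = softmax_columns(s)
    f = v @ a            # (12, n_query)
    return s, f

rng = np.random.default_rng(0)
X = rng.normal(size=(36, 128))  # 36 features x 128 tokens, as in the post
E = rng.normal(size=(36, 64))   # 64 tokens; the 36 features are assumed

# Small random projections, for illustration only.
w_q = rng.normal(size=(16, 36)) * 0.1
w_k = rng.normal(size=(16, 36)) * 0.1
w_v = rng.normal(size=(12, 36)) * 0.1

s_self, f_self = attention(X, X, w_q, w_k, w_v)
s_cross, f_cross = attention(X, E, w_q, w_k, w_v)

print(s_self.shape, f_self.shape)    # (128, 128) (12, 128)
print(s_cross.shape, f_cross.shape)  # (64, 128) (12, 128)
```

Only the score matrix changes shape between the two cases; the output keeps one column per X token either way.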

Wow. This is one of the best interactive sites I’ve seen for learning how LLMs work! 🔥 It starts w/ a clear intro and guides you through every core component: from Embedding, Layer Norm, and Self-Attention to MLPs, Transformer blocks, Softmax, and Output layers. link in 🧵↓

Charly Wargnier

52,092 views • 1 year ago

Trump talking about North Korean dictator Kim Jong Un in 2018: “He’s the head of a country, and I mean, he’s the strong head…He speaks, and his people sit up in attention. I want my people to do the same.” When later asked by another reporter to expand on the remark, Trump said he was “kidding.” Does this scare you or is this a quality you want in a future President?

Brian Krassenstein

1,062,095 views • 2 years ago

Kling AI 3.0 is the Nano Banana Pro moment for video models. Highlight: Multi cut with up to 15s per run and enhanced lip sync. The performance of characters is the best I’ve seen so far! And you can literally use it like a reference model. This is the image I used:

Halim Alrasihi

74,855 views • 5 months ago

"I'm 20 years old and anxious about the future. How can I prepare to be part of the acceleration?" Time to grind. Inhale more Math than you think you can reasonably learn. Do this until 30 and check back in. It's what I did.

Beff (e/acc)

45,420 views • 1 year ago

Modern robotic wrist joints often use timing belt differentials to achieve smooth, multi-axis movement within compact spaces. By distributing motion through synchronized belt systems, a single actuator can control multiple rotational outputs with high precision. This design reduces weight, minimizes backlash, and allows for more efficient force transmission compared to traditional gear-based systems. It is widely used in robotic arms where accuracy, responsiveness, and compactness are critical. The result is more natural, fluid motion, bringing robotic systems closer to human-like dexterity in industrial and automation applications.

Mechanical Knowledge

117,717 views • 3 months ago

Version 3 is here. The character actions are much more ambitious this time, which also makes generation more challenging. Expect to run a few generations before landing the perfect result, but the payoff is worth it. Fork the workflow and show me what you create. 🚀

underwood

51,291 views • 1 month ago

Throughout my journey in developing multimodal models, I’ve always wanted a framework that lets me plug & play modality encoders/decoders on top of an auto-regressive LLM. I want to prototype fast, try new architectures, and have my demo files scale effortlessly, with full support for parallelism and optimization. Not just to hack ⚙️, but also to scale 🚀. So finally we built it for ourselves. LMMs-Engine: a lean, efficient framework built to train unified multimodal models at scale. From Qwen LLM, VLM, LLaVA-OV, and WanVideo, to unified models like Qwen-Omni and BAGEL, plus Linear-Attn GDN and research prototypes like RAE and SiT, all under one modular system that seamlessly integrates diverse datasets and optimization strategies. Powered by FSDP2 multi-dim parallelism, Ulysses sequence parallel, Flash-Attention, Liger Kernels, and Native Sparse Attention (also with bonus support for the Muon optimizer for all models).

Brian Li

54,822 views • 9 months ago

claude design is my all time favorite anthropic drop this year, even more than opus 4.7 its actually really really good at design and the "handoff to claude code" makes it basically copy/paste ui to the actual working product claude design made the entire ui for merl (full video on merl soon)

ashen

366,742 views • 2 months ago

Justin T. UVU Palmer Luckey I'm glad you brought that to our attention. I think this one is a bird and that's good to compare. This one is moving up and down in a more organic way as it flies left to right and the shape seems to change frame over frame. I also see motion blur that looks like wing flapping.

Crowdsource The Truth

418,568 views • 10 months ago