Video wird geladen...

Video konnte nicht geladen werden

Zur Startseite

Transformer: Multi-Head Attention ~ Math vs Code 🔢💻 ~ I made this visualization to show you how to implement the multi-head attention math in PyTorch within 50 LoC. Multi-Head Attention is what makes the Transformer's performance outstanding. It captures and represents more diverse linguistic relationships and patterns, and attends...

33,255 Aufrufe • vor 1 Jahr •via X (Twitter)

0 Kommentare

Keine Kommentare verfügbar

Kommentare vom Original-Post werden hier angezeigt

Ähnliche Videos

Single vs Multi-hand Attention by hand ✍️ Resize matrices yourself 👉 The most important fact about multi-head attention: it has the same parameter count as single-head attention. The difference is purely structural — same total Wqkv weights, partitioned into smaller q–k–v triples. Look at the two diagrams below. Both Wqkv matrices have the same height — same number of weight rows, same number of parameters. What changes is how that single tall block is sliced. • Left. One head. The full Wqkv produces one big QKV: a tall Q (36 rows), a tall K, a tall V. One scoring computation runs over those full-width tensors. • Right. 3 heads. The same-height Wqkv is sliced into 3 smaller q–k–v triples — each 12 rows tall. 3 scoring computations run in parallel, each a thinner version of the left. The compute trade-off — kind of. Same Wqkv weights. Multi-head runs the attention scoring S = Kᵀ × Q once per head, so the dot-product count multiplies by H. • Single-head: seq × seq = 40² = 1600 dot products • Multi-head: seq × seq × H = 40² × 3 = 4800 dot products (3×) But each multi-head dot product is narrower — its inner dimension is head_dim instead of H × head_dim. So when you count actual scalar multiplications, the totals are equal: • Single-head: seq² × (H × head_dim) = 40² × 36 = 57600 • Multi-head: seq² × H × head_dim = 40² × 3 × 12 = 57600 Same FLOPs. Multi-head buys you H independent attention patterns at no extra weight cost and no extra arithmetic cost — it's the same total compute, sliced into H finer-grained heads.

Tom Yeh

35,046 Aufrufe • vor 1 Monat