Tom Yeh's banner
Tom Yeh's profile picture

Tom Yeh

@ProfTomYeh55,997 subscribers

CS Prof | AI by Hand ✍️ | @CUBoulder

Shorts

Softmax vs Sigmoid ✍️ Interact 👉 = Softmax = Softmax is how deep networks turn raw scores into a probability distribution — the final layer of every classifier, and the core of every attention head in a transformer. To see what it does, picture five boba tea shops on the same block, all competing for your dollar. Five candidates: a, b, c, d, e — different chains, different brewing styles, different pearls. A boba reviewer hands you a 𝘤𝘩𝘦𝘸𝘪𝘯𝘦𝘴𝘴 𝘴𝘤𝘰𝘳𝘦 for each — higher means perfectly chewy "QQ" pearls with the right bite (ask a Taiwanese friend to find out what QQ means). Negative scores are real: mushy bobas, overcooked pearls, a batch left sitting too long. How do you turn five chewiness scores into an allocation that adds to a whole dollar? You could spend everything at the chewiest shop, but that ignores how good the runners-up are. Softmax is the smooth alternative. Read the diagram left to right. First, raise each score to e^{x} — this does two things: it turns negative chewiness into small positives, and it stretches the gaps between scores exponentially. Then sum all five into a single total Z. Finally, divide each e^{x} by Z to get a probability. The five probabilities add up to one, so you can read them as percentages of your dollar. The chewiest shop gets the biggest slice — but never the whole dollar. That's the point of softmax: it ranks confidently while still leaving room for the others. = Sigmoid = Sigmoid squashes any real number into a probability between 0 and 1 — the classic activation for binary classification, and still the gating function inside LSTMs and GRUs. Same boba block as the previous Softmax example, narrowed to just two contenders — a hot new shop `a` with chewiness score x, and your usual go-to `b` whose score is pinned at zero (the neutral baseline you've come to expect). Sigmoid is just softmax with two players, one of them pinned to zero. Read the diagram left to right. First, raise each score to e^{x} — for the usual shop `b` whose score is zero, this is just e^0 = 1 (the constant baseline). Then sum the two into a total Z. Finally, divide each e^{x} by Z to get a probability. The two probabilities add up to one — the new shop wins more of your dollar when its pearls get chewier, and your usual keeps the rest. That's the point of sigmoid: it turns a single chewiness score into a clean 0-to-1 chance you'll try the new place over your usual. --- AI Math, Algorithms, Architectures by hand ✍️ Subscribe to my 60K+ reader newsletter 👉

Softmax vs Sigmoid ✍️ Interact 👉 = Softmax = Softmax is how deep networks turn raw scores into a probability distribution — the final layer of every classifier, and the core of every attention head in a transformer. To see what it does, picture five boba tea shops on the same block, all competing for your dollar. Five candidates: a, b, c, d, e — different chains, different brewing styles, different pearls. A boba reviewer hands you a 𝘤𝘩𝘦𝘸𝘪𝘯𝘦𝘴𝘴 𝘴𝘤𝘰𝘳𝘦 for each — higher means perfectly chewy "QQ" pearls with the right bite (ask a Taiwanese friend to find out what QQ means). Negative scores are real: mushy bobas, overcooked pearls, a batch left sitting too long. How do you turn five chewiness scores into an allocation that adds to a whole dollar? You could spend everything at the chewiest shop, but that ignores how good the runners-up are. Softmax is the smooth alternative. Read the diagram left to right. First, raise each score to e^{x} — this does two things: it turns negative chewiness into small positives, and it stretches the gaps between scores exponentially. Then sum all five into a single total Z. Finally, divide each e^{x} by Z to get a probability. The five probabilities add up to one, so you can read them as percentages of your dollar. The chewiest shop gets the biggest slice — but never the whole dollar. That's the point of softmax: it ranks confidently while still leaving room for the others. = Sigmoid = Sigmoid squashes any real number into a probability between 0 and 1 — the classic activation for binary classification, and still the gating function inside LSTMs and GRUs. Same boba block as the previous Softmax example, narrowed to just two contenders — a hot new shop `a` with chewiness score x, and your usual go-to `b` whose score is pinned at zero (the neutral baseline you've come to expect). Sigmoid is just softmax with two players, one of them pinned to zero. Read the diagram left to right. First, raise each score to e^{x} — for the usual shop `b` whose score is zero, this is just e^0 = 1 (the constant baseline). Then sum the two into a total Z. Finally, divide each e^{x} by Z to get a probability. The two probabilities add up to one — the new shop wins more of your dollar when its pearls get chewier, and your usual keeps the rest. That's the point of sigmoid: it turns a single chewiness score into a clean 0-to-1 chance you'll try the new place over your usual. --- AI Math, Algorithms, Architectures by hand ✍️ Subscribe to my 60K+ reader newsletter 👉

72,875 görüntüleme

Self Attention vs Cross Attention by hand ✍️ Resize the matrices yourself 👉 Two attention mechanisms, side by side. Both project X into queries; both compute attention via S = Kᵀ × Q and F = V × A. The only difference is the source of K and V. Self attention uses X for everything. Q, K, and V all come from projecting X. Each X token attends to every other X token. The score matrix S is square — 128 × 128. Cross attention uses X for queries and a second sequence E for keys and values. Each X token attends to every E token instead. The score matrix S is rectangular — 64 × 128. Notice what's shared and what's not: X is the same in both — same 36 × 128 input. Q and K share the 16 dimension — that's what makes the dot product Kᵀ × Q valid in either case. V dimensions are independent: self-attention uses 12, cross-attention uses 12. The choice doesn't depend on which mechanism you're using; it depends on what output dimension your downstream layer expects.

Self Attention vs Cross Attention by hand ✍️ Resize the matrices yourself 👉 Two attention mechanisms, side by side. Both project X into queries; both compute attention via S = Kᵀ × Q and F = V × A. The only difference is the source of K and V. Self attention uses X for everything. Q, K, and V all come from projecting X. Each X token attends to every other X token. The score matrix S is square — 128 × 128. Cross attention uses X for queries and a second sequence E for keys and values. Each X token attends to every E token instead. The score matrix S is rectangular — 64 × 128. Notice what's shared and what's not: X is the same in both — same 36 × 128 input. Q and K share the 16 dimension — that's what makes the dot product Kᵀ × Q valid in either case. V dimensions are independent: self-attention uses 12, cross-attention uses 12. The choice doesn't depend on which mechanism you're using; it depends on what output dimension your downstream layer expects.

61,300 görüntüleme

At MIT, the only course I ever dropped was signal processing. The DFT math was too intimidating. It’s so easy to just type fft() in MATLAB and move on. Years later, I finally did DFT by hand. ✍️ If you are also afraid of DFT, I hope this helps! ⬇️ Download:

At MIT, the only course I ever dropped was signal processing. The DFT math was too intimidating. It’s so easy to just type fft() in MATLAB and move on. Years later, I finally did DFT by hand. ✍️ If you are also afraid of DFT, I hope this helps! ⬇️ Download:

259,960 görüntüleme

At MIT, I learned about RNNs in my NLP class with Prof. Michael Collins. He built a model from my keystrokes to predict who I was. To me, it felt like a magic box. Years later, when I had to teach RNNs, I forced myself to go inside the box. ⬇️ Download: First, with a tiny example on paper by hand ✍️. Then, a slightly larger one in Excel. That’s when it finally clicked: 👉 The weights are reused (weight matrices on the left side) 👉 The hidden states are passed down (H's) When I built it by hand and saw everything visually, it clicked, just math you can actually trace. Now I try to give others the same “aha!” moment I had. ⬇️ Download Excel:

At MIT, I learned about RNNs in my NLP class with Prof. Michael Collins. He built a model from my keystrokes to predict who I was. To me, it felt like a magic box. Years later, when I had to teach RNNs, I forced myself to go inside the box. ⬇️ Download: First, with a tiny example on paper by hand ✍️. Then, a slightly larger one in Excel. That’s when it finally clicked: 👉 The weights are reused (weight matrices on the left side) 👉 The hidden states are passed down (H's) When I built it by hand and saw everything visually, it clicked, just math you can actually trace. Now I try to give others the same “aha!” moment I had. ⬇️ Download Excel:

220,580 görüntüleme

Single vs Multi-hand Attention by hand ✍️ Resize matrices yourself 👉 The most important fact about multi-head attention: it has the same parameter count as single-head attention. The difference is purely structural — same total Wqkv weights, partitioned into smaller q–k–v triples. Look at the two diagrams below. Both Wqkv matrices have the same height — same number of weight rows, same number of parameters. What changes is how that single tall block is sliced. • Left. One head. The full Wqkv produces one big QKV: a tall Q (36 rows), a tall K, a tall V. One scoring computation runs over those full-width tensors. • Right. 3 heads. The same-height Wqkv is sliced into 3 smaller q–k–v triples — each 12 rows tall. 3 scoring computations run in parallel, each a thinner version of the left. The compute trade-off — kind of. Same Wqkv weights. Multi-head runs the attention scoring S = Kᵀ × Q once per head, so the dot-product count multiplies by H. • Single-head: seq × seq = 40² = 1600 dot products • Multi-head: seq × seq × H = 40² × 3 = 4800 dot products (3×) But each multi-head dot product is narrower — its inner dimension is head_dim instead of H × head_dim. So when you count actual scalar multiplications, the totals are equal: • Single-head: seq² × (H × head_dim) = 40² × 36 = 57600 • Multi-head: seq² × H × head_dim = 40² × 3 × 12 = 57600 Same FLOPs. Multi-head buys you H independent attention patterns at no extra weight cost and no extra arithmetic cost — it's the same total compute, sliced into H finer-grained heads.

Single vs Multi-hand Attention by hand ✍️ Resize matrices yourself 👉 The most important fact about multi-head attention: it has the same parameter count as single-head attention. The difference is purely structural — same total Wqkv weights, partitioned into smaller q–k–v triples. Look at the two diagrams below. Both Wqkv matrices have the same height — same number of weight rows, same number of parameters. What changes is how that single tall block is sliced. • Left. One head. The full Wqkv produces one big QKV: a tall Q (36 rows), a tall K, a tall V. One scoring computation runs over those full-width tensors. • Right. 3 heads. The same-height Wqkv is sliced into 3 smaller q–k–v triples — each 12 rows tall. 3 scoring computations run in parallel, each a thinner version of the left. The compute trade-off — kind of. Same Wqkv weights. Multi-head runs the attention scoring S = Kᵀ × Q once per head, so the dot-product count multiplies by H. • Single-head: seq × seq = 40² = 1600 dot products • Multi-head: seq × seq × H = 40² × 3 = 4800 dot products (3×) But each multi-head dot product is narrower — its inner dimension is head_dim instead of H × head_dim. So when you count actual scalar multiplications, the totals are equal: • Single-head: seq² × (H × head_dim) = 40² × 36 = 57600 • Multi-head: seq² × H × head_dim = 40² × 3 × 12 = 57600 Same FLOPs. Multi-head buys you H independent attention patterns at no extra weight cost and no extra arithmetic cost — it's the same total compute, sliced into H finer-grained heads.

35,002 görüntüleme

ReLU vs Leaky ReLU 👉 = ReLU = ReLU is the default activation in modern deep learning — cheap to compute, and stable enough to train networks hundreds of layers deep. To see what it does, picture five boba tea shops on the same block — 𝚊, 𝚋, 𝚌, 𝚍, 𝚎 — each running their own books. Each value is a shop's monthly profit — receipts minus rent, ingredients, and wages. When profit is positive, the shop stays open and the owner pockets every dollar. When profit turns negative, the shop runs out of cash and shutters — the lights go off, the books are wiped to zero. ReLU is exactly that rule, applied one shop at a time. Read the diagram left to right. The first column is the raw value x — each shop's profit at month's end. The second column is the gate: 1 if the shop is open (x > 0), 0 if it has shuttered. The last column is the ReLU output: open shops pass their profit through untouched, while shuttered ones are zeroed out. Five rows means five parallel shops on the same block, each evaluated independently. That's why ReLU is called an element-wise activation: every neuron decides its own fate. = LeakyRelu = Plain ReLU wipes negative values to zero — clean, but a shop that shutters can never recover, since both its output and its gradient stay pinned at zero. This is the dying ReLU problem, and in deep networks it can quietly kill a meaningful fraction of the units. Leaky ReLU is the one-line fix: instead of shuttering, the shop files for Chapter 11 protection and keeps the lights on at reduced capacity. Its debt is restructured down to a fraction α (typically 0.1) — the rest is forgiven, and the shop is wounded, not killed. A small negative signal still flows through, so the gradient survives, and the shop can crawl back to life if a TikTok goes viral. Read the diagram left to right. The first column is the raw value x — each shop's profit at month's end. The second column is the leakage α — the fraction of the loss held over after restructuring (default 0.1, editable). The third column is the gate: 1 for shops still in the black, α for those operating under bankruptcy protection. The last column is the Leaky ReLU output: y = x · gate. Profitable shops pass through untouched; struggling ones shrink by a factor of α but still carry a sign. Five rows means five parallel shops, each evaluated independently. Like ReLU, this is an element-wise activation: every neuron's fate is decided on its own merits. #aibyhahd

ReLU vs Leaky ReLU 👉 = ReLU = ReLU is the default activation in modern deep learning — cheap to compute, and stable enough to train networks hundreds of layers deep. To see what it does, picture five boba tea shops on the same block — 𝚊, 𝚋, 𝚌, 𝚍, 𝚎 — each running their own books. Each value is a shop's monthly profit — receipts minus rent, ingredients, and wages. When profit is positive, the shop stays open and the owner pockets every dollar. When profit turns negative, the shop runs out of cash and shutters — the lights go off, the books are wiped to zero. ReLU is exactly that rule, applied one shop at a time. Read the diagram left to right. The first column is the raw value x — each shop's profit at month's end. The second column is the gate: 1 if the shop is open (x > 0), 0 if it has shuttered. The last column is the ReLU output: open shops pass their profit through untouched, while shuttered ones are zeroed out. Five rows means five parallel shops on the same block, each evaluated independently. That's why ReLU is called an element-wise activation: every neuron decides its own fate. = LeakyRelu = Plain ReLU wipes negative values to zero — clean, but a shop that shutters can never recover, since both its output and its gradient stay pinned at zero. This is the dying ReLU problem, and in deep networks it can quietly kill a meaningful fraction of the units. Leaky ReLU is the one-line fix: instead of shuttering, the shop files for Chapter 11 protection and keeps the lights on at reduced capacity. Its debt is restructured down to a fraction α (typically 0.1) — the rest is forgiven, and the shop is wounded, not killed. A small negative signal still flows through, so the gradient survives, and the shop can crawl back to life if a TikTok goes viral. Read the diagram left to right. The first column is the raw value x — each shop's profit at month's end. The second column is the leakage α — the fraction of the loss held over after restructuring (default 0.1, editable). The third column is the gate: 1 for shops still in the black, α for those operating under bankruptcy protection. The last column is the Leaky ReLU output: y = x · gate. Profitable shops pass through untouched; struggling ones shrink by a factor of α but still carry a sign. Five rows means five parallel shops, each evaluated independently. Like ReLU, this is an element-wise activation: every neuron's fate is decided on its own merits. #aibyhahd

31,072 görüntüleme

mHC from DeepSeek. I implemented it in Excel for my Frontier AI Seminar. What is mHC? mHC stands for Manifold-Constrained Hyper Connections, published just a few days ago. This paper has quickly become the "first paper to read in 2026" for many in the community. Here's the gist: Residual Network = Neural Network + Skip Connections Hyper Connections = 1 skip connection per block to N weighted skip connections per block Manifold-Constrained = Whatever weight values to weights constrained to sum of 1 I always appreciate the DeepSeek team for releasing their work openly and quickly. That said, this paper introduces significantly more terminology than their earlier papers, and I’m not entirely sure who the intended audience is. Just to name a few: - Residual stream - Sinkhorn–Knopp algorithm - Entropically projected matrices - Birkhoff polytope - Doubly stochastic matrices 🤯 It’s a good reminder that open source is not the same as open knowledge—which is why I’d like to unpack mHC by hand, in Excel, in the next Frontier AI Seminar.

mHC from DeepSeek. I implemented it in Excel for my Frontier AI Seminar. What is mHC? mHC stands for Manifold-Constrained Hyper Connections, published just a few days ago. This paper has quickly become the "first paper to read in 2026" for many in the community. Here's the gist: Residual Network = Neural Network + Skip Connections Hyper Connections = 1 skip connection per block to N weighted skip connections per block Manifold-Constrained = Whatever weight values to weights constrained to sum of 1 I always appreciate the DeepSeek team for releasing their work openly and quickly. That said, this paper introduces significantly more terminology than their earlier papers, and I’m not entirely sure who the intended audience is. Just to name a few: - Residual stream - Sinkhorn–Knopp algorithm - Entropically projected matrices - Birkhoff polytope - Doubly stochastic matrices 🤯 It’s a good reminder that open source is not the same as open knowledge—which is why I’d like to unpack mHC by hand, in Excel, in the next Frontier AI Seminar.

93,743 görüntüleme

I’ve implemented SVMs three times in my life — first in MATLAB as a grad student, again in C++ during my postdoc, and finally by hand ✍️ as a professor who needed to truly understand it well enough to teach it. If you’re curious, here’s my full 19-step walkthrough: ---- 100% original, made by hand ✍️ Join 52K+ readers of my newsletter:

I’ve implemented SVMs three times in my life — first in MATLAB as a grad student, again in C++ during my postdoc, and finally by hand ✍️ as a professor who needed to truly understand it well enough to teach it. If you’re curious, here’s my full 19-step walkthrough: ---- 100% original, made by hand ✍️ Join 52K+ readers of my newsletter:

103,726 görüntüleme

Vector Database by Hand ✍️ Vector databases are revolutionizing how we search and analyze complex data. They have become the backbone of Retrieval Augmented Generation (#RAG). How do vector databases work? [1] Given ↳ A dataset of three sentences, each has 3 words (or tokens) ↳ In practice, a dataset may contain millions or billions of sentences. The max number of tokens may be tens of thousands (e.g., 32,768 mistral-7b). Process "how are you" [2] 🟨 Word Embeddings ↳ For each word, look up corresponding word embedding vector from a table of 22 vectors, where 22 is the vocabulary size. ↳ In practice, the vocabulary size can be tens of thousands. The word embedding dimensions are in the thousands (e.g., 1024, 4096) [3] 🟩 Encoding ↳ Feed the sequence of word embeddings to an encoder to obtain a sequence of feature vectors, one per word. ↳ Here, the encoder is a simple one layer perceptron (linear layer + ReLU) ↳ In practice, the encoder is a transformer or one of its many variants. [4] 🟩 Mean Pooling ↳ Merge the sequence of feature vectors into a single vector using "mean pooling" which is to average across the columns. ↳ The result is a single vector. We often call it "text embeddings" or "sentence embeddings." ↳ Other pooling techniques are possible, such as CLS. But mean pooling is the most common. [5] 🟦 Indexing ↳ Reduce the dimensions of the text embedding vector by a projection matrix. The reduction rate is 50% (4->2). ↳ In practice, the values in this projection matrix is much more random. ↳ The purpose is similar to that of hashing, which is to obtain a short representation to allow faster comparison and retrieval. ↳ The resulting dimension-reduced index vector is saved in the vector storage. [6] Process "who are you" ↳ Repeat [2]-[5] [7] Process "who am I" ↳ Repeat [2]-[5] Now we have indexed our dataset in the vector database. [8] 🟥 Query: "am I you" ↳ Repeat [2]-[5] ↳ The result is a 2-d query vector. [9] 🟥 Dot Products ↳ Take dot product between the query vector and database vectors. They are all 2-d. ↳ The purpose is to use dot product to estimate similarity. ↳ By transposing the query vector, this step becomes a matrix multiplication. [10] 🟥 Nearest Neighbor ↳ Find the largest dot product by linear scan. ↳ The sentence with the highest dot product is "who am I" ↳ In practice, because scanning billions of vectors is slow, we use an Approximate Nearest Neighbor (ANN) algorithm like the Hierarchical Navigable Small Worlds (HNSW).

Vector Database by Hand ✍️ Vector databases are revolutionizing how we search and analyze complex data. They have become the backbone of Retrieval Augmented Generation (#RAG). How do vector databases work? [1] Given ↳ A dataset of three sentences, each has 3 words (or tokens) ↳ In practice, a dataset may contain millions or billions of sentences. The max number of tokens may be tens of thousands (e.g., 32,768 mistral-7b). Process "how are you" [2] 🟨 Word Embeddings ↳ For each word, look up corresponding word embedding vector from a table of 22 vectors, where 22 is the vocabulary size. ↳ In practice, the vocabulary size can be tens of thousands. The word embedding dimensions are in the thousands (e.g., 1024, 4096) [3] 🟩 Encoding ↳ Feed the sequence of word embeddings to an encoder to obtain a sequence of feature vectors, one per word. ↳ Here, the encoder is a simple one layer perceptron (linear layer + ReLU) ↳ In practice, the encoder is a transformer or one of its many variants. [4] 🟩 Mean Pooling ↳ Merge the sequence of feature vectors into a single vector using "mean pooling" which is to average across the columns. ↳ The result is a single vector. We often call it "text embeddings" or "sentence embeddings." ↳ Other pooling techniques are possible, such as CLS. But mean pooling is the most common. [5] 🟦 Indexing ↳ Reduce the dimensions of the text embedding vector by a projection matrix. The reduction rate is 50% (4->2). ↳ In practice, the values in this projection matrix is much more random. ↳ The purpose is similar to that of hashing, which is to obtain a short representation to allow faster comparison and retrieval. ↳ The resulting dimension-reduced index vector is saved in the vector storage. [6] Process "who are you" ↳ Repeat [2]-[5] [7] Process "who am I" ↳ Repeat [2]-[5] Now we have indexed our dataset in the vector database. [8] 🟥 Query: "am I you" ↳ Repeat [2]-[5] ↳ The result is a 2-d query vector. [9] 🟥 Dot Products ↳ Take dot product between the query vector and database vectors. They are all 2-d. ↳ The purpose is to use dot product to estimate similarity. ↳ By transposing the query vector, this step becomes a matrix multiplication. [10] 🟥 Nearest Neighbor ↳ Find the largest dot product by linear scan. ↳ The sentence with the highest dot product is "who am I" ↳ In practice, because scanning billions of vectors is slow, we use an Approximate Nearest Neighbor (ANN) algorithm like the Hierarchical Navigable Small Worlds (HNSW).

191,919 görüntüleme

I built MatmulFlow ( — an interactive tool that makes matrix multiplication dimensions visual, part of my AI by Hand ✍️ series. Matrix multiplication dimensions are confusing. Which is the inner dimension? Columns of the first or rows of the second? And when you chain five multiplications together, it gets worse. The idea: represent matrices as rectangles. Shift the second matrix up and to the right. The edges that must align become obvious. The result fills in the remaining space. No memorization. You can see it. It extends to chains. Stack vertically for left-multiplication. Stack horizontally for right-multiplication. Resize any matrix and watch the dimensions "flow" through the entire chain. Give it a try!

I built MatmulFlow ( — an interactive tool that makes matrix multiplication dimensions visual, part of my AI by Hand ✍️ series. Matrix multiplication dimensions are confusing. Which is the inner dimension? Columns of the first or rows of the second? And when you chain five multiplications together, it gets worse. The idea: represent matrices as rectangles. Shift the second matrix up and to the right. The edges that must align become obvious. The result fills in the remaining space. No memorization. You can see it. It extends to chains. Stack vertically for left-multiplication. Stack horizontally for right-multiplication. Resize any matrix and watch the dimensions "flow" through the entire chain. Give it a try!

25,992 görüntüleme

Autoencoder by hand✍️Excel~ I designed this exercise to show how an Encoder-Decoder network convert input to code and reconstruct input from code. It is annotated with equations, PyTorch, and graphs. 👇Join the 'AI Math' community. Download xlsx.

Autoencoder by hand✍️Excel~ I designed this exercise to show how an Encoder-Decoder network convert input to code and reconstruct input from code. It is annotated with equations, PyTorch, and graphs. 👇Join the 'AI Math' community. Download xlsx.

101,555 görüntüleme

Backpropagation by hand✍️ ~ spreadsheet. I designed this exercise to show it is possible to calculate backpropagation for a non-trivial, three-layer network by hand. p.s. I just started this community to share useful resources on AI math. 👇 Join the community.

Backpropagation by hand✍️ ~ spreadsheet. I designed this exercise to show it is possible to calculate backpropagation for a non-trivial, three-layer network by hand. p.s. I just started this community to share useful resources on AI math. 👇 Join the community.

121,825 görüntüleme

AlphaFold by hand✍️ Excel ~ I designed this exercise to show (1) MSA multi-head attention, (2) Pair triangular update, two key components of the EvoFormer architecture.👇Join the AI Math community. Download xlsx.

AlphaFold by hand✍️ Excel ~ I designed this exercise to show (1) MSA multi-head attention, (2) Pair triangular update, two key components of the EvoFormer architecture.👇Join the AI Math community. Download xlsx.

104,965 görüntüleme

[Discrete Fourier Transform] by Hand ✍️ In signal processing, the Discrete Fourier Transform (DFT) is no doubt the most important method. But the math involved is extremely complex, literally, involving a summation over a complex number term e^(-iwt). I developed this exercise to demonstrate that underneath such complexity, DFT is just a series of matrix multiplications you can calculate by hand. ✍️ Once you see that, it should not surprise you that a deep neural network, which is also a series of matrix multiplications, with activation functions in-between, can learn to perform DFT to process and analyze signals so effectively. How does DFT work? [1] Given ↳ Signals A, B, and C in the 🟧 frequency domain: ◦ A = cos(w) + 2cos(2w) ◦ B = cos(w) + cos(3w) + cos(4w) ◦ C = -cos(2w) + cos(3w) ◦ Each signal is a weighed sum of four cosine waves at frequencies 1w, 2w, 3w, and 4w. ◦ We will apply Inverse DFT to convert the signals to time domain representations, and then demonstrate DFT can convert back to their original frequency domain representations. ↳ Signal X in the 🟩 time domain. X is sampled at 10 time points 1t, 2t, …, 10t: ◦ X = [-2.5, -1.8, 3, -0.7, -1.0, -0.7, 3, -1.8, -2.5, 5] ◦ Suppose X is also a weighted sum of the same four cosine waves, but we don’t already know their weights. We will apply DFT to discover them. [2] 🟧 Frequency Matrix (F) ↳ Write the coefficients of A, B, C as a matrix F. Each signal is a row. Each frequency is a column. ↳ A → [1, 2, 0, 0] ↳ B → [1, 0, 1, 1] ↳ C → [0, 1-, 1, 0] [3] Cosine → Discrete ↳ Sample from the continuous cosine waves at discrete time points 1t, 2t, 3t, to 10t. [4] Cosine Matrix (W) ↳ Write the samples as a matrix, Each frequency is a row. Each time point is a column. [5] Inverse DFT: 🟧 Frequency → 🟩 Time ↳ Multiply the frequency matrix F and the cosine matrix W. ↳ The meaning of this multiplication is to linearly combine the four cosine waves (rows in W) into time-domain signals (rows in T) using the weights specified in F. ↳ The result is matrix T, which are signals A, B, C converted to the time domain. Each signal is a row. Each time point is a column. [6] Transpose ↳ Transpose T, converting each signal’s time domain representation from a row to a column. [7] DFT: 🟩 Time → 🟧 Frequency ↳ Multiply the cosine matrix W with the transpose of matrix T. ↳ The purpose of this multiplication is to take a dot-product between each time-domain signal (columns in the transpose of T) and each cosine wave (rows in W), which has the effect of projecting the signal onto a cosine wave to determine how much they are correlated. Zero means not correlated at all. ↳ The result is an intermediate version of the “recovered” frequency matrix where each column corresponds to a signal and each row corresponds to a frequency. ↳ Compared to the original frequency matrix F, this intermediate matrix has non-zero weights in the correct places, but scaled up by a factor of 5 (n/2, n=10). For example, signal A, originally [1,2,0,0], is recovered at [5,10,0,0]. [8] Scale ↳ Multiply each value by 2/n = 1/5 to scale down the intermediate matrix to match the magnitude of the original frequency matrix F. [9] Transpose ↳ Transpose the recovered frequency matrix back to the same orientation of the original frequency matrix F. ↳ Like magic 🪄, the result is identical to the original F, which means DFT successfully recovered the frequency components of signals A, B, C. [10] Apply DFT to X: 🟩 Time → 🟧 Frequency ↳ Now that we have some confidence in DFT’s ability to recover frequency components, we apply DFT to X’s time-domain representation by multiplying W with X. ↳ The result is the an intermediate matrix. [11] Scale ↳ Similarly, we scale down by a factor of 5 to obtain the recovered frequency components of X (a column). [12] Transpose ↳ Similarly, we transpose the recovered column to row to match the orientation of the frequency matrix. ↳ Using the coefficients [0,0,3,2], we can write the equation of X as 3cos(3w) + 2cos(4w). Notes: I hope this by hand exercise helps you understand the essence of DFT. But there is more technical details, such as: • Sine: The complete DFT math also includes sine waves that follow a similar calculation process. • Phase: Here, we assume all the cosine waves are aligned at the origin, namely, phase is 0. If a phase p is added, for example, cos(w+p), we will need to calculate the sine component and use their ratio to figure out what p is. • Magnitude: If phase is not zero, the magnitude will need to be calculated by combining both cosine and sine terms.

[Discrete Fourier Transform] by Hand ✍️ In signal processing, the Discrete Fourier Transform (DFT) is no doubt the most important method. But the math involved is extremely complex, literally, involving a summation over a complex number term e^(-iwt). I developed this exercise to demonstrate that underneath such complexity, DFT is just a series of matrix multiplications you can calculate by hand. ✍️ Once you see that, it should not surprise you that a deep neural network, which is also a series of matrix multiplications, with activation functions in-between, can learn to perform DFT to process and analyze signals so effectively. How does DFT work? [1] Given ↳ Signals A, B, and C in the 🟧 frequency domain: ◦ A = cos(w) + 2cos(2w) ◦ B = cos(w) + cos(3w) + cos(4w) ◦ C = -cos(2w) + cos(3w) ◦ Each signal is a weighed sum of four cosine waves at frequencies 1w, 2w, 3w, and 4w. ◦ We will apply Inverse DFT to convert the signals to time domain representations, and then demonstrate DFT can convert back to their original frequency domain representations. ↳ Signal X in the 🟩 time domain. X is sampled at 10 time points 1t, 2t, …, 10t: ◦ X = [-2.5, -1.8, 3, -0.7, -1.0, -0.7, 3, -1.8, -2.5, 5] ◦ Suppose X is also a weighted sum of the same four cosine waves, but we don’t already know their weights. We will apply DFT to discover them. [2] 🟧 Frequency Matrix (F) ↳ Write the coefficients of A, B, C as a matrix F. Each signal is a row. Each frequency is a column. ↳ A → [1, 2, 0, 0] ↳ B → [1, 0, 1, 1] ↳ C → [0, 1-, 1, 0] [3] Cosine → Discrete ↳ Sample from the continuous cosine waves at discrete time points 1t, 2t, 3t, to 10t. [4] Cosine Matrix (W) ↳ Write the samples as a matrix, Each frequency is a row. Each time point is a column. [5] Inverse DFT: 🟧 Frequency → 🟩 Time ↳ Multiply the frequency matrix F and the cosine matrix W. ↳ The meaning of this multiplication is to linearly combine the four cosine waves (rows in W) into time-domain signals (rows in T) using the weights specified in F. ↳ The result is matrix T, which are signals A, B, C converted to the time domain. Each signal is a row. Each time point is a column. [6] Transpose ↳ Transpose T, converting each signal’s time domain representation from a row to a column. [7] DFT: 🟩 Time → 🟧 Frequency ↳ Multiply the cosine matrix W with the transpose of matrix T. ↳ The purpose of this multiplication is to take a dot-product between each time-domain signal (columns in the transpose of T) and each cosine wave (rows in W), which has the effect of projecting the signal onto a cosine wave to determine how much they are correlated. Zero means not correlated at all. ↳ The result is an intermediate version of the “recovered” frequency matrix where each column corresponds to a signal and each row corresponds to a frequency. ↳ Compared to the original frequency matrix F, this intermediate matrix has non-zero weights in the correct places, but scaled up by a factor of 5 (n/2, n=10). For example, signal A, originally [1,2,0,0], is recovered at [5,10,0,0]. [8] Scale ↳ Multiply each value by 2/n = 1/5 to scale down the intermediate matrix to match the magnitude of the original frequency matrix F. [9] Transpose ↳ Transpose the recovered frequency matrix back to the same orientation of the original frequency matrix F. ↳ Like magic 🪄, the result is identical to the original F, which means DFT successfully recovered the frequency components of signals A, B, C. [10] Apply DFT to X: 🟩 Time → 🟧 Frequency ↳ Now that we have some confidence in DFT’s ability to recover frequency components, we apply DFT to X’s time-domain representation by multiplying W with X. ↳ The result is the an intermediate matrix. [11] Scale ↳ Similarly, we scale down by a factor of 5 to obtain the recovered frequency components of X (a column). [12] Transpose ↳ Similarly, we transpose the recovered column to row to match the orientation of the frequency matrix. ↳ Using the coefficients [0,0,3,2], we can write the equation of X as 3cos(3w) + 2cos(4w). Notes: I hope this by hand exercise helps you understand the essence of DFT. But there is more technical details, such as: • Sine: The complete DFT math also includes sine waves that follow a similar calculation process. • Phase: Here, we assume all the cosine waves are aligned at the origin, namely, phase is 0. If a phase p is added, for example, cos(w+p), we will need to calculate the sine component and use their ratio to figure out what p is. • Magnitude: If phase is not zero, the magnitude will need to be calculated by combining both cosine and sine terms.

116,622 görüntüleme

[LSTM] by Hand ✍️ LSTMs have been the most effective architecture to process long sequences of data, until our world was taken over by the Transformers. LSTMs belong to the broader family of recurrent neural network (RNNs) that process data sequentially in a recurrent manner. Transformers, on the other hand, abandon recurrence and use self-attention instead to process data concurrently in parallel. Recently, there is renewed interest in recurrence as people realized self-attention doesn’t scale to extremely long sequences, like hundreds of thousands of tokens. Mamba is a good example to bring back recurrence. All of a sudden, it is cool to study LSTMs. How do LSTMs work? [1] Given ↳ 🟨 Input sequence X1, X2, X3 (d = 3) ↳ 🟩 Hidden state h (d = 2) ↳ 🟦 Memory C (d = 2) ↳ Weight matrices Wf, Wc, Wi, Wo Process t = 1 [2] Initialize ↳ Randomly set the previous hidden state h0 to [1, 1] and memory cells C0 to [0.3, -0.5] [3] Linear Transform ↳ Multiply the four weight matrices with the concatenation of current input (X1) and the previous hidden state (h0). ↳ The results are feature values, each is a linear combination of the current input and hidden state. [4] Non-linear Transform ↳ Apply sigmoid σ to obtain gate values (between 0 and 1). • Forget gate (f1): [-4, -6] → [0, 0] • Input gate (i1): [6, 4] → [1, 1] • Output gate (o1): [4, -5] → [1, 0] ↳ Apply tanh to obtain candidate memory values (between -1 and 1) • Candidate memory (C’1): [1, -6] → [0.8, -1] [5] Update Memory ↳ Forget (C0 .* f1): Element-wise multiply the current memory with forget gate values. ↳ Input (C’1 .* o1): Element-wise multiply the “candidate” memory with input gate values. ↳ Update the memory to C1 by adding the two terms above: C0 .* f1 + C’1 .* o1 = C1 [6] Candiate Output ↳ Apply tanh to the new memory C1 to obtain candidate output o’1. [0.8, -1] → [0.7, -0.8] [7] Update Hidden State ↳ Output (o’1 .* o1 → h1): Element-wise multiply the candidate output with the output gate. ↳ The result is updated hidden state h1 ↳ Also, it is the first output. Process t = 2 [8] Initialize ↳ Copy previous hidden state h1 and memory C1 [9] Linear Transform ↳ Repeat [3] [10] Update Memory (C2) ↳ Repeat [4] and [5] [11] Update Hidden State (h2) ↳ Repeat [6] and [7] Process t = 3 [12] Initialize ↳ Copy previous hidden state h2 and memory C2 [13] Linear Transform ↳ Repeat [3] [14] Update Memory (C3) ↳ Repeat [4] and [5] [15] Update Hidden State (h3) ↳ Repeat [6] and [7]

[LSTM] by Hand ✍️ LSTMs have been the most effective architecture to process long sequences of data, until our world was taken over by the Transformers. LSTMs belong to the broader family of recurrent neural network (RNNs) that process data sequentially in a recurrent manner. Transformers, on the other hand, abandon recurrence and use self-attention instead to process data concurrently in parallel. Recently, there is renewed interest in recurrence as people realized self-attention doesn’t scale to extremely long sequences, like hundreds of thousands of tokens. Mamba is a good example to bring back recurrence. All of a sudden, it is cool to study LSTMs. How do LSTMs work? [1] Given ↳ 🟨 Input sequence X1, X2, X3 (d = 3) ↳ 🟩 Hidden state h (d = 2) ↳ 🟦 Memory C (d = 2) ↳ Weight matrices Wf, Wc, Wi, Wo Process t = 1 [2] Initialize ↳ Randomly set the previous hidden state h0 to [1, 1] and memory cells C0 to [0.3, -0.5] [3] Linear Transform ↳ Multiply the four weight matrices with the concatenation of current input (X1) and the previous hidden state (h0). ↳ The results are feature values, each is a linear combination of the current input and hidden state. [4] Non-linear Transform ↳ Apply sigmoid σ to obtain gate values (between 0 and 1). • Forget gate (f1): [-4, -6] → [0, 0] • Input gate (i1): [6, 4] → [1, 1] • Output gate (o1): [4, -5] → [1, 0] ↳ Apply tanh to obtain candidate memory values (between -1 and 1) • Candidate memory (C’1): [1, -6] → [0.8, -1] [5] Update Memory ↳ Forget (C0 .* f1): Element-wise multiply the current memory with forget gate values. ↳ Input (C’1 .* o1): Element-wise multiply the “candidate” memory with input gate values. ↳ Update the memory to C1 by adding the two terms above: C0 .* f1 + C’1 .* o1 = C1 [6] Candiate Output ↳ Apply tanh to the new memory C1 to obtain candidate output o’1. [0.8, -1] → [0.7, -0.8] [7] Update Hidden State ↳ Output (o’1 .* o1 → h1): Element-wise multiply the candidate output with the output gate. ↳ The result is updated hidden state h1 ↳ Also, it is the first output. Process t = 2 [8] Initialize ↳ Copy previous hidden state h1 and memory C1 [9] Linear Transform ↳ Repeat [3] [10] Update Memory (C2) ↳ Repeat [4] and [5] [11] Update Hidden State (h2) ↳ Repeat [6] and [7] Process t = 3 [12] Initialize ↳ Copy previous hidden state h2 and memory C2 [13] Linear Transform ↳ Repeat [3] [14] Update Memory (C3) ↳ Repeat [4] and [5] [15] Update Hidden State (h3) ↳ Repeat [6] and [7]

72,860 görüntüleme

[Backpropagation] by Hand✍️ [1] Forward Pass ↳ Given a multi layer perceptron (3 levels), an input vector X, predictions Y^{Pred} = [0.5, 0.5, 0], and ground truth label Y^{Target} = [0, 1, 0]. [2] Backpropagation ↳ Insert cells to hold our calculations. [3] Layer 3 - Softmax (blue) ↳ Calculate ∂L / ∂z3 directly using the simple equation: Y^{Pred} - Y^{Target} = [0.5, -0.5, 0]. ↳ This simple equation is the benefit of using Softmax and Cross Entropy Loss together. [4] Layer 3 - Weights (orange) & Biases (black) ↳ Calculate ∂L / ∂W3 and ∂L / ∂b3 by multiplying ∂L / ∂z3 and [ a2 | 1 ]. [5] Layer 2 - Activations (green) ↳ Calculate ∂L / ∂a2 by multiplying ∂L / ∂z3 and W3. [6] Layer 2 - ReLU (blue) ↳ Calculate ∂L / ∂z2 by multiplying ∂L / ∂a2 with 1 for positive values and 0 otherwise. [7] Layer 2 - Weights (orange) & Biases (black) ↳ Calculate ∂L / ∂W2 and ∂L / ∂b2 by multiplying ∂L / ∂z2 and [ a1 | 1 ]. [8] Layer 1 - Activations (green) ↳ Calculate ∂L / ∂a1 by multiplying ∂L / ∂z2 and W2. [9] Layer 1 - ReLU (blue) ↳ Calculate ∂L / ∂z1 by multiplying ∂L / ∂a1 with 1 for positive values and 0 otherwise. [10] Layer 1 - Weights (orange) & Biases (black) ↳ Calculate ∂L / ∂W1 and ∂L / ∂b1 by multiplying ∂L / ∂z1 and [ x | 1 ]. [11] Gradient Descent ↳ Update weights and biases (typically a learning rate is applied here). 💡 Matrix Multiplication is All You Need: Just like in the forward pass, backpropagation is all about matrix multiplications. You can definitely do everything by hand as I demonstrated in this exercise, albeit slow and imperfect. This is why GPU's ability to multiply matrices efficiently plays such an important role in the deep learning evolution. This is why NVIDIA is now close to $1 trillion in valuation. 💡Exploding Gradients: We can already see the gradients are getting larger as we back-propagate up, even in this simple 3-layer network. This motivates using methods like skip connections to handle exploding (or diminishing) gradients as in the ResNet. I did the calculations entirely by hand. Please let me know if you spot any error or have any questions!

[Backpropagation] by Hand✍️ [1] Forward Pass ↳ Given a multi layer perceptron (3 levels), an input vector X, predictions Y^{Pred} = [0.5, 0.5, 0], and ground truth label Y^{Target} = [0, 1, 0]. [2] Backpropagation ↳ Insert cells to hold our calculations. [3] Layer 3 - Softmax (blue) ↳ Calculate ∂L / ∂z3 directly using the simple equation: Y^{Pred} - Y^{Target} = [0.5, -0.5, 0]. ↳ This simple equation is the benefit of using Softmax and Cross Entropy Loss together. [4] Layer 3 - Weights (orange) & Biases (black) ↳ Calculate ∂L / ∂W3 and ∂L / ∂b3 by multiplying ∂L / ∂z3 and [ a2 | 1 ]. [5] Layer 2 - Activations (green) ↳ Calculate ∂L / ∂a2 by multiplying ∂L / ∂z3 and W3. [6] Layer 2 - ReLU (blue) ↳ Calculate ∂L / ∂z2 by multiplying ∂L / ∂a2 with 1 for positive values and 0 otherwise. [7] Layer 2 - Weights (orange) & Biases (black) ↳ Calculate ∂L / ∂W2 and ∂L / ∂b2 by multiplying ∂L / ∂z2 and [ a1 | 1 ]. [8] Layer 1 - Activations (green) ↳ Calculate ∂L / ∂a1 by multiplying ∂L / ∂z2 and W2. [9] Layer 1 - ReLU (blue) ↳ Calculate ∂L / ∂z1 by multiplying ∂L / ∂a1 with 1 for positive values and 0 otherwise. [10] Layer 1 - Weights (orange) & Biases (black) ↳ Calculate ∂L / ∂W1 and ∂L / ∂b1 by multiplying ∂L / ∂z1 and [ x | 1 ]. [11] Gradient Descent ↳ Update weights and biases (typically a learning rate is applied here). 💡 Matrix Multiplication is All You Need: Just like in the forward pass, backpropagation is all about matrix multiplications. You can definitely do everything by hand as I demonstrated in this exercise, albeit slow and imperfect. This is why GPU's ability to multiply matrices efficiently plays such an important role in the deep learning evolution. This is why NVIDIA is now close to $1 trillion in valuation. 💡Exploding Gradients: We can already see the gradients are getting larger as we back-propagate up, even in this simple 3-layer network. This motivates using methods like skip connections to handle exploding (or diminishing) gradients as in the ResNet. I did the calculations entirely by hand. Please let me know if you spot any error or have any questions!

64,645 görüntüleme

Autoencoder by hand✍️Excel~ I designed this exercise to show how an Encoder-Decoder network convert input to code and reconstruct input from code. It is annotated with equations, PyTorch, and graphs. I also made a medium version.👇Join the 'AI Math' community. Download xlsx.

Autoencoder by hand✍️Excel~ I designed this exercise to show how an Encoder-Decoder network convert input to code and reconstruct input from code. It is annotated with equations, PyTorch, and graphs. I also made a medium version.👇Join the 'AI Math' community. Download xlsx.

54,473 görüntüleme

Retrieval Weighting RAG by hand ✍️ + Langflow . I am designing a series of exercises to teach advanced RAG techniques. This is No. 5. Previous exercises are in the comment.

Retrieval Weighting RAG by hand ✍️ + Langflow . I am designing a series of exercises to teach advanced RAG techniques. This is No. 5. Previous exercises are in the comment.

48,408 görüntüleme

[Graph Convolutional Network] by hand ✍️ Graph Convolutional Networks (GCNs), introduced by Thomas Kipf and Max Welling in 2017, have emerged as a powerful tool in the analysis and interpretation of data structured as graphs. This exercise demonstrates how GCN works in a simple application: binary classification. -- Goal -- Predict if a node in a graph is X. -- Architecture -- 🟪 Graph Convolutional Network (GCN) 1. GCN1(4,3) 2. GCN2(3,3) 🟦 Fully Connected Network (FCN) 1. Linear1(3,5) 2. ReLU 3. Linear2(5,1) 4. Sigmoid Simplications: • Adjacent matrices are not normalized. • ReLU is applied to messages directly. -- Walkthrough -- [1] Given ↳ A graph with five nodes A, B, C, D, E [2] 🟩 Adjacency Matrix: Neighbors ↳ Add 1 for each edge to neighbors ↳ Repeat in both directions (e.g., A->C, C->A) ↳ Repeat for both GCN layers [3] 🟩 Adjacency Matrix: Self ↳ Add 1's for each self loop ↳ Equivalent to adding the identity matrix ↳ Repeat for both GCN layers [4] 🟪 GCN1: Messages ↳ Multiply the node embeddings 🟨 with weights and biases ↳ Apply ReLU (negatives → 0) ↳ The result is one message per node [5] 🟪 GCN1: Pooling ↳ Multiply the messages with the adjacent matrix ↳ The purpose is the pool messages from each node's neighbors as well as from the node itself. ↳ The result is a new feature per node [6] 🟪 GCN1: Visualize ↳ For node 1, visualize how messages are pooled to obtain a new feature for better understanding ↳ [3,0,1] + [1,0,0] = [4,0,1] [7] 🟪 GCN2: Messages ↳ Multiply the node features with weights and biases ↳ Apply ReLU (negatives → 0) ↳ The result is one message per node [8] 🟪 GCN2: Pooling ↳ Multiply the messages with the adjacent matrix ↳ The result is a new feature per node [9] 🟪 GCN2: Visualize ↳ For node 3, visualize how messages are pooled to obtain a new feature for better understanding ↳ [1,2,4] + [1,3,5] + [0,0,1] = [2,5,10] [10] 🟦 FCN: Linear 1 + ReLU ↳ Multiply node features with weights and biases ↳ Apply ReLU (negatives → 0) ↳ The result is a new feature per node ↳ Unlike in GCN layers, no messages from other nodes are included. [11] 🟦 FCN: Linear 2 ↳ Multiply node features with weights and biases [12] 🟦 FCN: Sigmoid ↳ Apply the Sigmoid activation function ↳ The purpose is to obtain a probability value for each node ↳ One way to calculate Sigmoid by hand ✍️ is to use the approximation below: • >= 3 → 1 • 0 → 0.5 • <= -3 → 0 -- Outputs -- A: 0 (Very unlikely) B: 1 (Very likely) C: 1 (Very likely) D: 1 (Very likely) E: 0.5 (Neutral)

[Graph Convolutional Network] by hand ✍️ Graph Convolutional Networks (GCNs), introduced by Thomas Kipf and Max Welling in 2017, have emerged as a powerful tool in the analysis and interpretation of data structured as graphs. This exercise demonstrates how GCN works in a simple application: binary classification. -- Goal -- Predict if a node in a graph is X. -- Architecture -- 🟪 Graph Convolutional Network (GCN) 1. GCN1(4,3) 2. GCN2(3,3) 🟦 Fully Connected Network (FCN) 1. Linear1(3,5) 2. ReLU 3. Linear2(5,1) 4. Sigmoid Simplications: • Adjacent matrices are not normalized. • ReLU is applied to messages directly. -- Walkthrough -- [1] Given ↳ A graph with five nodes A, B, C, D, E [2] 🟩 Adjacency Matrix: Neighbors ↳ Add 1 for each edge to neighbors ↳ Repeat in both directions (e.g., A->C, C->A) ↳ Repeat for both GCN layers [3] 🟩 Adjacency Matrix: Self ↳ Add 1's for each self loop ↳ Equivalent to adding the identity matrix ↳ Repeat for both GCN layers [4] 🟪 GCN1: Messages ↳ Multiply the node embeddings 🟨 with weights and biases ↳ Apply ReLU (negatives → 0) ↳ The result is one message per node [5] 🟪 GCN1: Pooling ↳ Multiply the messages with the adjacent matrix ↳ The purpose is the pool messages from each node's neighbors as well as from the node itself. ↳ The result is a new feature per node [6] 🟪 GCN1: Visualize ↳ For node 1, visualize how messages are pooled to obtain a new feature for better understanding ↳ [3,0,1] + [1,0,0] = [4,0,1] [7] 🟪 GCN2: Messages ↳ Multiply the node features with weights and biases ↳ Apply ReLU (negatives → 0) ↳ The result is one message per node [8] 🟪 GCN2: Pooling ↳ Multiply the messages with the adjacent matrix ↳ The result is a new feature per node [9] 🟪 GCN2: Visualize ↳ For node 3, visualize how messages are pooled to obtain a new feature for better understanding ↳ [1,2,4] + [1,3,5] + [0,0,1] = [2,5,10] [10] 🟦 FCN: Linear 1 + ReLU ↳ Multiply node features with weights and biases ↳ Apply ReLU (negatives → 0) ↳ The result is a new feature per node ↳ Unlike in GCN layers, no messages from other nodes are included. [11] 🟦 FCN: Linear 2 ↳ Multiply node features with weights and biases [12] 🟦 FCN: Sigmoid ↳ Apply the Sigmoid activation function ↳ The purpose is to obtain a probability value for each node ↳ One way to calculate Sigmoid by hand ✍️ is to use the approximation below: • >= 3 → 1 • 0 → 0.5 • <= -3 → 0 -- Outputs -- A: 0 (Very unlikely) B: 1 (Very likely) C: 1 (Very likely) D: 1 (Very likely) E: 0.5 (Neutral)

46,499 görüntüleme

Videos

ProfTomYeh's profile picture

SORA by Hand ✍️ OpenAI’s #SORA took over the Internet when it was announced earlier this year. The technology behind Sora is the Diffusion Transformer (DiT) developed by William Peebles and Shining Xie. How does DiT work? 𝗚𝗼𝗮𝗹: Generate a video conditioned by a text prompt and a series of diffusion steps [1] Given ↳ Video ↳ Prompt: "sora is sky" ↳ Diffusion step: t = 3 [2] Video → Patches ↳ Divide all pixels in all frames into 4 spacetime patches [3] Visual Encoder: Pixels 🟨 → Latent 🟩 ↳ Multiply the patches with weights and biases, followed by ReLU ↳ The result is a latent feature vector per patch ↳ The purpose is dimension reduction from 4 (2x2x1) to 2 (2x1). ↳ In the paper, the reduction is 196,608 (256x256x3)→ 4096 (32x32x4) [4] ⬛ Add Noise ↳ Sample a noise according to the diffusion time step t. Typically, the larger the t, the smaller the noise. ↳ Add the Sampled Noise to latent features to obtain Noised Latent. ↳ The goal is to purposely add noise to a video and ask the model to guess what that noise is. ↳ This is analogous to training a language model by purposely deleting a word in a sentence and ask the model to guess what the deleted word was. [5-7] 🟪 Conditioning by Adaptive Layer Norm [5] Encode Conditions ↳ Encode "sora is sky" into a text embedding vector [0,1,-1]. ↳ Encode t = 3 to as a binary vector [1,1]. ↳ Concatenate the two vectors in to a 5D column vector. [6] Estimate Scale/Shift ↳ Multiply the combined vector with weights and biases ↳ The goal is to estimate the scale [2,-1] and shift [-1,5]. ↳ Copy the result to (X) and (+) [7] Apply Scale/Sift ↳ Scale the noised latent by [2,-1] ↳ Shifted the scaled noised latent by [-1, 5] ↳ The result is "conditioned" noise latent. [8-10] Transformer [8] Self-Attention ↳ Feed the conditioned noised latent to Query-Key function to obtain a self-attention matrix ↳ Value is omitted for simplicity [9] Attention Pooling ↳ Multiply the conditioned noised latent with the self-attention matrix ↳ The result are attention weighted features [10] Pointwise Feed Forward Network ↳ Multiply the attention weighted features with weights and biases ↳ The result is the Predicted Noise 🏋️‍♂️ 𝗧𝗿𝗮𝗶𝗻 [11] ↳ Calculate MSE loss gradients by taking the different between the Predicted Noise and the Sampled Noise (ground truth). ↳ Use the loss gradients to kick off backpropagation to update all learnable parameters (red borders) ↳ Note the visual encoder and decoder's parameters are frozen (blue borders) 🎨 𝗚𝗲𝗻𝗲𝗿𝗮𝘁𝗲 (𝗦𝗮𝗺𝗽𝗹𝗲) [12] Denoise ↳ Subtract the predicted noise from the noised latent to obtain the noise-free latent [13] Visual Decoder: Latent 🟩 → Pixels 🟨 ↳ Multiply the patches with weights and biases, followed by ReLU [14] Patches → Video ↳ Rearrange patches into a sequence of video frames.

Tom Yeh

238,065 görüntüleme • 2 yıl önce