Tom Yeh

@ProfTomYeh • 55,807 subscribers

CS Prof | AI by Hand ✍️ | @CUBoulder

Shorts

At MIT, the only course I ever dropped was signal processing. The DFT math was too intimidating. It’s so easy to just type fft() in MATLAB and move on. Years later, I finally did DFT by hand. ✍️ If you are also afraid of DFT, I hope this helps! ⬇️ Download:

260,027 просмотров

$Softmax vs Sigmoid ✍️ Interact 👉 = Softmax = Softmax is how deep networks turn raw scores into a probability distribution — the final layer of every classifier, and the core of every attention head in a transformer. To see what it does, picture five boba tea shops on the same block, all competing for your dollar. Five candidates: a, b, c, d, e — different chains, different brewing styles, different pearls. A boba reviewer hands you a 𝘤𝘩𝘦𝘸𝘪𝘯𝘦𝘴𝘴 𝘴𝘤𝘰𝘳𝘦 for each — higher means perfectly chewy "QQ" pearls with the right bite (ask a Taiwanese friend to find out what QQ means). Negative scores are real: mushy bobas, overcooked pearls, a batch left sitting too long. How do you turn five chewiness scores into an allocation that adds to a whole dollar? You could spend everything at the chewiest shop, but that ignores how good the runners-up are. Softmax is the smooth alternative. Read the diagram left to right. First, raise each score to e^{x} — this does two things: it turns negative chewiness into small positives, and it stretches the gaps between scores exponentially. Then sum all five into a single total Z. Finally, divide each e^{x} by Z to get a probability. The five probabilities add up to one, so you can read them as percentages of your dollar. The chewiest shop gets the biggest slice — but never the whole dollar. That's the point of softmax: it ranks confidently while still leaving room for the others. = Sigmoid = Sigmoid squashes any real number into a probability between 0 and 1 — the classic activation for binary classification, and still the gating function inside LSTMs and GRUs. Same boba block as the previous Softmax example, narrowed to just two contenders — a hot new shop `a` with chewiness score x, and your usual go-to `b` whose score is pinned at zero (the neutral baseline you've come to expect). Sigmoid is just softmax with two players, one of them pinned to zero. Read the diagram left to right. First, raise each score to e^{x} — for the usual shop `b` whose score is zero, this is just e^0 = 1 (the constant baseline). Then sum the two into a total Z. Finally, divide each e^{x} by Z to get a probability. The two probabilities add up to one — the new shop wins more of your dollar when its pearls get chewier, and your usual keeps the rest. That's the point of sigmoid: it turns a single chewiness score into a clean 0-to-1 chance you'll try the new place over your usual. --- AI Math, Algorithms, Architectures by hand ✍️ Subscribe to my 60K+ reader newsletter 👉$

Softmax vs Sigmoid ✍️ Interact 👉 = Softmax = Softmax is how deep networks turn raw scores into a probability distribution — the final layer of every classifier, and the core of every attention head in a transformer. To see what it does, picture five boba tea shops on the same block, all competing for your dollar. Five candidates: a, b, c, d, e — different chains, different brewing styles, different pearls. A boba reviewer hands you a 𝘤𝘩𝘦𝘸𝘪𝘯𝘦𝘴𝘴 𝘴𝘤𝘰𝘳𝘦 for each — higher means perfectly chewy "QQ" pearls with the right bite (ask a Taiwanese friend to find out what QQ means). Negative scores are real: mushy bobas, overcooked pearls, a batch left sitting too long. How do you turn five chewiness scores into an allocation that adds to a whole dollar? You could spend everything at the chewiest shop, but that ignores how good the runners-up are. Softmax is the smooth alternative. Read the diagram left to right. First, raise each score to e^{x} — this does two things: it turns negative chewiness into small positives, and it stretches the gaps between scores exponentially. Then sum all five into a single total Z. Finally, divide each e^{x} by Z to get a probability. The five probabilities add up to one, so you can read them as percentages of your dollar. The chewiest shop gets the biggest slice — but never the whole dollar. That's the point of softmax: it ranks confidently while still leaving room for the others. = Sigmoid = Sigmoid squashes any real number into a probability between 0 and 1 — the classic activation for binary classification, and still the gating function inside LSTMs and GRUs. Same boba block as the previous Softmax example, narrowed to just two contenders — a hot new shop `a` with chewiness score x, and your usual go-to `b` whose score is pinned at zero (the neutral baseline you've come to expect). Sigmoid is just softmax with two players, one of them pinned to zero. Read the diagram left to right. First, raise each score to e^{x} — for the usual shop `b` whose score is zero, this is just e^0 = 1 (the constant baseline). Then sum the two into a total Z. Finally, divide each e^{x} by Z to get a probability. The two probabilities add up to one — the new shop wins more of your dollar when its pearls get chewier, and your usual keeps the rest. That's the point of sigmoid: it turns a single chewiness score into a clean 0-to-1 chance you'll try the new place over your usual. --- AI Math, Algorithms, Architectures by hand ✍️ Subscribe to my 60K+ reader newsletter 👉

73,787 просмотров

At MIT, I learned about RNNs in my NLP class with Prof. Michael Collins. He built a model from my keystrokes to predict who I was. To me, it felt like a magic box. Years later, when I had to teach RNNs, I forced myself to go inside the box. ⬇️ Download: First, with a tiny example on paper by hand ✍️. Then, a slightly larger one in Excel. That’s when it finally clicked: 👉 The weights are reused (weight matrices on the left side) 👉 The hidden states are passed down (H's) When I built it by hand and saw everything visually, it clicked, just math you can actually trace. Now I try to give others the same “aha!” moment I had. ⬇️ Download Excel:

220,709 просмотров

Self Attention vs Cross Attention by hand ✍️ Resize the matrices yourself 👉 Two attention mechanisms, side by side. Both project X into queries; both compute attention via S = Kᵀ × Q and F = V × A. The only difference is the source of K and V. Self attention uses X for everything. Q, K, and V all come from projecting X. Each X token attends to every other X token. The score matrix S is square — 128 × 128. Cross attention uses X for queries and a second sequence E for keys and values. Each X token attends to every E token instead. The score matrix S is rectangular — 64 × 128. Notice what's shared and what's not: X is the same in both — same 36 × 128 input. Q and K share the 16 dimension — that's what makes the dot product Kᵀ × Q valid in either case. V dimensions are independent: self-attention uses 12, cross-attention uses 12. The choice doesn't depend on which mechanism you're using; it depends on what output dimension your downstream layer expects.

61,300 просмотров

mHC from DeepSeek. I implemented it in Excel for my Frontier AI Seminar. What is mHC? mHC stands for Manifold-Constrained Hyper Connections, published just a few days ago. This paper has quickly become the "first paper to read in 2026" for many in the community. Here's the gist: Residual Network = Neural Network + Skip Connections Hyper Connections = 1 skip connection per block to N weighted skip connections per block Manifold-Constrained = Whatever weight values to weights constrained to sum of 1 I always appreciate the DeepSeek team for releasing their work openly and quickly. That said, this paper introduces significantly more terminology than their earlier papers, and I’m not entirely sure who the intended audience is. Just to name a few: - Residual stream - Sinkhorn–Knopp algorithm - Entropically projected matrices - Birkhoff polytope - Doubly stochastic matrices 🤯 It’s a good reminder that open source is not the same as open knowledge—which is why I’d like to unpack mHC by hand, in Excel, in the next Frontier AI Seminar.

93,991 просмотров

I’ve implemented SVMs three times in my life — first in MATLAB as a grad student, again in C++ during my postdoc, and finally by hand ✍️ as a professor who needed to truly understand it well enough to teach it. If you’re curious, here’s my full 19-step walkthrough: ---- 100% original, made by hand ✍️ Join 52K+ readers of my newsletter:

103,824 просмотров

Single vs Multi-hand Attention by hand ✍️ Resize matrices yourself 👉 The most important fact about multi-head attention: it has the same parameter count as single-head attention. The difference is purely structural — same total Wqkv weights, partitioned into smaller q–k–v triples. Look at the two diagrams below. Both Wqkv matrices have the same height — same number of weight rows, same number of parameters. What changes is how that single tall block is sliced. • Left. One head. The full Wqkv produces one big QKV: a tall Q (36 rows), a tall K, a tall V. One scoring computation runs over those full-width tensors. • Right. 3 heads. The same-height Wqkv is sliced into 3 smaller q–k–v triples — each 12 rows tall. 3 scoring computations run in parallel, each a thinner version of the left. The compute trade-off — kind of. Same Wqkv weights. Multi-head runs the attention scoring S = Kᵀ × Q once per head, so the dot-product count multiplies by H. • Single-head: seq × seq = 40² = 1600 dot products • Multi-head: seq × seq × H = 40² × 3 = 4800 dot products (3×) But each multi-head dot product is narrower — its inner dimension is head_dim instead of H × head_dim. So when you count actual scalar multiplications, the totals are equal: • Single-head: seq² × (H × head_dim) = 40² × 36 = 57600 • Multi-head: seq² × H × head_dim = 40² × 3 × 12 = 57600 Same FLOPs. Multi-head buys you H independent attention patterns at no extra weight cost and no extra arithmetic cost — it's the same total compute, sliced into H finer-grained heads.

35,448 просмотров

$ReLU vs Leaky ReLU 👉 = ReLU = ReLU is the default activation in modern deep learning — cheap to compute, and stable enough to train networks hundreds of layers deep. To see what it does, picture five boba tea shops on the same block — 𝚊, 𝚋, 𝚌, 𝚍, 𝚎 — each running their own books. Each value is a shop's monthly profit — receipts minus rent, ingredients, and wages. When profit is positive, the shop stays open and the owner pockets every dollar. When profit turns negative, the shop runs out of cash and shutters — the lights go off, the books are wiped to zero. ReLU is exactly that rule, applied one shop at a time. Read the diagram left to right. The first column is the raw value x — each shop's profit at month's end. The second column is the gate: 1 if the shop is open (x > 0), 0 if it has shuttered. The last column is the ReLU output: open shops pass their profit through untouched, while shuttered ones are zeroed out. Five rows means five parallel shops on the same block, each evaluated independently. That's why ReLU is called an element-wise activation: every neuron decides its own fate. = LeakyRelu = Plain ReLU wipes negative values to zero — clean, but a shop that shutters can never recover, since both its output and its gradient stay pinned at zero. This is the dying ReLU problem, and in deep networks it can quietly kill a meaningful fraction of the units. Leaky ReLU is the one-line fix: instead of shuttering, the shop files for Chapter 11 protection and keeps the lights on at reduced capacity. Its debt is restructured down to a fraction α (typically 0.1) — the rest is forgiven, and the shop is wounded, not killed. A small negative signal still flows through, so the gradient survives, and the shop can crawl back to life if a TikTok goes viral. Read the diagram left to right. The first column is the raw value x — each shop's profit at month's end. The second column is the leakage α — the fraction of the loss held over after restructuring (default 0.1, editable). The third column is the gate: 1 for shops still in the black, α for those operating under bankruptcy protection. The last column is the Leaky ReLU output: y = x · gate. Profitable shops pass through untouched; struggling ones shrink by a factor of α but still carry a sign. Five rows means five parallel shops, each evaluated independently. Like ReLU, this is an element-wise activation: every neuron's fate is decided on its own merits. #aibyhahd$

ReLU vs Leaky ReLU 👉 = ReLU = ReLU is the default activation in modern deep learning — cheap to compute, and stable enough to train networks hundreds of layers deep. To see what it does, picture five boba tea shops on the same block — 𝚊, 𝚋, 𝚌, 𝚍, 𝚎 — each running their own books. Each value is a shop's monthly profit — receipts minus rent, ingredients, and wages. When profit is positive, the shop stays open and the owner pockets every dollar. When profit turns negative, the shop runs out of cash and shutters — the lights go off, the books are wiped to zero. ReLU is exactly that rule, applied one shop at a time. Read the diagram left to right. The first column is the raw value x — each shop's profit at month's end. The second column is the gate: 1 if the shop is open (x > 0), 0 if it has shuttered. The last column is the ReLU output: open shops pass their profit through untouched, while shuttered ones are zeroed out. Five rows means five parallel shops on the same block, each evaluated independently. That's why ReLU is called an element-wise activation: every neuron decides its own fate. = LeakyRelu = Plain ReLU wipes negative values to zero — clean, but a shop that shutters can never recover, since both its output and its gradient stay pinned at zero. This is the dying ReLU problem, and in deep networks it can quietly kill a meaningful fraction of the units. Leaky ReLU is the one-line fix: instead of shuttering, the shop files for Chapter 11 protection and keeps the lights on at reduced capacity. Its debt is restructured down to a fraction α (typically 0.1) — the rest is forgiven, and the shop is wounded, not killed. A small negative signal still flows through, so the gradient survives, and the shop can crawl back to life if a TikTok goes viral. Read the diagram left to right. The first column is the raw value x — each shop's profit at month's end. The second column is the leakage α — the fraction of the loss held over after restructuring (default 0.1, editable). The third column is the gate: 1 for shops still in the black, α for those operating under bankruptcy protection. The last column is the Leaky ReLU output: y = x · gate. Profitable shops pass through untouched; struggling ones shrink by a factor of α but still carry a sign. Five rows means five parallel shops, each evaluated independently. Like ReLU, this is an element-wise activation: every neuron's fate is decided on its own merits. #aibyhahd

32,165 просмотров

Vector Database by Hand ✍️ Vector databases are revolutionizing how we search and analyze complex data. They have become the backbone of Retrieval Augmented Generation (#RAG). How do vector databases work? [1] Given ↳ A dataset of three sentences, each has 3 words (or tokens) ↳ In practice, a dataset may contain millions or billions of sentences. The max number of tokens may be tens of thousands (e.g., 32,768 mistral-7b). Process "how are you" [2] 🟨 Word Embeddings ↳ For each word, look up corresponding word embedding vector from a table of 22 vectors, where 22 is the vocabulary size. ↳ In practice, the vocabulary size can be tens of thousands. The word embedding dimensions are in the thousands (e.g., 1024, 4096) [3] 🟩 Encoding ↳ Feed the sequence of word embeddings to an encoder to obtain a sequence of feature vectors, one per word. ↳ Here, the encoder is a simple one layer perceptron (linear layer + ReLU) ↳ In practice, the encoder is a transformer or one of its many variants. [4] 🟩 Mean Pooling ↳ Merge the sequence of feature vectors into a single vector using "mean pooling" which is to average across the columns. ↳ The result is a single vector. We often call it "text embeddings" or "sentence embeddings." ↳ Other pooling techniques are possible, such as CLS. But mean pooling is the most common. [5] 🟦 Indexing ↳ Reduce the dimensions of the text embedding vector by a projection matrix. The reduction rate is 50% (4->2). ↳ In practice, the values in this projection matrix is much more random. ↳ The purpose is similar to that of hashing, which is to obtain a short representation to allow faster comparison and retrieval. ↳ The resulting dimension-reduced index vector is saved in the vector storage. [6] Process "who are you" ↳ Repeat [2]-[5] [7] Process "who am I" ↳ Repeat [2]-[5] Now we have indexed our dataset in the vector database. [8] 🟥 Query: "am I you" ↳ Repeat [2]-[5] ↳ The result is a 2-d query vector. [9] 🟥 Dot Products ↳ Take dot product between the query vector and database vectors. They are all 2-d. ↳ The purpose is to use dot product to estimate similarity. ↳ By transposing the query vector, this step becomes a matrix multiplication. [10] 🟥 Nearest Neighbor ↳ Find the largest dot product by linear scan. ↳ The sentence with the highest dot product is "who am I" ↳ In practice, because scanning billions of vectors is slow, we use an Approximate Nearest Neighbor (ANN) algorithm like the Hierarchical Navigable Small Worlds (HNSW).

191,994 просмотров

Autoencoder by hand✍️Excel~ I designed this exercise to show how an Encoder-Decoder network convert input to code and reconstruct input from code. It is annotated with equations, PyTorch, and graphs. 👇Join the 'AI Math' community. Download xlsx.

101,555 просмотров

Full Fine-tuning vs. Freezing Layers. Interact 👉 and == Full Fine-tuning == A real network has many — three layers in this example, billions of parameters in a production model. What does fine-tuning look like when you update all of them? That’s full fine-tuning: continue training every weight in the pretrained network on your new task. Every layer’s W gets its own ΔW. Nothing is frozen — every parameter is in play. Think of an MLP as a chain of prerequisites leading to an advanced course. Layer 1 might be Linear Algebra, layer 2 Probability, layer 3 Advanced Machine Learning — each one building on what came before. Fine-tuning is what happens during graduate study: the foundations are already there from undergrad, so you’re not re-learning. Full fine-tuning is reviewing every prerequisite to see what new topics have appeared and what discoveries the field has made since the last time you sat through them. Effective — but exhausting. This diagram shows the same three-layer MLP twice, side by side. On the left, the pretrained network runs on input X: three weight matrices W₁, W₂, W₃, each followed by a ReLU activation. Full fine-tuning gives the model the most freedom to specialize. Every parameter can move — and every parameter that can move must be stored. But not every prerequisite needs revisiting. The further you go back in the chain, the less the material has changed since pretraining — the linear-algebra basics under your computer-vision course are largely the same as they ever were. The next page does exactly that: freeze the prerequisites that haven’t moved, and only refresh the advanced one closest to your specialization. == Freezing Layers == Full fine-tuning reviewed every prerequisite — Linear Algebra, Probability, Advanced ML — to refresh each subject with the latest topics. Effective, but exhausting. Then you realize something. The prerequisites haven’t actually changed that much. Linear Algebra is still Linear Algebra; the matrix decompositions you learned still hold. Probability is still Probability; the distributions and Bayes’ rule haven’t moved. Almost all the new material — the new ideas, the recent discoveries — lives in the advanced layer at the top. That’s freezing layers: keep the prerequisite layers fixed at their pretrained state, and only update the advanced one. In the diagram below, W1 and W2 — the foundational prerequisites — stay frozen. Only W3 — the layer closest to your task-specific output — gets a ΔW.

27,225 просмотров

Backpropagation by hand✍️ ~ spreadsheet. I designed this exercise to show it is possible to calculate backpropagation for a non-trivial, three-layer network by hand. p.s. I just started this community to share useful resources on AI math. 👇 Join the community.

121,825 просмотров

I built MatmulFlow ( — an interactive tool that makes matrix multiplication dimensions visual, part of my AI by Hand ✍️ series. Matrix multiplication dimensions are confusing. Which is the inner dimension? Columns of the first or rows of the second? And when you chain five multiplications together, it gets worse. The idea: represent matrices as rectangles. Shift the second matrix up and to the right. The edges that must align become obvious. The result fills in the remaining space. No memorization. You can see it. It extends to chains. Stack vertically for left-multiplication. Stack horizontally for right-multiplication. Resize any matrix and watch the dimensions "flow" through the entire chain. Give it a try!

26,269 просмотров

AlphaFold by hand✍️ Excel ~ I designed this exercise to show (1) MSA multi-head attention, (2) Pair triangular update, two key components of the EvoFormer architecture.👇Join the AI Math community. Download xlsx.

104,990 просмотров

[Discrete Fourier Transform] by Hand ✍️ In signal processing, the Discrete Fourier Transform (DFT) is no doubt the most important method. But the math involved is extremely complex, literally, involving a summation over a complex number term e^(-iwt). I developed this exercise to demonstrate that underneath such complexity, DFT is just a series of matrix multiplications you can calculate by hand. ✍️ Once you see that, it should not surprise you that a deep neural network, which is also a series of matrix multiplications, with activation functions in-between, can learn to perform DFT to process and analyze signals so effectively. How does DFT work? [1] Given ↳ Signals A, B, and C in the 🟧 frequency domain: ◦ A = cos(w) + 2cos(2w) ◦ B = cos(w) + cos(3w) + cos(4w) ◦ C = -cos(2w) + cos(3w) ◦ Each signal is a weighed sum of four cosine waves at frequencies 1w, 2w, 3w, and 4w. ◦ We will apply Inverse DFT to convert the signals to time domain representations, and then demonstrate DFT can convert back to their original frequency domain representations. ↳ Signal X in the 🟩 time domain. X is sampled at 10 time points 1t, 2t, …, 10t: ◦ X = [-2.5, -1.8, 3, -0.7, -1.0, -0.7, 3, -1.8, -2.5, 5] ◦ Suppose X is also a weighted sum of the same four cosine waves, but we don’t already know their weights. We will apply DFT to discover them. [2] 🟧 Frequency Matrix (F) ↳ Write the coefficients of A, B, C as a matrix F. Each signal is a row. Each frequency is a column. ↳ A → [1, 2, 0, 0] ↳ B → [1, 0, 1, 1] ↳ C → [0, 1-, 1, 0] [3] Cosine → Discrete ↳ Sample from the continuous cosine waves at discrete time points 1t, 2t, 3t, to 10t. [4] Cosine Matrix (W) ↳ Write the samples as a matrix, Each frequency is a row. Each time point is a column. [5] Inverse DFT: 🟧 Frequency → 🟩 Time ↳ Multiply the frequency matrix F and the cosine matrix W. ↳ The meaning of this multiplication is to linearly combine the four cosine waves (rows in W) into time-domain signals (rows in T) using the weights specified in F. ↳ The result is matrix T, which are signals A, B, C converted to the time domain. Each signal is a row. Each time point is a column. [6] Transpose ↳ Transpose T, converting each signal’s time domain representation from a row to a column. [7] DFT: 🟩 Time → 🟧 Frequency ↳ Multiply the cosine matrix W with the transpose of matrix T. ↳ The purpose of this multiplication is to take a dot-product between each time-domain signal (columns in the transpose of T) and each cosine wave (rows in W), which has the effect of projecting the signal onto a cosine wave to determine how much they are correlated. Zero means not correlated at all. ↳ The result is an intermediate version of the “recovered” frequency matrix where each column corresponds to a signal and each row corresponds to a frequency. ↳ Compared to the original frequency matrix F, this intermediate matrix has non-zero weights in the correct places, but scaled up by a factor of 5 (n/2, n=10). For example, signal A, originally [1,2,0,0], is recovered at [5,10,0,0]. [8] Scale ↳ Multiply each value by 2/n = 1/5 to scale down the intermediate matrix to match the magnitude of the original frequency matrix F. [9] Transpose ↳ Transpose the recovered frequency matrix back to the same orientation of the original frequency matrix F. ↳ Like magic 🪄, the result is identical to the original F, which means DFT successfully recovered the frequency components of signals A, B, C. [10] Apply DFT to X: 🟩 Time → 🟧 Frequency ↳ Now that we have some confidence in DFT’s ability to recover frequency components, we apply DFT to X’s time-domain representation by multiplying W with X. ↳ The result is the an intermediate matrix. [11] Scale ↳ Similarly, we scale down by a factor of 5 to obtain the recovered frequency components of X (a column). [12] Transpose ↳ Similarly, we transpose the recovered column to row to match the orientation of the frequency matrix. ↳ Using the coefficients [0,0,3,2], we can write the equation of X as 3cos(3w) + 2cos(4w). Notes: I hope this by hand exercise helps you understand the essence of DFT. But there is more technical details, such as: • Sine: The complete DFT math also includes sine waves that follow a similar calculation process. • Phase: Here, we assume all the cosine waves are aligned at the origin, namely, phase is 0. If a phase p is added, for example, cos(w+p), we will need to calculate the sine component and use their ratio to figure out what p is. • Magnitude: If phase is not zero, the magnitude will need to be calculated by combining both cosine and sine terms.

116,622 просмотров

Transformer by hand✍️ Excel ~ I designed this exercise to show the core math of a Transformer model is to combine columns (attention), combine rows (feed forward), and repeat.👇Join the 'AI Math' community. 👇Download xlsx.

66,770 просмотров

[LSTM] by Hand ✍️ LSTMs have been the most effective architecture to process long sequences of data, until our world was taken over by the Transformers. LSTMs belong to the broader family of recurrent neural network (RNNs) that process data sequentially in a recurrent manner. Transformers, on the other hand, abandon recurrence and use self-attention instead to process data concurrently in parallel. Recently, there is renewed interest in recurrence as people realized self-attention doesn’t scale to extremely long sequences, like hundreds of thousands of tokens. Mamba is a good example to bring back recurrence. All of a sudden, it is cool to study LSTMs. How do LSTMs work? [1] Given ↳ 🟨 Input sequence X1, X2, X3 (d = 3) ↳ 🟩 Hidden state h (d = 2) ↳ 🟦 Memory C (d = 2) ↳ Weight matrices Wf, Wc, Wi, Wo Process t = 1 [2] Initialize ↳ Randomly set the previous hidden state h0 to [1, 1] and memory cells C0 to [0.3, -0.5] [3] Linear Transform ↳ Multiply the four weight matrices with the concatenation of current input (X1) and the previous hidden state (h0). ↳ The results are feature values, each is a linear combination of the current input and hidden state. [4] Non-linear Transform ↳ Apply sigmoid σ to obtain gate values (between 0 and 1). • Forget gate (f1): [-4, -6] → [0, 0] • Input gate (i1): [6, 4] → [1, 1] • Output gate (o1): [4, -5] → [1, 0] ↳ Apply tanh to obtain candidate memory values (between -1 and 1) • Candidate memory (C’1): [1, -6] → [0.8, -1] [5] Update Memory ↳ Forget (C0 .* f1): Element-wise multiply the current memory with forget gate values. ↳ Input (C’1 .* o1): Element-wise multiply the “candidate” memory with input gate values. ↳ Update the memory to C1 by adding the two terms above: C0 .* f1 + C’1 .* o1 = C1 [6] Candiate Output ↳ Apply tanh to the new memory C1 to obtain candidate output o’1. [0.8, -1] → [0.7, -0.8] [7] Update Hidden State ↳ Output (o’1 .* o1 → h1): Element-wise multiply the candidate output with the output gate. ↳ The result is updated hidden state h1 ↳ Also, it is the first output. Process t = 2 [8] Initialize ↳ Copy previous hidden state h1 and memory C1 [9] Linear Transform ↳ Repeat [3] [10] Update Memory (C2) ↳ Repeat [4] and [5] [11] Update Hidden State (h2) ↳ Repeat [6] and [7] Process t = 3 [12] Initialize ↳ Copy previous hidden state h2 and memory C2 [13] Linear Transform ↳ Repeat [3] [14] Update Memory (C3) ↳ Repeat [4] and [5] [15] Update Hidden State (h3) ↳ Repeat [6] and [7]

72,891 просмотров

$[Backpropagation] by Hand✍️ [1] Forward Pass ↳ Given a multi layer perceptron (3 levels), an input vector X, predictions Y^{Pred} = [0.5, 0.5, 0], and ground truth label Y^{Target} = [0, 1, 0]. [2] Backpropagation ↳ Insert cells to hold our calculations. [3] Layer 3 - Softmax (blue) ↳ Calculate ∂L / ∂z3 directly using the simple equation: Y^{Pred} - Y^{Target} = [0.5, -0.5, 0]. ↳ This simple equation is the benefit of using Softmax and Cross Entropy Loss together. [4] Layer 3 - Weights (orange) & Biases (black) ↳ Calculate ∂L / ∂W3 and ∂L / ∂b3 by multiplying ∂L / ∂z3 and [ a2 | 1 ]. [5] Layer 2 - Activations (green) ↳ Calculate ∂L / ∂a2 by multiplying ∂L / ∂z3 and W3. [6] Layer 2 - ReLU (blue) ↳ Calculate ∂L / ∂z2 by multiplying ∂L / ∂a2 with 1 for positive values and 0 otherwise. [7] Layer 2 - Weights (orange) & Biases (black) ↳ Calculate ∂L / ∂W2 and ∂L / ∂b2 by multiplying ∂L / ∂z2 and [ a1 | 1 ]. [8] Layer 1 - Activations (green) ↳ Calculate ∂L / ∂a1 by multiplying ∂L / ∂z2 and W2. [9] Layer 1 - ReLU (blue) ↳ Calculate ∂L / ∂z1 by multiplying ∂L / ∂a1 with 1 for positive values and 0 otherwise. [10] Layer 1 - Weights (orange) & Biases (black) ↳ Calculate ∂L / ∂W1 and ∂L / ∂b1 by multiplying ∂L / ∂z1 and [ x | 1 ]. [11] Gradient Descent ↳ Update weights and biases (typically a learning rate is applied here). 💡 Matrix Multiplication is All You Need: Just like in the forward pass, backpropagation is all about matrix multiplications. You can definitely do everything by hand as I demonstrated in this exercise, albeit slow and imperfect. This is why GPU's ability to multiply matrices efficiently plays such an important role in the deep learning evolution. This is why NVIDIA is now close to $1 trillion in valuation. 💡Exploding Gradients: We can already see the gradients are getting larger as we back-propagate up, even in this simple 3-layer network. This motivates using methods like skip connections to handle exploding (or diminishing) gradients as in the ResNet. I did the calculations entirely by hand. Please let me know if you spot any error or have any questions!$

[Backpropagation] by Hand✍️ [1] Forward Pass ↳ Given a multi layer perceptron (3 levels), an input vector X, predictions Y^{Pred} = [0.5, 0.5, 0], and ground truth label Y^{Target} = [0, 1, 0]. [2] Backpropagation ↳ Insert cells to hold our calculations. [3] Layer 3 - Softmax (blue) ↳ Calculate ∂L / ∂z3 directly using the simple equation: Y^{Pred} - Y^{Target} = [0.5, -0.5, 0]. ↳ This simple equation is the benefit of using Softmax and Cross Entropy Loss together. [4] Layer 3 - Weights (orange) & Biases (black) ↳ Calculate ∂L / ∂W3 and ∂L / ∂b3 by multiplying ∂L / ∂z3 and [ a2 | 1 ]. [5] Layer 2 - Activations (green) ↳ Calculate ∂L / ∂a2 by multiplying ∂L / ∂z3 and W3. [6] Layer 2 - ReLU (blue) ↳ Calculate ∂L / ∂z2 by multiplying ∂L / ∂a2 with 1 for positive values and 0 otherwise. [7] Layer 2 - Weights (orange) & Biases (black) ↳ Calculate ∂L / ∂W2 and ∂L / ∂b2 by multiplying ∂L / ∂z2 and [ a1 | 1 ]. [8] Layer 1 - Activations (green) ↳ Calculate ∂L / ∂a1 by multiplying ∂L / ∂z2 and W2. [9] Layer 1 - ReLU (blue) ↳ Calculate ∂L / ∂z1 by multiplying ∂L / ∂a1 with 1 for positive values and 0 otherwise. [10] Layer 1 - Weights (orange) & Biases (black) ↳ Calculate ∂L / ∂W1 and ∂L / ∂b1 by multiplying ∂L / ∂z1 and [ x | 1 ]. [11] Gradient Descent ↳ Update weights and biases (typically a learning rate is applied here). 💡 Matrix Multiplication is All You Need: Just like in the forward pass, backpropagation is all about matrix multiplications. You can definitely do everything by hand as I demonstrated in this exercise, albeit slow and imperfect. This is why GPU's ability to multiply matrices efficiently plays such an important role in the deep learning evolution. This is why NVIDIA is now close to $1 trillion in valuation. 💡Exploding Gradients: We can already see the gradients are getting larger as we back-propagate up, even in this simple 3-layer network. This motivates using methods like skip connections to handle exploding (or diminishing) gradients as in the ResNet. I did the calculations entirely by hand. Please let me know if you spot any error or have any questions!

64,645 просмотров

Autoencoder by hand✍️Excel~ I designed this exercise to show how an Encoder-Decoder network convert input to code and reconstruct input from code. It is annotated with equations, PyTorch, and graphs. I also made a medium version.👇Join the 'AI Math' community. Download xlsx.

54,482 просмотров

Retrieval Weighting RAG by hand ✍️ + Langflow . I am designing a series of exercises to teach advanced RAG techniques. This is No. 5. Previous exercises are in the comment.

48,408 просмотров

Videos

LIVE

1.2k

Anya Rossi

sweetdream.ai

SweetDream.ai•Sponsored•Livecam

Streaming Now

Watch Anya Live

Anya is streaming live right now! Join her private show and enjoy exclusive content.

HD live stream

Exclusive private shows

1.2k viewers online

Current Status

Live

Private Show

Join now for exclusive access

Free preview available • Premium content

Live Cam

Tom Yeh

Shorts

At MIT, the only course I ever dropped was signal processing. The DFT math was too intimidating. It’s so easy to just type fft() in MATLAB and move on. Years later, I finally did DFT by hand. ✍️ If you are also afraid of DFT, I hope this helps! ⬇️ Download:

Autoencoder by hand✍️Excel~ I designed this exercise to show how an Encoder-Decoder network convert input to code and reconstruct input from code. It is annotated with equations, PyTorch, and graphs. 👇Join the 'AI Math' community. Download xlsx.

Backpropagation by hand✍️ ~ spreadsheet. I designed this exercise to show it is possible to calculate backpropagation for a non-trivial, three-layer network by hand. p.s. I just started this community to share useful resources on AI math. 👇 Join the community.

AlphaFold by hand✍️ Excel ~ I designed this exercise to show (1) MSA multi-head attention, (2) Pair triangular update, two key components of the EvoFormer architecture.👇Join the AI Math community. Download xlsx.

Transformer by hand✍️ Excel ~ I designed this exercise to show the core math of a Transformer model is to combine columns (attention), combine rows (feed forward), and repeat.👇Join the 'AI Math' community. 👇Download xlsx.

Autoencoder by hand✍️Excel~ I designed this exercise to show how an Encoder-Decoder network convert input to code and reconstruct input from code. It is annotated with equations, PyTorch, and graphs. I also made a medium version.👇Join the 'AI Math' community. Download xlsx.

Retrieval Weighting RAG by hand ✍️ + Langflow . I am designing a series of exercises to teach advanced RAG techniques. This is No. 5. Previous exercises are in the comment.

Videos

Watch Anya Live

Transformer by hand ✍️ in Excel ~ I just released my first-ever "Full-Stack" implementation of the Transformer model. 👇Download xlsx to give it a try!

I flipped through 300 pages of deep learning math puzzles by hand ✍️ This is the proof copy of the "Deep Learning Math Workbook" I made to help you learn and practice 12 foundational math concepts below👇

My students built five models during the final exam. This solution was submitted by Shih-Yu Ma (Shih-Yu Ma (馬世宇)). He finished in 97 minutes with 221 edits. The five models are: 👇

I was an English-as-Second-Language learner when I moved to Canada with my family many years ago. I remember doing endless fill-in-the-blank exercises to practice English. Deep Learning Math is also a language. So I thought: why not use the same method to practice this math language? See more 👉

DeepSeek by hand ✍️ in Excel. You can explore and study in your web browser 👉

Self-Attention by hand ✍️ Excel ~ I designed this exercise for students to practice the QKV math. I also created a medium and a large version to show how the attention matrix grows quadratically as the sequence gets longer. 👇Join the 'AI Math' community. Download xlsx.

I calculated a poster-size Transformer model by hand✍️

Matrix Multiplication on GPU/TPU by hand ✍️ I drew 91 frames to show how to divide large matrices into "tiles" to use accelerators such as GPU/TPU. If you like to learn more, check out our weekly workshops. Link 👇

12 deep learning architectures to study in 2025, by hand✍️

Anna Rahn's video demonstration of Multi-Layer Perceptron (MLP) by hand ✍️ Anna made this short video for my computer vision course in the spring semester. Together with the spreadsheet exercise I posted yesterday, I hope they help more people understand deep neural networks!

Can you calculate a Vision Transformer (ViT) by hand? ✍️ If you can, come join me next week at the LLM Paper Club. I will extend ViT to Llama 1 -&gt; 2 -&gt; 3 -&gt; 4 live. Event link below.

How to multiply two matrices on a GPU or TPU by hand ✍️? Learn more at my keynote lecture next week 👉 (hosted by Together AI) I drew 91 frames in total to create this animation. 💪

Can you calculate a Vision Transformer (ViT) by hand? ✍️ If you can, come join me next week at the LLM Paper Club. I will extend ViT to Llama 1 -> 2 -> 3 -> 4 live. Event link below.