
Tom Yeh
@ProfTomYeh • 55,997 subscribers
CS Prof | AI by Hand ✍️ | @CUBoulder

In 2024, only a handful of people solved the 300 deep learning math puzzles I developed. Reginaldo Cunha was among the first. In 2025, many of you who are reading this are among the first thousand to do the same thing. Congrats! 🙌 In 2026, I believe your stories will inspire millions more to learn AI by hand! ✍️
Tom Yeh • 156,683 views • 6 months ago

I still remember back in grad school. My friend in NLP used to show off, bragging that he had LSTM all figured out. I envied him. Fortunately, my field was Computer Vision. I could survive just knowing my SVMs. In 2024, the inventor of LSTM himself is finally back with the extension: xLSTM. Here's my Excel implementation. Not for the faint of heart. Download: I guess I can brag to my NLP friend now. 😉
Tom Yeh • 110,698 views • 7 months ago

llm.c by Hand ✍️ C programming + matrix multiplication by hand. This combination is perhaps as low as we can get to explain how the Transformer works. Special thanks to Andrej Karpathy for encouraging early feedback and tetsuo.agenc //: 👾 for helping me understand the pragma magic. I hope this exercise can help people peek further into the LLM black box.
Tom Yeh • 302,624 views • 2 years ago

Evolution of Deep Learning by Hand ✍️ As my tribute to Geoff Hinton's Nobel Prize, I drew this animation to illustrate the key idea behind Hinton's major contributions to deep learning over the years, with artistic liberty. ---- 100% original, made by hand ✍️ Join 40k readers of my newsletter:
Tom Yeh • 110,628 views • 7 months ago

I was an English-as-Second-Language learner when I moved to Canada with my family many years ago. I remember doing endless fill-in-the-blank exercises to practice English. Deep Learning Math is also a language. So I thought: why not use the same method to practice this math language? See more 👉
Tom Yeh • 87,792 views • 6 months ago

SORA by Hand ✍️
OpenAI's #SORA took over the Internet when it was announced earlier this year. The technology behind Sora is the Diffusion Transformer (DiT) developed by William Peebles and Saining Xie. How does DiT work?
𝗚𝗼𝗮𝗹: Generate a video conditioned by a text prompt and a series of diffusion steps
[1] Given
↳ Video
↳ Prompt: "sora is sky"
↳ Diffusion step: t = 3
[2] Video → Patches
↳ Divide all pixels in all frames into 4 spacetime patches
[3] Visual Encoder: Pixels 🟨 → Latent 🟩
↳ Multiply the patches with weights and biases, followed by ReLU
↳ The result is a latent feature vector per patch
↳ The purpose is dimension reduction from 4 (2x2x1) to 2 (2x1).
↳ In the paper, the reduction is 196,608 (256x256x3) → 4096 (32x32x4)
[4] ⬛ Add Noise
↳ Sample a noise according to the diffusion time step t. Typically, the larger the t, the smaller the noise.
↳ Add the Sampled Noise to the latent features to obtain the Noised Latent.
↳ The goal is to purposely add noise to a video and ask the model to guess what that noise is.
↳ This is analogous to training a language model by purposely deleting a word in a sentence and asking the model to guess what the deleted word was.
[5-7] 🟪 Conditioning by Adaptive Layer Norm
[5] Encode Conditions
↳ Encode "sora is sky" into a text embedding vector [0,1,-1].
↳ Encode t = 3 as a binary vector [1,1].
↳ Concatenate the two vectors into a 5D column vector.
[6] Estimate Scale/Shift
↳ Multiply the combined vector with weights and biases
↳ The goal is to estimate the scale [2,-1] and shift [-1,5].
↳ Copy the result to (X) and (+)
[7] Apply Scale/Shift
↳ Scale the noised latent by [2,-1]
↳ Shift the scaled noised latent by [-1,5]
↳ The result is the "conditioned" noised latent.
[8-10] Transformer
[8] Self-Attention
↳ Feed the conditioned noised latent to the Query-Key function to obtain a self-attention matrix
↳ Value is omitted for simplicity
[9] Attention Pooling
↳ Multiply the conditioned noised latent with the self-attention matrix
↳ The results are attention-weighted features
[10] Pointwise Feed Forward Network
↳ Multiply the attention-weighted features with weights and biases
↳ The result is the Predicted Noise
🏋️‍♂️ 𝗧𝗿𝗮𝗶𝗻
[11]
↳ Calculate MSE loss gradients by taking the difference between the Predicted Noise and the Sampled Noise (ground truth).
↳ Use the loss gradients to kick off backpropagation to update all learnable parameters (red borders)
↳ Note that the visual encoder's and decoder's parameters are frozen (blue borders)
🎨 𝗚𝗲𝗻𝗲𝗿𝗮𝘁𝗲 (𝗦𝗮𝗺𝗽𝗹𝗲)
[12] Denoise
↳ Subtract the predicted noise from the noised latent to obtain the noise-free latent
[13] Visual Decoder: Latent 🟩 → Pixels 🟨
↳ Multiply the patches with weights and biases, followed by ReLU
[14] Patches → Video
↳ Rearrange patches into a sequence of video frames.
Tom Yeh • 238,065 views • 2 years ago
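
For readers who want to check the arithmetic in the SORA post, here is a minimal Python sketch of steps [5]-[7] (conditioning), [11] (the difference behind the MSE gradient) and [12] (denoising). Only the text embedding [0,1,-1], the time encoding [1,1], the scale [2,-1] and the shift [-1,5] come from the post. The weights `W`, biases `b`, and every latent and noise value are made-up numbers chosen for illustration, and `matvec` is a hypothetical helper, not part of any library.

```python
def matvec(w, x, b):
    """Return W @ x + b, where W is a list of rows."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]

# [5] Encode conditions and concatenate them into a 5D vector.
text_embedding = [0, 1, -1]   # "sora is sky" (from the post)
time_encoding = [1, 1]        # t = 3 as a binary vector (from the post)
condition = text_embedding + time_encoding

# [6] Estimate scale/shift with a linear layer.
# ASSUMPTION: W and b are invented so the output matches the post's values.
W = [
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 0, 1, 1],
]
b = [0, 0, 0, 2]
estimate = matvec(W, condition, b)
scale, shift = estimate[:2], estimate[2:]

# [7] Apply scale/shift: one scale and one shift per latent feature (row).
# ASSUMPTION: a hypothetical noised latent, 2 features x 4 patches.
noised_latent = [[1, 0, 2, -1], [3, 1, 0, 2]]
conditioned = [
    [s * v + sh for v in row]
    for row, s, sh in zip(noised_latent, scale, shift)
]

# [11] Train: the MSE gradient is proportional to (predicted - sampled).
# ASSUMPTION: both noise matrices are hypothetical.
predicted_noise = [[1, 0, 1, -1], [2, 0, 0, 1]]
sampled_noise = [[1, 1, 1, 0], [2, 0, -1, 1]]
difference = [
    [p - s for p, s in zip(p_row, s_row)]
    for p_row, s_row in zip(predicted_noise, sampled_noise)
]

# [12] Generate: subtract the predicted noise from the noised latent.
denoised = [
    [n - p for n, p in zip(n_row, p_row)]
    for n_row, p_row in zip(noised_latent, predicted_noise)
]

print("scale:", scale, "shift:", shift)
print("conditioned:", conditioned)
print("difference:", difference)
print("denoised:", denoised)
```

Steps [8]-[10] are left out here to keep the sketch short.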

DeepSeek by hand ✍️ in Excel. You can explore and study in your web browser 👉
Tom Yeh • 164,796 views • 1 year ago

GPU by hand ✍️ I drew 42 frames to show how a GPU speeds up an array operation of 8 elements in parallel over 4 threads in 2 clock cycles. Below is an overview:
CPU
• It has one core.
• Its global memory has 120 locations (0-119).
• To use the GPU, it needs to copy data from the global memory to the GPU.
• After the GPU is done, it will copy the results back.
GPU
• It has four cores to run four threads (0-3).
• It has a register file of 28 locations (0-27).
• This register file has four banks (0-3).
• All threads share the same register file.
• But they must read/write using the four banks.
• Each bank allows 2 reads (Read 0, Read 1) and 1 write in a single clock cycle.
Tom Yeh • 101,788 views • 11 months ago
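
The "8 elements, 4 threads, 2 clock cycles" count from the GPU post can be reproduced with a small Python simulation. This is a sketch under assumptions: the post does not say which array operation is performed, so elementwise addition is used as a stand-in, and `run_on_gpu` is a hypothetical name. The register file, its banks, and the CPU-to-GPU copies are not modeled.

```python
NUM_THREADS = 4  # four cores running threads 0-3 (from the post)

def run_on_gpu(a, b):
    """Add two arrays, NUM_THREADS elements per simulated clock cycle.

    ASSUMPTION: elementwise addition stands in for the unspecified
    array operation, and each thread finishes one element per cycle.
    """
    n = len(a)
    out = [None] * n
    cycles = 0
    for base in range(0, n, NUM_THREADS):
        # Every thread in this batch works during the same clock cycle.
        for tid in range(NUM_THREADS):
            i = base + tid
            if i < n:
                out[i] = a[i] + b[i]
        cycles += 1
    return out, cycles

a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [10, 20, 30, 40, 50, 60, 70, 80]
result, cycles = run_on_gpu(a, b)
print(result, "in", cycles, "cycles")
```

A single core doing one element per cycle would need 8 cycles for the same array, which is the speedup the frames illustrate.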

Self-Attention by hand ✍️ Excel ~ I designed this exercise for students to practice the QKV math. I also created a medium and a large version to show how the attention matrix grows quadratically as the sequence gets longer. 👇Join the 'AI Math' community. Download xlsx.
Tom Yeh • 125,518 views • 1 year ago
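
The QKV math in the Self-Attention post can also be sketched in plain Python. This is a minimal sketch, not the spreadsheet's exact layout: it assumes the common scaled dot-product form with a row-wise softmax, and the input `x` and the identity weight matrices are made-up values. The helper names (`matmul`, `transpose`, `softmax`, `self_attention`, `attention_size`) are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(row) for row in zip(*m)]

def softmax(row):
    """Numerically stable softmax over one row."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, wq, wk, wv):
    """Return (attention matrix, output) for token rows x."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    d = len(q[0])
    scores = matmul(q, transpose(k))  # n x n, one score per token pair
    attn = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return attn, matmul(attn, v)

# ASSUMPTION: toy input (3 tokens x 2 features) and identity weights.
identity = [[1.0, 0.0], [0.0, 1.0]]
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attn, out = self_attention(x, identity, identity, identity)

def attention_size(n):
    """Count the entries of the attention matrix for n tokens."""
    tokens = [[1.0, 0.0] for _ in range(n)]
    matrix, _ = self_attention(tokens, identity, identity, identity)
    return sum(len(row) for row in matrix)

sizes = [attention_size(n) for n in (3, 6, 12)]
print(sizes)
```

Doubling the sequence length quadruples the number of attention entries, which is the quadratic growth the medium and large versions of the exercise show.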

[Transformer] by Hand✍️📺 5-minute Video Tutorial Anna Rahn made this short video to explain the Transformer exercise for my Computer Vision course last spring. In 5 minutes, she demonstrates the key calculations of the Transformer by hand with pen and paper! Anna is a fantastic student. I am lucky to have her in my lab!
Tom Yeh • 133,347 views • 1 year ago

Anna Rahn's video demonstration of Multi-Layer Perceptron (MLP) by hand ✍️ Anna made this short video for my computer vision course in the spring semester. Together with the spreadsheet exercise I posted yesterday, I hope they help more people understand deep neural networks!
Tom Yeh • 120,362 views • 1 year ago