[Backpropagation] by Hand ✍️
[1] Forward Pass
↳ Given a multi-layer perceptron (3 layers), an input vector X, predictions Y^{Pred} = [0.5, 0.5, 0], and ground truth label Y^{Target} = [0, 1, 0].
[2] Backpropagation
↳ Insert cells to hold our calculations.
[3] Layer 3 - Softmax (blue)
↳ Calculate ∂L / ∂z3 directly using the simple equation: Y^{Pred} - Y^{Target} = [0.5, -0.5, 0].
↳ This simple equation is the benefit of using Softmax and Cross Entropy Loss together.
[4] Layer 3 - Weights (orange) & Biases (black)
↳ Calculate ∂L / ∂W3 and ∂L / ∂b3 by multiplying ∂L / ∂z3 and [ a2 | 1 ].
[5] Layer 2 - Activations (green)
↳ Calculate ∂L / ∂a2 by multiplying ∂L / ∂z3 and W3.
[6] Layer 2 - ReLU (blue)
↳ Calculate ∂L / ∂z2 by multiplying ∂L / ∂a2 with 1 for positive values and 0 otherwise.
[7] Layer 2 - Weights (orange) & Biases (black)
↳ Calculate ∂L / ∂W2 and ∂L / ∂b2 by multiplying ∂L / ∂z2 and [ a1 | 1 ].
[8] Layer 1 - Activations (green)
↳ Calculate ∂L / ∂a1 by multiplying ∂L / ∂z2 and W2.
[9] Layer 1 - ReLU (blue)
↳ Calculate ∂L / ∂z1 by multiplying ∂L / ∂a1 with 1 for positive values and 0 otherwise.
[10] Layer 1 - Weights (orange) & Biases (black)
↳ Calculate ∂L / ∂W1 and ∂L / ∂b1 by multiplying ∂L / ∂z1 and [ x | 1 ].
[11] Gradient Descent
↳ Update weights and biases (typically a learning rate is applied here).
💡 Matrix Multiplication is All You Need: Just like in the forward pass, backpropagation is all about matrix multiplications. You can definitely do everything by hand as I demonstrated in this exercise, albeit slowly and imperfectly. This is why GPUs' ability to multiply matrices efficiently plays such an important role in the evolution of deep learning. This is why NVIDIA is now close to $1 trillion in valuation.
💡 Exploding Gradients: We can already see the gradients getting larger as we back-propagate up, even in this simple 3-layer network. This motivates using methods like skip connections to handle exploding (or vanishing) gradients, as in ResNet.
I did the calculations entirely by hand. Please let me know if you spot any error or have any questions!
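Steps [3] to [6] and [11] can be sketched in plain Python. This is a minimal sketch, not the worksheet itself: only Y^{Pred} and Y^{Target} come from the exercise, while `z2`, `a2`, `W3`, and the learning rate `lr` are hypothetical values chosen to keep the arithmetic easy to follow.

```python
# Backpropagation through the last layer and the ReLU before it.
# y_pred and y_target are from the exercise; z2, a2, W3, lr are hypothetical.

def outer(u, v):
    """Outer product u v^T as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]

def matvec_transposed(W, u):
    """Compute W^T u for a matrix W stored as a list of rows."""
    return [sum(W[i][j] * u[i] for i in range(len(W))) for j in range(len(W[0]))]

y_pred = [0.5, 0.5, 0.0]
y_target = [0.0, 1.0, 0.0]

# [3] Softmax + cross-entropy: dL/dz3 = y_pred - y_target
dz3 = [p - t for p, t in zip(y_pred, y_target)]

# Hypothetical layer-2 pre-activations, a2 = ReLU(z2), and a 3 x 2 weight matrix W3
z2 = [2.0, -1.0]
a2 = [max(0.0, z) for z in z2]
W3 = [[1.0, -1.0], [0.0, 2.0], [1.0, 1.0]]

# [4] dL/dW3 and dL/db3 in one matrix: dz3 times [ a2 | 1 ]
dW3_b3 = outer(dz3, a2 + [1.0])

# [5] dL/da2 = W3^T dz3
da2 = matvec_transposed(W3, dz3)

# [6] ReLU passes the gradient where z2 > 0 and blocks it elsewhere
dz2 = [g if z > 0 else 0.0 for g, z in zip(da2, z2)]

# [11] One gradient-descent step on W3 (zip stops before the bias column)
lr = 0.1
W3_new = [[w - lr * g for w, g in zip(w_row, g_row)]
          for w_row, g_row in zip(W3, dW3_b3)]

print("dL/dz3:", dz3)
print("dL/d[W3|b3]:", dW3_b3)
print("dL/da2:", da2)
print("dL/dz2:", dz2)
```

The last column of `dW3_b3` is the bias gradient, which is why appending a 1 to `a2` lets one multiplication produce both gradients.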

Tom Yeh

56,015 subscribers

64,645 views • 2 years ago • via X (Twitter)

Related videos

Batch Normalization by hand ✍️ ~ 7 steps walkthrough below
Batch normalization is common practice for improving training and achieving faster convergence. It sounds simple. But it is often misunderstood.
🤔 Does batch normalization involve trainable parameters, tunable hyper-parameters, or both?
🤔 Is batch normalization applied to inputs, features, weights, biases, or outputs?
🤔 How is batch normalization different from layer normalization?
So I drew and calculated one entirely by hand.
Goal: normalize a mini-batch of 4 examples to mean 0 and variance 1, then let the network scale it back.
= 1. Given =
A mini-batch of 4 training examples, each with 3 features.
= 2. Linear layer =
Let us multiply by the weights and add the biases. Batch norm sits after this, which answers the second question: what gets normalized is features, not inputs, weights or biases.
= 3. ReLU =
We apply the activation, and -2 becomes 0. Negative values are suppressed before any statistic is taken.
= 4. Batch statistics =
Let us compute the sum, mean, variance and standard deviation, one row at a time. A row is a feature and the four columns are the four examples, so every number here measures one feature against the rest of the batch. That is the "batch" in batch normalization, and it is exactly what layer normalization does not do. The statistics are rounded to whole numbers, which is what keeps the rest of the page doable in pen.
= 5. Shift to mean 0 =
We subtract the mean, in green. The four values in each feature now average to zero.
= 6. Scale to variance 1 =
Let us divide by the standard deviation, in orange. Each feature now has variance one, whatever scale it arrived at.
= 7. Scale and shift =
We multiply by a linear transformation and pass the result on. The diagonal and the last column are trainable, so having just forced every feature to mean 0 and variance 1, we hand the network the means to undo it.
The outputs:
Mean of each feature = [2, 1, 2]
Std dev of each feature = [1, 1, 2]
To the next layer = [2, -2, 2, 0], [-3, 3, 6, -3], [2, 0, 1, 2]
The answers:
🤔 Both. The scale and shift are trainable, the statistics are not. Epsilon and the momentum on the running statistics are the hyper-parameters, and one mini-batch by hand needs neither.
🤔 Features, after the linear layer, not inputs, weights or biases.
🤔 Batch norm measures across the batch, one feature at a time. Layer norm measures across the features, one example at a time.
💾 Save this post!
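Steps 4 to 7 can be sketched in plain Python. This is a minimal sketch under stated assumptions: the post-ReLU `features`, the scale `gamma`, and the shift `beta` are hypothetical values, not the numbers from the drawing. They were picked so that the statistics come out as whole numbers (and happen to reproduce the means and standard deviations listed in the post). Epsilon is left out, as in the by-hand version.

```python
import math

# Rows are features, columns are the 4 examples in the mini-batch.
# Hypothetical post-ReLU values, not the numbers from the drawing.
features = [
    [1.0, 3.0, 1.0, 3.0],
    [0.0, 2.0, 0.0, 2.0],
    [0.0, 4.0, 0.0, 4.0],
]
# Hypothetical trainable scale (gamma) and shift (beta), one per feature.
gamma = [2.0, 3.0, 1.0]
beta = [0.0, 0.0, 1.0]

def batch_norm_row(row, g, b):
    """Normalize one feature across the batch, then scale and shift it."""
    mean = sum(row) / len(row)
    var = sum((v - mean) ** 2 for v in row) / len(row)  # population variance
    std = math.sqrt(var)                                # no epsilon in this sketch
    normalized = [(v - mean) / std for v in row]        # mean 0, variance 1
    return mean, std, [g * v + b for v in normalized]

results = [batch_norm_row(r, g, b) for r, g, b in zip(features, gamma, beta)]
means = [m for m, _, _ in results]
stds = [s for _, s, _ in results]
outputs = [o for _, _, o in results]

print("means:", means)
print("stds:", stds)
print("to the next layer:", outputs)
```

Layer normalization would compute the same statistics down each column (one example at a time) rather than along each row.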

Tom Yeh

20,518 views • 10 days ago

MLP in PyTorch by hand ✍️ ~ 7 steps walkthrough below
Goal: fill in every blank in the PyTorch code to build a multi-layer perceptron.
1. Given
Let us start with a code template on the left and the network it is supposed to build on the right. Every blank in the code can be worked out from the picture.
2. Linear layer
We count: 3 features in, 4 features out. So the weight matrix is 4 by 3. There is an extra column for the biases, which means bias = T.
3. ReLU
Let us apply the activation. ReLU crosses out the negatives, so -1 becomes 0.
4. Linear layer
The input size is 4, because that is what the previous layer put out. The output size is 2. A 2 by 4 weight matrix, and this time no extra column, so bias = F.
5. ReLU
We cross out the negatives again.
6. Linear layer
Two features in, five out. A 5 by 2 weight matrix, with a bias column, so bias = T.
7. Sigmoid
Let us finish. Sigmoid squashes the raw scores (3, 0, -2, 5, -5) into probabilities between 0 and 1.
You have just implemented a three-layer deep neural network by hand. ✍️
== Story ==
Three years ago I gave this exercise to my students, to connect the code to the math. They found it odd. Every other AI course they were taking lived inside a Jupyter notebook, and here I was handing out paper.
Three years later, my colleagues are the ones rushing to move their materials to paper. The exercise has not changed. Paper still asks the one thing a notebook lets you skip: do you actually understand what the code is doing?
If you can tell me why the weight matrix is 4 by 3, and why bias is F on the second layer, you understand nn.Linear better than someone who has been copy-pasting it for a year.
💾 Save this post! #AIbyHand #PyTorch #DeepLearning
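The layer sizes above can be checked with a dependency-free sketch that follows the `nn.Linear` convention of storing the weight as out_features by in_features. In PyTorch the three layers would correspond to `nn.Linear(3, 4, bias=True)`, `nn.Linear(4, 2, bias=False)`, and `nn.Linear(2, 5, bias=True)`. The input `x` and every weight and bias below are hypothetical, invented so that the final raw scores come out as the (3, 0, -2, 5, -5) quoted in step 7; they are not the values from the worksheet.

```python
import math

def linear(x, W, b=None):
    """y = W x (+ b), with W stored as out_features x in_features."""
    y = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    return y if b is None else [yi + bi for yi, bi in zip(y, b)]

def relu(v):
    return [max(0.0, t) for t in v]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-t)) for t in v]

# Hypothetical parameters with the shapes from the walkthrough.
W1 = [[1, 0, -1], [0, 1, 0], [1, 1, 0], [-1, 0, 1]]   # 4 x 3, bias = T
b1 = [0, 1, 0, -1]
W2 = [[1, 0, 1, 0], [0, 1, 0, -1]]                    # 2 x 4, bias = F
W3 = [[1, 0], [0, 1], [1, -1], [1, 1], [-1, 0]]       # 5 x 2, bias = T
b3 = [0, -2, -3, 0, -2]

x = [1, 2, 3]                      # hypothetical input with 3 features
h1 = relu(linear(x, W1, b1))       # 4 features, negatives crossed out
h2 = relu(linear(h1, W2))          # 2 features, no bias on this layer
scores = linear(h2, W3, b3)        # 5 raw scores
probs = sigmoid(scores)            # 5 values between 0 and 1

print("raw scores:", scores)
print("probabilities:", [round(p, 2) for p in probs])
```

Reading the shapes off the nested lists (4 rows of 3, 2 rows of 4, 5 rows of 2) is the same counting exercise as on paper.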

Tom Yeh

13,318 views • 17 days ago

[Graph Convolutional Network] by hand ✍️
Graph Convolutional Networks (GCNs), introduced by Thomas Kipf and Max Welling in 2017, have emerged as a powerful tool in the analysis and interpretation of data structured as graphs. This exercise demonstrates how a GCN works in a simple application: binary classification.
-- Goal --
Predict if a node in a graph is X.
-- Architecture --
🟪 Graph Convolutional Network (GCN)
1. GCN1(4,3)
2. GCN2(3,3)
🟦 Fully Connected Network (FCN)
1. Linear1(3,5)
2. ReLU
3. Linear2(5,1)
4. Sigmoid
Simplifications:
• Adjacency matrices are not normalized.
• ReLU is applied to messages directly.
-- Walkthrough --
[1] Given
↳ A graph with five nodes A, B, C, D, E
[2] 🟩 Adjacency Matrix: Neighbors
↳ Add 1 for each edge to neighbors
↳ Repeat in both directions (e.g., A->C, C->A)
↳ Repeat for both GCN layers
[3] 🟩 Adjacency Matrix: Self
↳ Add 1's for each self loop
↳ Equivalent to adding the identity matrix
↳ Repeat for both GCN layers
[4] 🟪 GCN1: Messages
↳ Multiply the node embeddings 🟨 with weights and biases
↳ Apply ReLU (negatives → 0)
↳ The result is one message per node
[5] 🟪 GCN1: Pooling
↳ Multiply the messages with the adjacency matrix
↳ The purpose is to pool messages from each node's neighbors as well as from the node itself.
↳ The result is a new feature per node
[6] 🟪 GCN1: Visualize
↳ For node 1, visualize how messages are pooled to obtain a new feature for better understanding
↳ [3,0,1] + [1,0,0] = [4,0,1]
[7] 🟪 GCN2: Messages
↳ Multiply the node features with weights and biases
↳ Apply ReLU (negatives → 0)
↳ The result is one message per node
[8] 🟪 GCN2: Pooling
↳ Multiply the messages with the adjacency matrix
↳ The result is a new feature per node
[9] 🟪 GCN2: Visualize
↳ For node 3, visualize how messages are pooled to obtain a new feature for better understanding
↳ [1,2,4] + [1,3,5] + [0,0,1] = [2,5,10]
[10] 🟦 FCN: Linear 1 + ReLU
↳ Multiply node features with weights and biases
↳ Apply ReLU (negatives → 0)
↳ The result is a new feature per node
↳ Unlike in GCN layers, no messages from other nodes are included.
[11] 🟦 FCN: Linear 2
↳ Multiply node features with weights and biases
[12] 🟦 FCN: Sigmoid
↳ Apply the Sigmoid activation function
↳ The purpose is to obtain a probability value for each node
↳ One way to calculate Sigmoid by hand ✍️ is to use the approximation below:
• >= 3 → 1
• 0 → 0.5
• <= -3 → 0
-- Outputs --
A: 0 (Very unlikely)
B: 1 (Very likely)
C: 1 (Very likely)
D: 1 (Very likely)
E: 0.5 (Neutral)
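The adjacency and pooling steps ([2], [3], [5]) can be sketched in plain Python. This is a minimal sketch on a hypothetical three-node graph with a single edge A-C, not the five-node graph from the exercise, whose edges are not listed in the text. The `messages` are also hypothetical, chosen so that pooling for node A repeats the [3,0,1] + [1,0,0] = [4,0,1] sum from step [6].

```python
# Hypothetical graph: three nodes, one undirected edge A-C.
nodes = ["A", "B", "C"]
edges = [("A", "C")]
index = {name: i for i, name in enumerate(nodes)}
n = len(nodes)

# [2] Neighbors: add 1 for each edge, in both directions.
adjacency = [[0] * n for _ in range(n)]
for u, v in edges:
    adjacency[index[u]][index[v]] = 1
    adjacency[index[v]][index[u]] = 1

# [3] Self loops: equivalent to adding the identity matrix.
for i in range(n):
    adjacency[i][i] += 1

# Hypothetical messages (one row per node), as they would look after
# the linear transform and ReLU in step [4].
messages = [
    [3, 0, 1],  # A
    [0, 2, 0],  # B
    [1, 0, 0],  # C
]

# [5] Pooling: adjacency matrix times messages (no normalization).
dim = len(messages[0])
pooled = [
    [sum(adjacency[i][k] * messages[k][j] for k in range(n)) for j in range(dim)]
    for i in range(n)
]

for name, row in zip(nodes, pooled):
    print(name, row)
```

Node B has no neighbors, so its pooled feature is just its own message; that is the effect of the self loop on its own.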

Tom Yeh

46,779 views • 1 year ago

Transformer by hand ✍️ ~ 6 steps walkthrough below
Open the hood of a transformer and the parts list is overwhelming: embeddings, positional encoding, attention weighting, self-attention, cross-attention, multi-head attention, layer norm, skip connections, softmax, linear, Nx, shifted right, query, key, value, masking.
Which of those actually make the car run? Two of them. Attention weighting and the feed-forward network. Everything else is an enhancement to make it run faster and longer, which is how we got from a car to a truck, and to the word "large" in large language model.
So I drew and calculated those two parts entirely by hand.
Goal: push five features through one transformer block, filling in every cell yourself.
1. Given
Five positions of input features, arriving from the previous block.
2. Attention matrix
Let us feed all five features to a query-key module (QK) and read back an attention weight matrix, A. The details of that module are a post of their own.
3. Attention weighting
We multiply the input features by A to get the attention weighted features, Z. Still five positions. The effect is to combine features *across positions*, horizontally: X1 becomes X1 + X2, X2 becomes X2 + X3, and so on.
4. First layer
Let us feed all five weighted features into the first layer of the FFN. Multiply by the weights and biases. This time the combining happens *across feature dimensions*, vertically, and each feature grows from 3 numbers to 4. Note that every position goes through the same weight matrix. That is what "position-wise" means.
5. ReLU
We cross out the negatives. They become zeros.
6. Second layer
Let us bring it back down: 4 dimensions to 3. The output feeds the next block, which has a completely separate set of parameters, and the whole thing runs again.
You have just calculated a transformer block by hand. ✍️
The takeaway: the two parts are doing two different jobs, and neither one alone is enough.
Attention mixes *across positions*, so a feature can see its neighbours. The FFN mixes *across feature dimensions*, so each position can think about itself. Horizontal, then vertical. Then that pattern repeats N times, each block with its own separate set of weights. That is the Nx from the list up top, and that is what makes the transformer run.
💾 Save this post! #AIbyHand #Transformers #DeepLearning
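The two jobs can be sketched in plain Python. This is a minimal sketch under stated assumptions: the input `X`, the attention matrix `A`, and the FFN parameters are all hypothetical, with `A` written so that each position adds its right-hand neighbour, as described in step 3. Positions are stored as rows here, so the attention step is written as A times X; the drawing may lay features out as columns instead.

```python
def matmul(P, Q):
    """Matrix product of two lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]

def linear(x, W, b):
    """y = W x + b for one position; W is out_dim x in_dim."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def relu(v):
    return [max(0, t) for t in v]

# Hypothetical input: 5 positions (rows), 3 feature dimensions each.
X = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0], [0, 1, 1]]

# Hypothetical attention weights: position i attends to itself and to i + 1.
A = [
    [1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1],
]

# Attention weighting: mixes across positions.
Z = matmul(A, X)

# Hypothetical FFN parameters: 3 -> 4 -> 3.
W1 = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, -1]]
b1 = [0, 0, 0, 0]
W2 = [[1, 0, 0, 1], [0, 1, 0, 0], [0, 0, 1, 0]]
b2 = [0, 0, 0]

# Position-wise FFN: the same weights are applied to every position.
out = [linear(relu(linear(z, W1, b1)), W2, b2) for z in Z]

print("Z:", Z)
print("block output:", out)
```

The list comprehension over `Z` is the "position-wise" part: no position sees another inside the FFN, only inside the attention step.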

Tom Yeh

25,796 views • 14 days ago

[Discrete Fourier Transform] by Hand ✍️
In signal processing, the Discrete Fourier Transform (DFT) is no doubt the most important method. But the math involved is extremely complex, literally, involving a summation over a complex number term e^(-iwt).
I developed this exercise to demonstrate that underneath such complexity, DFT is just a series of matrix multiplications you can calculate by hand. ✍️
Once you see that, it should not surprise you that a deep neural network, which is also a series of matrix multiplications, with activation functions in-between, can learn to perform DFT to process and analyze signals so effectively.
How does DFT work?
[1] Given
↳ Signals A, B, and C in the 🟧 frequency domain:
◦ A = cos(w) + 2cos(2w)
◦ B = cos(w) + cos(3w) + cos(4w)
◦ C = -cos(2w) + cos(3w)
◦ Each signal is a weighted sum of four cosine waves at frequencies 1w, 2w, 3w, and 4w.
◦ We will apply Inverse DFT to convert the signals to time domain representations, and then demonstrate DFT can convert back to their original frequency domain representations.
↳ Signal X in the 🟩 time domain. X is sampled at 10 time points 1t, 2t, …, 10t:
◦ X = [-2.5, -1.8, 3, -0.7, -1.0, -0.7, 3, -1.8, -2.5, 5]
◦ Suppose X is also a weighted sum of the same four cosine waves, but we don’t already know their weights. We will apply DFT to discover them.
[2] 🟧 Frequency Matrix (F)
↳ Write the coefficients of A, B, C as a matrix F. Each signal is a row. Each frequency is a column.
↳ A → [1, 2, 0, 0]
↳ B → [1, 0, 1, 1]
↳ C → [0, -1, 1, 0]
[3] Cosine → Discrete
↳ Sample from the continuous cosine waves at discrete time points 1t, 2t, 3t, to 10t.
[4] Cosine Matrix (W)
↳ Write the samples as a matrix. Each frequency is a row. Each time point is a column.
[5] Inverse DFT: 🟧 Frequency → 🟩 Time
↳ Multiply the frequency matrix F and the cosine matrix W.
↳ The meaning of this multiplication is to linearly combine the four cosine waves (rows in W) into time-domain signals (rows in T) using the weights specified in F.
↳ The result is matrix T, which holds signals A, B, C converted to the time domain. Each signal is a row. Each time point is a column.
[6] Transpose
↳ Transpose T, converting each signal’s time domain representation from a row to a column.
[7] DFT: 🟩 Time → 🟧 Frequency
↳ Multiply the cosine matrix W with the transpose of matrix T.
↳ The purpose of this multiplication is to take a dot-product between each time-domain signal (columns in the transpose of T) and each cosine wave (rows in W), which has the effect of projecting the signal onto a cosine wave to determine how much they are correlated. Zero means not correlated at all.
↳ The result is an intermediate version of the “recovered” frequency matrix where each column corresponds to a signal and each row corresponds to a frequency.
↳ Compared to the original frequency matrix F, this intermediate matrix has non-zero weights in the correct places, but scaled up by a factor of 5 (n/2, n=10). For example, signal A, originally [1,2,0,0], is recovered as [5,10,0,0].
[8] Scale
↳ Multiply each value by 2/n = 1/5 to scale down the intermediate matrix to match the magnitude of the original frequency matrix F.
[9] Transpose
↳ Transpose the recovered frequency matrix back to the same orientation as the original frequency matrix F.
↳ Like magic 🪄, the result is identical to the original F, which means DFT successfully recovered the frequency components of signals A, B, C.
[10] Apply DFT to X: 🟩 Time → 🟧 Frequency
↳ Now that we have some confidence in DFT’s ability to recover frequency components, we apply DFT to X’s time-domain representation by multiplying W with X.
↳ The result is an intermediate matrix.
[11] Scale
↳ Similarly, we scale down by a factor of 5 to obtain the recovered frequency components of X (a column).
[12] Transpose
↳ Similarly, we transpose the recovered column to a row to match the orientation of the frequency matrix.
↳ Using the coefficients [0,0,3,2], we can write the equation of X as 3cos(3w) + 2cos(4w).
Notes: I hope this by-hand exercise helps you understand the essence of DFT. But there are more technical details, such as:
• Sine: The complete DFT math also includes sine waves that follow a similar calculation process.
• Phase: Here, we assume all the cosine waves are aligned at the origin, namely, phase is 0. If a phase p is added, for example, cos(w+p), we will need to calculate the sine component and use their ratio to figure out what p is.
• Magnitude: If phase is not zero, the magnitude will need to be calculated by combining both cosine and sine terms.
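The twelve steps can be reproduced in plain Python. This is a minimal sketch under one assumption that the text does not spell out: the cosine wave at frequency k is sampled as cos(2πkt/10) for t = 1, …, 10. That choice is consistent with the listed samples of X (for instance the last sample, 5 = 3 + 2, at t = 10). F and X are taken from the exercise; X is used with its one-decimal rounding, so the recovered coefficients are rounded before comparing.

```python
import math

n = 10                    # number of time points
freqs = [1, 2, 3, 4]      # the four cosine frequencies

# [3]-[4] Cosine matrix W: one row per frequency, one column per time point.
# Assumed sampling: cos(2*pi*k*t/n) for t = 1..n.
W = [[math.cos(2 * math.pi * k * t / n) for t in range(1, n + 1)] for k in freqs]

# [2] Frequency matrix F: one row per signal (A, B, C).
F = [[1, 2, 0, 0], [1, 0, 1, 1], [0, -1, 1, 0]]

def matmul(P, Q):
    """Matrix product of two lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]

def transpose(M):
    return [list(row) for row in zip(*M)]

# [5] Inverse DFT: frequency -> time.
T = matmul(F, W)                                   # 3 signals x 10 time points

# [6]-[7] DFT: project each time-domain signal onto each cosine wave.
intermediate = matmul(W, transpose(T))             # 4 frequencies x 3 signals

# [8]-[9] Scale by 2/n and transpose back to the orientation of F.
recovered = transpose([[2 / n * v for v in row] for row in intermediate])
recovered_rounded = [[round(v) for v in row] for row in recovered]

# [10]-[12] Apply the same projection and scaling to X.
X = [-2.5, -1.8, 3, -0.7, -1.0, -0.7, 3, -1.8, -2.5, 5]
x_coeffs = [2 / n * sum(w * x for w, x in zip(row, X)) for row in W]
x_rounded = [round(c) for c in x_coeffs]

print("recovered F:", recovered_rounded)
print("coefficients of X:", x_rounded)
```

The recovery works because, with this sampling, the four cosine rows of W are mutually orthogonal and each has a squared length of n/2 = 5, which is where the scale factor in step [8] comes from.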

Tom Yeh

116,622 views • 2 years ago

[LSTM] by Hand ✍️
LSTMs were the most effective architecture for processing long sequences of data, until our world was taken over by the Transformers.
LSTMs belong to the broader family of recurrent neural networks (RNNs) that process data sequentially in a recurrent manner.
Transformers, on the other hand, abandon recurrence and use self-attention instead to process data concurrently in parallel.
Recently, there has been renewed interest in recurrence as people realized self-attention doesn’t scale to extremely long sequences, like hundreds of thousands of tokens. Mamba is a good example of bringing back recurrence.
All of a sudden, it is cool to study LSTMs.
How do LSTMs work?
[1] Given
↳ 🟨 Input sequence X1, X2, X3 (d = 3)
↳ 🟩 Hidden state h (d = 2)
↳ 🟦 Memory C (d = 2)
↳ Weight matrices Wf, Wc, Wi, Wo
Process t = 1
[2] Initialize
↳ Randomly set the previous hidden state h0 to [1, 1] and memory cells C0 to [0.3, -0.5]
[3] Linear Transform
↳ Multiply the four weight matrices with the concatenation of current input (X1) and the previous hidden state (h0).
↳ The results are feature values, each a linear combination of the current input and hidden state.
[4] Non-linear Transform
↳ Apply sigmoid σ to obtain gate values (between 0 and 1).
• Forget gate (f1): [-4, -6] → [0, 0]
• Input gate (i1): [6, 4] → [1, 1]
• Output gate (o1): [4, -5] → [1, 0]
↳ Apply tanh to obtain candidate memory values (between -1 and 1)
• Candidate memory (C’1): [1, -6] → [0.8, -1]
[5] Update Memory
↳ Forget (C0 .* f1): Element-wise multiply the current memory with forget gate values.
↳ Input (C’1 .* i1): Element-wise multiply the “candidate” memory with input gate values.
↳ Update the memory to C1 by adding the two terms above: C0 .* f1 + C’1 .* i1 = C1
[6] Candidate Output
↳ Apply tanh to the new memory C1 to obtain candidate output o’1. [0.8, -1] → [0.7, -0.8]
[7] Update Hidden State
↳ Output (o’1 .* o1 → h1): Element-wise multiply the candidate output with the output gate.
↳ The result is the updated hidden state h1
↳ Also, it is the first output.
Process t = 2
[8] Initialize
↳ Copy previous hidden state h1 and memory C1
[9] Linear Transform
↳ Repeat [3]
[10] Update Memory (C2)
↳ Repeat [4] and [5]
[11] Update Hidden State (h2)
↳ Repeat [6] and [7]
Process t = 3
[12] Initialize
↳ Copy previous hidden state h2 and memory C2
[13] Linear Transform
↳ Repeat [3]
[14] Update Memory (C3)
↳ Repeat [4] and [5]
[15] Update Hidden State (h3)
↳ Repeat [6] and [7]
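Steps [4] to [7] for t = 1 can be sketched in plain Python, starting from the pre-activation values quoted in step [4], since the weight matrices themselves are not given in the text. This is a minimal sketch that uses exact sigmoid and tanh rather than the by-hand rounding, so the results agree with the worksheet only approximately (the memory rounds to the same [0.8, -1], while the hidden state differs slightly in the first decimal).

```python
import math

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-t)) for t in v]

def tanh(v):
    return [math.tanh(t) for t in v]

def mul(u, v):
    """Element-wise product (written .* in the walkthrough)."""
    return [a * b for a, b in zip(u, v)]

# Values from the exercise: previous memory and the four pre-activations.
c0 = [0.3, -0.5]
f1 = sigmoid([-4, -6])      # forget gate, close to [0, 0]
i1 = sigmoid([6, 4])        # input gate, close to [1, 1]
o1 = sigmoid([4, -5])       # output gate, close to [1, 0]
c_cand = tanh([1, -6])      # candidate memory

# [5] Update memory: C1 = C0 .* f1 + C'1 .* i1
c1 = [a + b for a, b in zip(mul(c0, f1), mul(c_cand, i1))]

# [6]-[7] Candidate output tanh(C1), gated by o1, gives the hidden state h1.
h1 = mul(tanh(c1), o1)

print("C1:", [round(v, 2) for v in c1])
print("h1:", [round(v, 2) for v in h1])
```

Because the forget gate is nearly closed and the input gate nearly open, the old memory is almost entirely replaced by the candidate, which is what the by-hand numbers show.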

Tom Yeh

72,966 views • 2 years ago

THIS GUY JUST REBUILT A $35,000 ANIMATED SITE FOR $12. IF YOU RUN A WEB STUDIO, YOU SHOULD PROBABLY KEEP SCROLLING. Every agency billing $100-149/hr is selling you five departments wearing one invoice. Here’s each one - collapsed into a single agentic session. LAYER 1 - THE CONCEPT ROOM (Claude) Reads the brief, pulls references, and scripts the scroll: what the visitor feels at second 3, second 15, second 40. → Used to be a strategist and a wall of mood boards. Now it’s a conversation. LAYER 2 - THE MOTION STUDIO (Higgsfield) Cinematic clips from 30+ generative models - hero shots, transitions, ambient loops - all matched to the story from Layer 1. → Used to be a motion artist on retainer. Now it’s a prompt. LAYER 3 - THE DEV TEAM (Claude Code) Scaffolds the site, writes the GSAP ScrollTrigger timelines and Lenis smooth-scroll, extracts frames, optimizes every asset. → A full scroll-driven build with zero hand-coded keyframes. LAYER 4 - THE DESIGN DEPT (baked-in cinematic layer) Six effects, zero config: film grain, particles, vignette, glass cards, color tints, scroll pacing. → The polish that justified the invoice - now it ships by default. LAYER 5 - THE QA PASS (Claude) Checks load speed, mobile breakpoints, and whether the scroll actually lands - then rewrites whatever doesn’t. → Used to be a client call and a revision cycle. Now it’s one more turn in the same session. Five departments. One operator. One pass. A strategist, a motion artist, a developer, a designer, and a QA lead - weeks of handoffs - now run in a single session. For a Claude subscription and a few dollars of Higgsfield credits. The studio was never selling talent. It was selling overhead. And the overhead just became five layers. Follow me, reply “website” to this post and I will send you the step-by-step Playbook 👇

ZEUS⚡️

141,226 views • 1 month ago

Microsoft made 100B parameter models run on a single CPU.
bitnet.cpp: The official inference framework for 1-bit LLMs.
The math behind 1-bit LLMs is what makes them revolutionary.
Traditional LLMs use 16-bit floating point weights. Every parameter is a number like 0.0023847 or -1.4729. When you run inference, you multiply these floats together. Billions of times. That's why you need GPUs: they're optimized for floating point matrix multiplication.
BitNet b1.58 uses ternary weights: {-1, 0, 1}. That's not a simplification. That's a fundamental change in the math.
When your weights are only -1, 0, or 1:
→ Multiply by 1 = keep the value
→ Multiply by -1 = flip the sign
→ Multiply by 0 = skip entirely
Matrix multiplication becomes addition and subtraction. The weight multiplications disappear, so a GPU is no longer required.
This is why bitnet.cpp achieves:
→ 2.37x to 6.17x speedup on x86 CPUs
→ 1.37x to 5.07x speedup on ARM CPUs
→ 71.9% to 82.2% energy reduction on x86
→ 55.4% to 70.0% energy reduction on ARM
The speedups scale with model size. Larger models see bigger gains because there are more operations to simplify.
A 100B parameter model running at human reading speed (5-7 tokens/second) on a single CPU. That's not optimization. That's a different paradigm.
Why 1.58 bits? Because log₂(3) ≈ 1.58. Three possible values = 1.58 bits of information per weight.
The key insight: These models aren't quantized after training. They're trained from scratch with ternary weights. The model learns to work within the constraint, and the authors report quality on par with full-precision baselines at the larger model sizes they tested.
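
The "multiplication becomes addition and subtraction" point can be illustrated with a tiny matrix-vector product. This is a minimal sketch with made-up weights and activations, not bitnet.cpp's actual kernel (which uses packed weights and optimized lookup routines); `ternary_matvec` is a hypothetical name.

```python
import math

def ternary_matvec(weights, x):
    """Matrix-vector product for weights in {-1, 0, 1} using only add/subtract."""
    out = []
    for row in weights:
        acc = 0
        for w, v in zip(row, x):
            if w == 1:
                acc += v        # multiply by 1: keep the value
            elif w == -1:
                acc -= v        # multiply by -1: flip the sign
            # w == 0: skip entirely
        out.append(acc)
    return out

# Hypothetical 2x3 ternary weight matrix and activation vector
W = [[1, 0, -1],
     [-1, 1, 1]]
x = [3, 5, 2]

result = ternary_matvec(W, x)
# Reference result using ordinary multiplication, for comparison
reference = [sum(w * v for w, v in zip(row, x)) for row in W]
print(result, reference)

# Information content of one three-valued weight
bits_per_weight = math.log2(3)
print(round(bits_per_weight, 2))
```

Both paths give the same result; the ternary path just never executes a multiply.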

Tech with Mak

23,036 views • 3 months ago

Vector Database by Hand ✍️
Vector databases are revolutionizing how we search and analyze complex data. They have become the backbone of Retrieval Augmented Generation (#RAG).
How do vector databases work?
[1] Given
↳ A dataset of three sentences, each with 3 words (or tokens)
↳ In practice, a dataset may contain millions or billions of sentences. The max number of tokens may be tens of thousands (e.g., 32,768 for mistral-7b).
Process "how are you"
[2] 🟨 Word Embeddings
↳ For each word, look up the corresponding word embedding vector from a table of 22 vectors, where 22 is the vocabulary size.
↳ In practice, the vocabulary size can be tens of thousands. The word embedding dimensions are in the thousands (e.g., 1024, 4096).
[3] 🟩 Encoding
↳ Feed the sequence of word embeddings to an encoder to obtain a sequence of feature vectors, one per word.
↳ Here, the encoder is a simple one-layer perceptron (linear layer + ReLU).
↳ In practice, the encoder is a transformer or one of its many variants.
[4] 🟩 Mean Pooling
↳ Merge the sequence of feature vectors into a single vector using "mean pooling," which averages across the columns.
↳ The result is a single vector. We often call it a "text embedding" or "sentence embedding."
↳ Other pooling techniques are possible, such as CLS. But mean pooling is the most common.
[5] 🟦 Indexing
↳ Reduce the dimensions of the text embedding vector with a projection matrix. The reduction rate is 50% (4->2).
↳ In practice, the values in this projection matrix are much more random.
↳ The purpose is similar to that of hashing: obtain a short representation that allows faster comparison and retrieval.
↳ The resulting dimension-reduced index vector is saved in the vector storage.
[6] Process "who are you"
↳ Repeat [2]-[5]
[7] Process "who am I"
↳ Repeat [2]-[5]
Now we have indexed our dataset in the vector database.
[8] 🟥 Query: "am I you"
↳ Repeat [2]-[5]
↳ The result is a 2-d query vector.
[9] 🟥 Dot Products
↳ Take the dot product between the query vector and each database vector. They are all 2-d.
↳ The purpose is to use the dot product to estimate similarity.
↳ By transposing the query vector, this step becomes a matrix multiplication.
[10] 🟥 Nearest Neighbor
↳ Find the largest dot product by linear scan.
↳ The sentence with the highest dot product is "who am I."
↳ In practice, because scanning billions of vectors is slow, we use an Approximate Nearest Neighbor (ANN) algorithm like Hierarchical Navigable Small World (HNSW).
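
The whole pipeline from [2] to [10] fits in a few lines of plain Python. This is a minimal sketch under loud assumptions: the embedding table, encoder weights, and projection matrix below are invented for illustration (the post does not list its numbers, and its vocabulary has 22 entries rather than 6), and the encoder omits the bias term.

```python
# All numbers below are hypothetical; they are not the values from the exercise.
EMBEDDINGS = {
    "how": [1, 0, 0, 0],
    "are": [0, 1, 0, 0],
    "you": [0, 0, 1, 0],
    "who": [0, 0, 0, 1],
    "am":  [0, 1, 0, 1],
    "I":   [0, 0, 1, 1],
}
ENCODER_W = [[1, -1, 0, 0],
             [0, 1, 0, 0],
             [0, 0, 1, -1],
             [0, 0, 0, 1]]
PROJECTION = [[1, 1, 0, 0],   # 4 -> 2 dimension reduction
              [0, 0, 1, 1]]

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def relu(vec):
    return [max(0.0, v) for v in vec]

def index_vector(sentence):
    """Steps [2]-[5]: embed, encode, mean-pool, project."""
    features = [relu(matvec(ENCODER_W, EMBEDDINGS[w])) for w in sentence.split()]
    pooled = [sum(col) / len(features) for col in zip(*features)]
    return matvec(PROJECTION, pooled)

# Index the dataset
dataset = ["how are you", "who are you", "who am I"]
storage = {s: index_vector(s) for s in dataset}

# Steps [8]-[10]: query, dot products, linear-scan nearest neighbor
query_vec = index_vector("am I you")
scores = {s: sum(q * d for q, d in zip(query_vec, vec)) for s, vec in storage.items()}
best_sentence = max(scores, key=scores.get)
print(best_sentence)
```

With these made-up matrices the linear scan also picks "who am I". A production system would replace the `max` over all scores with an approximate index such as HNSW.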

Tom Yeh

192,022 views • 2 years ago

I created this desk calendar as a source of inspiration for anyone learning AI in 2026. It includes 24 AI algorithms and architectures, all drawn and calculated by hand. ✍️
January: [1] Matrix Multiplication; [2] Discrete Fourier Transform (DFT)
February: [3] Support Vector Machine (SVM); [4] Vector Database
March: [5] Multi-Layer Perceptron (MLP); [6] Backpropagation
April: [7] Batchnorm; [8] Dropout
May: [9] Recurrent Neural Network (RNN); [10] Long Short-Term Memory (LSTM)
June: [11] Residual Network (ResNet); [12] Graph Convolutional Network (GCN)
July: [13] Autoencoder; [14] Variational Autoencoder (VAE)
August: [15] Generative Adversarial Network (GAN); [16] U-Net
September: [17] Transformer; [18] Self Attention
October: [19] Reinforcement Learning from Human Feedback (RLHF); [20] Contrastive Language-Image Pre-training (CLIP)
November: [21] Diffusion Transformer; [22] Switch Transformer
December: [23] Sparse Autoencoder; [24] BitNet

Tom Yeh

14,246 views • 8 months ago

Full Fine-tuning vs. Freezing Layers.
== Full Fine-tuning ==
A real network has many layers: three in this example, billions of parameters in a production model. What does fine-tuning look like when you update all of them?
That’s full fine-tuning: continue training every weight in the pretrained network on your new task. Every layer’s W gets its own ΔW. Nothing is frozen; every parameter is in play.
Think of an MLP as a chain of prerequisites leading to an advanced course. Layer 1 might be Linear Algebra, layer 2 Probability, layer 3 Advanced Machine Learning, each one building on what came before. Fine-tuning is what happens during graduate study: the foundations are already there from undergrad, so you’re not learning them from scratch. Full fine-tuning is reviewing every prerequisite to see what new topics have appeared and what discoveries the field has made since the last time you sat through them. Effective, but exhausting.
This diagram shows the same three-layer MLP twice, side by side. On the left, the pretrained network runs on input X: three weight matrices W1, W2, W3, each followed by a ReLU activation.
Full fine-tuning gives the model the most freedom to specialize. Every parameter can move, and every parameter that can move must be stored.
But not every prerequisite needs revisiting. The further you go back in the chain, the less the material has changed since pretraining: the linear-algebra basics under your computer-vision course are largely the same as they ever were. The next page does exactly that: freeze the prerequisites that haven’t moved, and only refresh the advanced one closest to your specialization.
== Freezing Layers ==
Full fine-tuning reviewed every prerequisite (Linear Algebra, Probability, Advanced ML) to refresh each subject with the latest topics. Effective, but exhausting.
Then you realize something. The prerequisites haven’t actually changed that much. Linear Algebra is still Linear Algebra; the matrix decompositions you learned still hold. Probability is still Probability; the distributions and Bayes’ rule haven’t moved. Almost all the new material, the new ideas and the recent discoveries, lives in the advanced layer at the top.
That’s freezing layers: keep the prerequisite layers fixed at their pretrained state, and only update the advanced one. In the diagram below, W1 and W2, the foundational prerequisites, stay frozen. Only W3, the layer closest to your task-specific output, gets a ΔW.
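
The difference between the two regimes comes down to which weight matrices receive a ΔW during the update. The sketch below is illustrative only: the 2×2 weights, the all-ones gradients, the learning rate, and the helper names `sgd_step` and `count_trainable` are assumptions, and a real framework would compute the gradients by backpropagation.

```python
def sgd_step(weights, grads, trainable, lr=0.1):
    """Apply W <- W - lr * grad only to the layers named in `trainable`."""
    updated = {}
    for name, W in weights.items():
        if name in trainable:
            updated[name] = [[w - lr * g for w, g in zip(w_row, g_row)]
                             for w_row, g_row in zip(W, grads[name])]
        else:
            updated[name] = [row[:] for row in W]   # frozen: copied unchanged
    return updated

def count_trainable(weights, trainable):
    """Number of scalar parameters that can move (and must be stored)."""
    return sum(len(row) for name in trainable for row in weights[name])

# Hypothetical pretrained weights and gradients for a three-layer MLP
weights = {name: [[1.0, 2.0], [3.0, 4.0]] for name in ("W1", "W2", "W3")}
grads = {name: [[1.0, 1.0], [1.0, 1.0]] for name in weights}

full = sgd_step(weights, grads, trainable={"W1", "W2", "W3"})   # full fine-tuning
frozen = sgd_step(weights, grads, trainable={"W3"})             # freeze W1 and W2

print(count_trainable(weights, {"W1", "W2", "W3"}), count_trainable(weights, {"W3"}))
```

In this toy setup full fine-tuning moves 12 parameters while freezing W1 and W2 moves only 4, which is the storage and compute saving the analogy describes.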

Tom Yeh

27,587 views • 3 months ago

Breaking news 🗞️ 🚨 Tesla just quietly solved a problem in Australia. The Model X is gone from our market. But the new Model Y L Premium AWD might actually be the closest thing we have to a replacement. And honestly… it makes a lot of sense. ⚡ Tesla Model Y L – Key Specs • 0–100 km/h: ~5.0 sec • Range: ~681 km WLTP • Top speed: 201 km/h • Seating: 6 adults • Supercharging: 250 kW • ~288 km added in 15 min 💰 Australian pricing (before on-road costs) Model Y Long Range AWD → $68,900 Model Y L Premium AWD → ~$74,900 So for roughly $6k more, you get: ✔ 3 rows ✔ 6 seats ✔ Much larger cabin ✔ ~400L extra cargo capacity ✔ Longer wheelbase ✔ Even more range 📊 Quick comparison Model Y Long Range AWD • 5 seats • ~600 km range • 0–100 km/h: 4.8 sec • $68,900 Model Y L Premium AWD • 6 seats • ~681 km range • 0–100 km/h: 5.0 sec • ~$74,900 So performance drops slightly, but practicality goes way up. 🇦🇺 Why this matters in Australia Since Tesla stopped selling the Model X locally, there has been a real gap in the lineup for larger families. The Model Y L doesn’t completely replace the X. You lose things like: ❌ Adaptive air suspension ❌ Driver instrument cluster ❌ Falcon Wing doors ❌ Some luxury interior touches But you still get: ✔ Tesla software ecosystem ✔ Supercharger network ✔ Massive range ✔ Practical 3-row seating And at a much lower price than a Model X ever was. 👨‍👩‍👧‍👦 Who this is perfect for • Growing families • Current Model Y owners needing more space • Former Model X buyers • Anyone considering EV9 / EX90 but wanting Tesla’s ecosystem Personally, as a Model X owner, this is the first Tesla sold in Australia that actually feels like a realistic successor. I’m seriously considering replacing my Model X with the Model Y L, possibly around the end of Q2 or mid-Q3 this year. Not a perfect Model X replacement. But for Australia right now? This might be Tesla’s smartest family vehicle yet. ⚡🇦🇺 ORDER NOW : Tesla Australia & New Zealand Tesla AI

Tesla in the Gong 🇦🇺🦘🤖🚕

21,519 views • 4 months ago

$IREN "we haven't disclosed the specific amount of GPUs"
1. 🤮 Reminds me of $NBIS
2. Sets a terrible precedent for future deals
3. Makes it purposely difficult for analysts to properly value your 2027 revenue
4. Increases the market's polarized view on IREN
However: "approximately 60MW of air-cooled Blackwells"
1. You typically don't talk about gross capacity in a deployment like this.
2. If it were gross capacity, the GPU hour rate at IT level would be crazy high (at PUE 1.2, 60MW gross is 50MW of IT load, and $680m / 50 = $13.6m/MW).
3. At 60MW IT load and ~14kW draw per DGX server, we get to ~4,286 DGX systems with 8 GPUs each.
4. Based on this, 60MW of IT load can run approximately 34k B300 GPUs in DGX systems.
5. 34k B300s at $680m/yr would represent a GPU hour price of $2.28.
Now this is the problem with not disclosing your GPU quantity. You purposely make your business model look bad, because by this approach you get to a GPU hour price that would imply a payback period of 4 years, where only the last year of the contract is 100% margin.
But of course, we can also take "the glass is half full" approach.
IREN has ordered 50K B300s from Dell. They have 2 purchase orders for this: 1 between Dell Canada and IE CA Leasing Ltd for 4 phases, and 1 between Dell USA and IE US Hardware 1 Inc (amended from IE US Hardware 4 Inc on April 27, 2026).
The order for Canada is divided into 4 phases and is going to Mackenzie for 80MW of gross capacity, which happens to be 4 buildings of 20MW. The order for Childress is divided into 2 phases and is going to DC35 and DC36 (as depicted in the earnings presentation), and those are 50MW gross.
The purchase price of the order for Childress was $1.2B, and for Canada it was $2.3B.
If we go with 50,000 B300s for a total of $3.5B, then $1.2B would represent 34.285% of the 50,000 GPUs, or 17,140 B300s rounded down. For this calculation I will consider that $IREN will deploy 17,140 GPUs in 50MW gross capacity in DC35 and DC36 of block 3 in Childress.
That would imply that at 1.2 PUE, IREN can run 17,140 B300s in 41.67MW of IT load.
Now by that ratio, they can run 24,680 GPUs in 60MW of IT load, a massive difference from the 34k units of the Nvidia DGX reference calculation.
If common sense is applied, you can still get to 2 completely different outcomes that show a difference of more than 9k GPUs.
The GPU hour rate at 24.68k GPUs would be $3.145 per B300, a MASSIVE difference from the earlier calculated $2.28. Sure, the DGX system may be a factor here. And I'm sure that the reality is somewhere in the middle.
But I personally hate this as an investor, being unable to calculate profitability on a unit economic basis. After all, contracts are signed on a $/GPU hour basis. Why hide this from your investors? We cannot calculate payback periods or ROIC. And most importantly, we cannot properly assess the $NVDA deal on a contract basis.
I really hope the payback period of this contract is not 4 years. I want the glass to be half full, but by starting to censor the purchases, IREN is taking a step in the wrong direction. Not a fan of this.
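
The two estimates can be reproduced with a few lines of arithmetic. This is a sketch of the post's reasoning only: every input is a figure quoted above (the $680m/yr, 14kW per DGX, PUE 1.2, and purchase-order amounts are the author's numbers, not verified data), and the variable and function names are made up for illustration.

```python
HOURS_PER_YEAR = 24 * 365          # 8,760 hours
ANNUAL_REVENUE_USD = 680e6         # contract value per year, as quoted
IT_LOAD_MW = 60                    # "approximately 60MW", read as IT load

def gpu_hour_price(annual_revenue_usd, gpu_count):
    """Revenue per GPU per hour, assuming every GPU is billed all year."""
    return annual_revenue_usd / (gpu_count * HOURS_PER_YEAR)

# Approach 1: DGX reference (about 14 kW per 8-GPU server)
dgx_systems = IT_LOAD_MW * 1000 / 14
gpus_dgx = dgx_systems * 8
price_dgx = gpu_hour_price(ANNUAL_REVENUE_USD, gpus_dgx)

# Approach 2: scale from the Childress purchase order
childress_share = 1.2 / 3.5                        # $1.2B of $3.5B total
childress_gpus = 50_000 * childress_share          # share of the 50K B300 order
childress_it_mw = 50 / 1.2                         # 50MW gross at PUE 1.2
gpus_po = childress_gpus / childress_it_mw * IT_LOAD_MW
price_po = gpu_hour_price(ANNUAL_REVENUE_USD, gpus_po)

print(round(gpus_dgx), round(price_dgx, 2))
print(round(gpus_po), round(price_po, 2))
```

Without the intermediate rounding used in the post, the DGX approach gives about $2.26 per GPU hour and the purchase-order approach about $3.14, so the gap the author describes holds either way.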

Frans Bakker

148,167 views • 2 months ago

Is Using a Nebulizer with Food Grade Hydrogen Peroxide preventive and healing for upper respiratory symptoms? The earlier you get this in the better, so if you start feeling run down or stuffy take action sooner. ***THIS IS NOT MEDICAL ADVICE. Not only can nebulizing save you money, but also time from healing and potentially from the damage/side effects that antibiotics can have. I used a nasal mist sprayer with a 50/50 mix of 3% HP to distilled water and ate raw chopped garlic too when I had pneumonia and I relieved congestion, coughing or anything chest related. Just aches and pains. I’m all about prevention and using natural medicines first! “The inhalation of HP by nebulization has been shown to be extremely effective for the rapid elimination of any pathogen presence in the sinuses, nose, throat, and deep into the lungs.” -Dr. Levy HOW TO PREPARE HYDROGEN PEROXIDE FOR NEBULIZING 1. Add 2 tsp of 3% food grade hydrogen peroxide to 8 oz of saline water (this makes.1% dilution), if you are starting with 12% hydrogen peroxide, add 1/2 tsp to 8 oz of saline water. 2. Transfer this mix to a glass dropper bottle. 3. Use about 2-3 mL or (1/2 tsp) of this mix for each nebulizing session. 4. You can keep this solution refrigerated for a long time and continue to reuse it. Praying for your health and abundance of goodness 🙌🏼

Cleanse Parasites .com 🧹🪱 Herbal Cleanse Co.

46,711 views • 3 months ago

THE FOUR HORSEMEN STORY DOESN’T MAKE SENSE!!!🚨🚨🚨 Thread🧵

Notice: this is pure speculation and nothing about this is factual information. All the information provided is fully based on what we see/saw on the internet. Please RT for awareness.

First of all, I would just like to say that A-Reece has no reason not to answer his phone for a whole month, especially from a guy that he worked with before and has a good working relationship with. Something just doesn’t feel right. Get your popcorn🍿

1. This first started when Nasty C leaked or hinted at being on the same song as A-Reece. Note that Nasty C specifically mentioned that Stogie T asked him to do a hook for him and says he will work on the verse in the meantime, meaning that he was initially asked to do both a hook and a verse. Also note that the date of this interview is back in May, meaning that the song had been in the making for a while. BET!

2. Stogie T finally talks about the song after it comes out on a radio show. He clearly states that he sent them both the beat, and with Nasty C he talks about a hook, while with A-Reece he is clearly referring to a verse/bars since he says “things that Slimes know him for”. BET!

3. L Tido invites MAGGZ on his podcast and he clearly indicates that MAGGZ is “GONNA” (note that word) have the best verse on the song. During this time L Tido only knew that Stogie T, A-Reece and MAGGZ would have verses while Nasty C is on the hook, and I will prove that on number 4. So by this time L Tido already underestimates Stogie T’s and A-Reece’s pen.

4. After the song drops, L Tido’s tone changes from MAGGZ to Nasty C having the best verse. He clearly indicates that Nasty C didn’t have a verse and was initially supposed to be on only the hook, and indicates again that Nasty C sent his verse a week before the song was released. The song was released on the 28th of November, meaning that Nasty C sent his verse during the week of the 16th - 22nd. This is 4 proofs that Nasty C was asked for a hook.

Note and pay proper attention as this leads into 5 down below…

theboyjay

34,532 views • 6 months ago

10 reasons why I am always happy! 😃
1) I am beautiful.
2) I am hardworking.
3) I am intelligent.
4) I am caramel color.
5) I am AA and O+.
6) I have a natural body (no surgery).
7) I am a physicist that loves maths.
8) I am 💯 healthy (no STDs or terminal disease, etc.).
9) I am no baby mama to any dude I am not married to.
10) I always believe in myself no matter how people see me or things.
Lastly, let me add this… I am highly favored and protected divinely… Please note! This is not to market myself in any way because people already can see this. I am just stating the reason I always wear a smile 😊

Chioma Oji

22,332 views • 10 months ago

1 Neural Network + Obsidian + Karpathy’s 1-file method = the most unhinged second brain build of 2026. It remembers everything you’ve ever done, and it costs $0 on top of what you already pay. The base is Karpathy’s append and review: 1 giant note, new thoughts stack on top, old ones sink, every few days you reread and pull the survivors back up. No folders, no tags, no plugins: the rereading IS the system, because review is what turns storage into thinking. The flaw: past 10,000 lines, no human rereads anything. That’s where the neural network takes over. You keep the note in Obsidian: 1 vault, everything dumps to the top: ideas, links, meeting fragments, half-thoughts. You never organize, you only dump. It all lives as plain markdown on your own disk, and that detail is the whole trick. Because now you point Claude Code at the vault folder, and it reads every line you’ve ever written. “What did I think about pricing in March.” “Find the 3 ideas I keep circling.” “What did I drop that deserves a second look.” It answers from YOUR notes, with quotes, in 15 seconds. Then once a week, 1 prompt closes the loop: read the last 7 days, surface the 5 entries worth pulling back up, flag anything that contradicts what I wrote a month ago. The model does the sinking and surfacing Karpathy did by hand, and the note stays alive instead of turning into a graveyard. Week 1 feels like nothing. Week 4 you hit the first “I already solved this in January.” Month 3 you consult your past self more than Google. Most second brains die in 11 days under 40 plugins and 200 folders. This one is 1 file and a loop, and it compounds because dumping takes 0 discipline. Notion stores what you thought. This thing argues back.
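The "dump to the top, review the last 7 days" loop described above could be sketched like this. Everything here is an assumption made for illustration: the file name `notes.md`, the `- YYYY-MM-DD HH:MM` entry prefix, and the helper names `dump` and `last_days` do not come from the post, and the step where a model reads the selected entries is deliberately left out.

```python
import tempfile
from datetime import datetime, timedelta
from pathlib import Path
from typing import List, Optional

STAMP_FORMAT = "%Y-%m-%d %H:%M"  # hypothetical entry prefix, not from the post


def dump(note: Path, thought: str, now: Optional[datetime] = None) -> None:
    """Prepend a timestamped entry so the newest thought sits on top."""
    stamp = (now or datetime.now()).strftime(STAMP_FORMAT)
    entry = f"- {stamp} {thought.strip()}\n"
    older = note.read_text(encoding="utf-8") if note.exists() else ""
    note.write_text(entry + older, encoding="utf-8")


def last_days(note: Path, days: int, now: Optional[datetime] = None) -> List[str]:
    """Return stamped entries written within the last `days` days, newest first."""
    cutoff = (now or datetime.now()) - timedelta(days=days)
    kept = []
    for line in note.read_text(encoding="utf-8").splitlines():
        try:
            stamp = datetime.strptime(line[2:18], STAMP_FORMAT)
        except ValueError:
            continue  # skip lines that are not stamped entries
        if stamp >= cutoff:
            kept.append(line)
    return kept


# Usage with a throwaway folder standing in for the vault.
with tempfile.TemporaryDirectory() as vault:
    note = Path(vault) / "notes.md"
    dump(note, "pricing idea: annual plan", now=datetime(2026, 3, 1, 9, 0))
    dump(note, "meeting fragment", now=datetime(2026, 3, 20, 14, 30))
    first_line = note.read_text(encoding="utf-8").splitlines()[0]
    weekly = last_days(note, 7, now=datetime(2026, 3, 21, 8, 0))

print(first_line)
print(weekly)
```

Rewriting the whole file on every dump is the simplest way to keep new entries on top; for a very large note, appending to the end and reading the file in reverse would be cheaper, at the cost of breaking the "newest first" layout on disk.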

West Lord

24,679 views • 15 days ago

Model-Free Reinforcement Learning (MFRL) has been alluring, especially with supercharged compute from physics on GPU. However, these methods use 0-th order gradients and are often not the best optimizers. Can we do better than PPO in continuous control for robotics? Turns out yes! 🥳 tl;dr: Faster, better RL than PPO in continuous control 💪 The answer lies in using more information from the simulation. We are juicing the simulation on GPU as it is, so why not use it for gradients as well? This has been a driving question in a series of our works. We first studied this problem in our ICLR 2022 paper on Short Horizon Actor Critic (SHAC). Naive gradient-based methods get stuck in local minima and have exploding/vanishing gradients. SHAC solved this problem with truncated rollouts and model-based value estimation, where the model is a differentiable simulator. This boosted sample efficiency and wall-clock time immensely, especially in high-dimensional systems such as humanoids. Yet, given enough compute, PPO often caught up. Our follow-up paper on Adaptive Horizon Actor Critic (AHAC) at ICML 2024 discovers the cause and provides a fix. We find that even when given ground-truth dynamics, not all gradients are useful due to sample error. 1st-Order Model-Based Reinforcement Learning methods employing differentiable simulation provide gradients with reduced variance but are susceptible to bias in scenarios involving stiff dynamics, such as physical contact. We find that back-propagating through contact and long trajectories drastically reduces gradient accuracy. Using this insight, we propose AHAC to dynamically adapt its roll-out horizon to avoid differentiating through stiff contact. AHAC is a first-order model-based RL algorithm that learns high-dimensional tasks in minutes (wall clock) and outperforms PPO by 40%, even in the limit of data provided to PPO.
This work is led by Ignat Georgiev alongside Krishnan Srinivasan, Jie Xu, and Eric Heiden, with ample assistance from the Warp team at NVIDIA Robotics (Miles Macklin).
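To make the 0-th order versus 1st-order distinction concrete, here is a toy sketch in plain Python. It is not PPO, SHAC, or AHAC: the quadratic objective, the random-perturbation estimator, and all names are stand-ins chosen for illustration. The first-order gradient is read off analytically (as a differentiable simulator would provide it), while the zeroth-order estimate only queries function values and needs many samples to get close.

```python
import random
from typing import List


def f(x: List[float]) -> float:
    """Toy smooth objective standing in for a simulated return."""
    return sum(v * v for v in x)


def first_order_grad(x: List[float]) -> List[float]:
    """Exact gradient of f, as a differentiable model would supply it."""
    return [2.0 * v for v in x]


def zeroth_order_grad(x: List[float], sigma: float = 0.1,
                      samples: int = 20000, seed: int = 0) -> List[float]:
    """Estimate the gradient from function values only (antithetic perturbations)."""
    rng = random.Random(seed)
    est = [0.0] * len(x)
    for _ in range(samples):
        eps = [rng.gauss(0.0, 1.0) for _ in x]
        plus = f([v + sigma * e for v, e in zip(x, eps)])
        minus = f([v - sigma * e for v, e in zip(x, eps)])
        scale = (plus - minus) / (2.0 * sigma)
        for i, e in enumerate(eps):
            est[i] += scale * e / samples
    return est


x = [1.0, -2.0, 0.5]
exact = first_order_grad(x)
estimate = zeroth_order_grad(x)
max_err = max(abs(a - b) for a, b in zip(exact, estimate))

print("first-order :", exact)
print("zeroth-order:", [round(v, 3) for v in estimate])
print("max abs error:", round(max_err, 3))
```

The sampled estimate is unbiased on this smooth function but noisy. The post's point runs in both directions: first-order gradients remove that sampling noise, yet they can become biased when the dynamics are stiff, which a smooth toy objective like this one cannot show.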

Animesh Garg

52,300 views • 2 years ago

Part Two (2/2): In order to graze or pierce the top of the ear, the round would have had to be fired from directly in front or behind, not at a perpendicular angle. First, you can know this is part of the plan by Laws and Orders and how this sets up the finale. Second, you can read articles of the bullet whizzing by President Trump at the angle and trajectory reported by CNN… So, where did that bullet thump in the same time sequence when the other two bullets hit their targets sooner at the same angle and trajectory? Also, CNN says the snipers returned fire… There were 5 rounds in return, not in Burst Rounds, all single 1-1-1-1-1. The return fire by the “snipers” sounds exactly the same as the first 3 rounds fired at PDJT. And they sound NOTHING like an AR-15 whatsoever. Then there’s one lone round 11 seconds after the 5 rounds returned by “snipers” who are all around and had the target identified, as the media says “they were told by Secret Service to hold their fire until the shooter engaged”… PLUS those on the roof behind President Trump were closer to the location of the gunman than PDJT, so, 350-450 feet away. So, are you telling me that trained snipers needed more than one round for a lone shooter? Are you telling me that trained snipers with an identified target were told to “stand down” until engagement? Are you telling me that trained snipers with an identified target allowed a lone round 11 seconds after return engagement? And/or needed another round 11 seconds after 5 returned on a lone wolf shooter? You clearly know NOTHING about snipers if you say ‘yes’ on one, a combo, or all of those. That alone doesn’t even need to follow up with the location, sound, angle, trajectory, but just because… A perpendicular shot, given the evidence… wouldn’t pierce the top of the ear. This was ALL for those who have no clue what’s going on by Laws and Orders the past 7.5 years of a Military Occupation and COOP.
This sets up the grand finale of ALL the evidence that will be brought against the dudes going to GITMO for the normies. Normies and “so-called Patriots” who reacted to this: stop listening to people who don’t know the Plan which is a World Special Operation with multi-faceted layers of operations taking place all outlined in Military and Federal Laws and Orders that clearly outline and define a Military Occupation and Continuity of Operations Plan. Which means stop listening to people who just want some likes, shares, and to “be the face” that you run to when it’s all about them and not an Oath. The evidence on this is clear and in order to prove it you needed: Location Angle Trajectory Velocity Weapon Style Caliber The same as “politics” requires: Laws Orders Acts Bills Codes Or you just have a lot of 🗣️💨💩 Trust the Plan, we are on the home stretch 💯🐂🇺🇸

Derek Johnson

271,656 views • 2 years ago