New short course: Attention in Transformers: Concepts and Code in PyTorch. Last week we released a course on how LLM transformers work. This week, go deeper and learn about the technical ideas behind the attention mechanism, and see how to code it in PyTorch. This course is built with...

132,135 views • 1 year ago • via X (Twitter)

8 Comments

Tenkaizen • 1 year ago

Always diving deeper, I see

RedDeer.Games IR • 1 year ago

🌈✨ Transform playtime into learning time with The Smurfs on Nintendo Switch! 🎓🎮 Perfect for curious minds. Buy now: The Smurfs: Learn and Play #Smurfs #NintendoSwitch

offwiththegridpartdeux • 1 year ago

this is great. also, today i learned my neural net associates the ceo of statquest with stephen dorff as deacon frost in the 1998 movie blade

jay_13 • 1 year ago

Thank you so much @AndrewYNg and @joshuastarmer

Data & Analytics • 1 year ago

@AndrewYNg, this course sounds like a fantastic opportunity to deepen our understanding of LLM transformers! Mastering the attention mechanism is essential for any aspiring developer. Happy learning! 🌟 #TechEducation

Malik Hasnain • 1 year ago

Looks fascinating! Breaking down attention mechanisms with code sounds like a great way to truly grasp the concept. Definitely checking it out!

M IQBAL • 1 year ago

Context modeling is more than just a feature in transformers; it serves as the foundation for modern AI scalability

Glitch Elvis • 1 year ago

Keeping track of attention now? Sounds captivating!

Related Videos

Announcing How Transformer LLMs Work, created with Jay Alammar and Maarten Grootendorst, co-authors of the beautifully illustrated book, “Hands-On Large Language Models.” This course offers a deep dive into the inner workings of the transformer architecture that powers large language models (LLMs).

The transformer architecture revolutionized generative AI; in fact, the "GPT" in ChatGPT stands for "Generative Pre-Trained Transformer." Originally introduced in the Google Brain team's groundbreaking 2017 paper "Attention Is All You Need," by Vaswani and others, transformers were a highly scalable model for machine translation tasks. Variants of this architecture now power today’s LLMs such as those from OpenAI, Google, Meta, Cohere, Anthropic and DeepSeek.

In this course, you’ll learn in detail how LLMs process text. You'll also work through code examples that illustrate the transformer's individual components. In detail, you’ll learn:
- How the representation of language has evolved, from Bag-of-Words to Word2Vec embeddings to the transformer architecture that captures a word's meaning by taking into account the context of other words in the input.
- How inputs are broken down into tokens before they are sent to the language model.
- The details of a transformer's main stages: tokenization and embedding, the stack of transformer blocks, and the language model head.
- The inner workings of the transformer block, including attention, which calculates relevance scores, and the feedforward layer, which incorporates stored information learned in training.
- How cached calculations make transformers faster.
- Some of the most recent ideas in the latest models, such as Mixture-of-Experts (MoE), which uses multiple sub-models and a router on each layer to improve the quality of LLMs.

By the end of this course, you’ll have a deep understanding of how LLMs actually process text and be able to read through papers describing the latest models and understand the details. Gaining this intuition will improve your approach to building LLM applications. Please sign up here:

Andrew Ng

252,150 views • 1 year ago
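
The course outline above mentions that cached calculations make transformers faster. Assuming this refers to caching the keys and values of earlier tokens during token-by-token generation (often called a KV cache), the idea can be sketched in NumPy as follows. The weight matrices, dimensions, and random "embeddings" are all hypothetical stand-ins, not taken from the course.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability; the result is unchanged.
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(q, K, V):
    """Attention output for one query vector q over the rows of K and V."""
    weights = softmax(K @ q / np.sqrt(q.shape[0]))
    return weights @ V

rng = np.random.default_rng(0)
d = 4                                    # hypothetical model dimension
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
tokens = rng.normal(size=(5, d))         # stand-in embeddings for 5 tokens

# Without a cache: re-project the whole prefix at every generation step.
no_cache = []
for t in range(1, len(tokens) + 1):
    prefix = tokens[:t]
    K, V = prefix @ W_k, prefix @ W_v
    no_cache.append(attend(prefix[-1] @ W_q, K, V))

# With a cache: project only the newest token and append it.
K_cache = np.empty((0, d))
V_cache = np.empty((0, d))
with_cache = []
for x in tokens:
    K_cache = np.vstack([K_cache, x @ W_k])
    V_cache = np.vstack([V_cache, x @ W_v])
    with_cache.append(attend(x @ W_q, K_cache, V_cache))

print("outputs match:", np.allclose(no_cache, with_cache))
```

Both loops produce the same attention outputs; the cached version avoids recomputing key and value projections for tokens it has already seen.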

[Self-Attention] by Hand ✍️

Self-attention is what enables LLMs to understand context. How does it work? This exercise demonstrates how to calculate a 6-3 attention head by hand. Note that if we have two instances of this, we get 6-6 attention (i.e., multi-head attention, n=2).

-- 𝗚𝗼𝗮𝗹 --
Transform [6D Features 🟧] to [3D Attention Weighted Features 🟦]

-- 𝗪𝗮𝗹𝗸𝘁𝗵𝗿𝗼𝘂𝗴𝗵 --

[1] Given
↳ A set of 4 feature vectors (6-D): x1, x2, x3, x4

[2] Query, Key, Value
↳ Multiply the features x with the linear transformation matrices WQ, WK, and WV to obtain query vectors (q1, q2, q3, q4), key vectors (k1, k2, k3, k4), and value vectors (v1, v2, v3, v4).
↳ "Self" refers to the fact that both queries and keys are derived from the same set of features.

[3] 🟪 Prepare for MatMul
↳ Copy query vectors
↳ Copy the transpose of key vectors

[4] 🟪 MatMul
↳ Multiply K^T and Q
↳ This is equivalent to taking the dot product between every pair of query and key vectors.
↳ The purpose is to use the dot product as an estimate of the "matching score" between every query-key pair.
↳ This estimate makes sense because the dot product is the numerator of the cosine similarity between two vectors.

[5] 🟨 Scale
↳ Divide each element by the square root of dk, which is the dimension of the key vectors (dk=3).
↳ The purpose is to normalize the impact of dk on the matching scores, even if we scale dk to 32, 64, or 128.
↳ To simplify hand calculation, we approximate [ □/sqrt(3) ] with [ floor(□/2) ].

[6] 🟩 Softmax: e^x
↳ Raise e to the power of the number in each cell
↳ To simplify hand calculation, we approximate e^□ with 3^□.

[7] 🟩 Softmax: ∑
↳ Sum across each column

[8] 🟩 Softmax: 1 / sum
↳ For each column, divide each element by the column sum
↳ The purpose is to normalize each column so that the numbers sum to 1. In other words, each column is a probability distribution of attention, and we have four of them.
↳ The result is the Attention Weight Matrix (A) (yellow)

[9] 🟦 MatMul
↳ Multiply the value vectors (Vs) with the Attention Weight Matrix (A)
↳ The results are the attention weighted features Zs.
↳ They are fed to the position-wise feed forward network in the next layer.

Tom Yeh

101,010 views • 2 years ago
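
The by-hand self-attention walkthrough above can be sketched in NumPy. This is a minimal illustration, not the post author's code: it keeps the walkthrough's column layout (feature vectors are columns, and softmax runs down each column), but uses the exact sqrt(dk) and e^x rather than the floor(□/2) and 3^□ shortcuts. The input features and weight matrices are random placeholders, and the function name `self_attention_head` is made up for this example.

```python
import numpy as np

def self_attention_head(X, W_Q, W_K, W_V):
    """One attention head with feature vectors stored as columns.

    X is (d_in, n); each W_* is (d_k, d_in).
    Returns Z (d_k, n) and the attention weight matrix A (n, n).
    """
    # [2] Query, Key, Value: linear projections of the same features.
    Q, K, V = W_Q @ X, W_K @ X, W_V @ X
    d_k = K.shape[0]
    # [3]-[5] MatMul and scale: scores[i, j] = (k_i . q_j) / sqrt(d_k)
    scores = (K.T @ Q) / np.sqrt(d_k)
    # [6]-[8] Column-wise softmax (max subtracted for numerical stability).
    E = np.exp(scores - scores.max(axis=0, keepdims=True))
    A = E / E.sum(axis=0, keepdims=True)
    # [9] MatMul: attention-weighted combination of the value vectors.
    Z = V @ A
    return Z, A

rng = np.random.default_rng(0)
# [1] Given: 4 feature vectors, each 6-D (placeholder values).
X = rng.integers(0, 3, size=(6, 4)).astype(float)

def random_weights():
    # Placeholder 3x6 projection matrices for one 6-3 head.
    return [rng.integers(-1, 2, size=(3, 6)).astype(float) for _ in range(3)]

Z1, A1 = self_attention_head(X, *random_weights())
Z2, A2 = self_attention_head(X, *random_weights())

# Two 6-3 heads stacked give 6-D outputs (multi-head attention, n=2).
Z_multi = np.vstack([Z1, Z2])
print(Z1.shape, A1.shape, Z_multi.shape)
```

Each column of `A1` sums to 1, matching step [8], and stacking the two heads yields a 6-by-4 output, matching the note that two 6-3 heads give 6-6 attention.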

New short course: Build Long-Context AI Apps with Jamba. Learn about state space models (SSMs), which have emerged as an alternative to transformers! Specifically, Jamba is a hybrid transformer-Mamba architecture that combines strengths of the transformer with ideas from SSMs. This course is built with AI21 Labs and taught by Chen Wang and Chen Almagor.

The transformer architecture is computationally expensive when handling very long input contexts. But there's an alternative called Mamba, a selective state space model that can process very long contexts with a much lower computational cost. However, researchers found that the pure Mamba architecture underperforms in understanding the context, and gives lower-quality responses. To overcome this, AI21 developed the Jamba model, which combines Mamba's computational efficiency with the transformer's attention mechanism to help with the output quality.

In this course, you’ll learn about how state space models, and Jamba, work. You’ll also learn how to prompt Jamba, use it to process long documents, and build long-context RAG apps.
- Learn how Jamba combines transformer and state space model architectures to achieve high performance and quality.
- Use the AI21 SDK, with an example of prompting over a large 200k-token annual financial report of Nvidia.
- Use Jamba for tool-calling, with hands-on examples from calling simple arithmetic calculations to a function that returns quarterly company financial reports.
- Learn how training for long context is done, and the metrics used for its evaluation.
- Create a RAG app using the AI21 Conversational RAG tool and build your own RAG pipeline that uses Jamba and LangChain.

By the end of this course, you'll learn how to build applications that can handle context as long as an entire book. Please sign up here:

Andrew Ng

77,792 views • 1 year ago