New short course: Attention in Transformers: Concepts and Code in PyTorch.

Last week we released a course on how LLM transformers work. This week, go deeper and learn about the technical ideas behind the attention mechanism, and see how to code it in PyTorch. This course is built with... Joshua Starmer, Founder and CEO of StatQuest.

The attention mechanism was a breakthrough that led to transformers, the architecture powering large language models like ChatGPT. Transformers, introduced in the 2017 paper "Attention Is All You Need" by Vaswani and others, took off because of their highly scalable design.

In this course, you’ll learn how the attention mechanism, a key element of transformer-based LLMs, works, and you'll implement it in PyTorch. You'll develop deep intuition about building reliable, functional, and scalable AI applications.

What you will do:
- Understand the evolution of the attention mechanism, a key breakthrough that led to transformers.
- Learn the relationships between word embeddings, positional embeddings, and attention.
- Learn about the Query, Key, and Value matrices, and how to produce and use them in attention.
- Walk through the math required to calculate self-attention and masked self-attention to learn why and how they work.
- Understand the difference between self-attention and masked self-attention, and how one is used in the encoder to build context-aware embeddings and the other is used in the decoder for generative outputs.
- Learn the details of the encoder-decoder architecture, cross-attention, and multi-head attention, and how they are all incorporated into a transformer.
- Use PyTorch to code a class that implements self-attention, masked self-attention, and multi-head attention.

There are lots of exciting technical details in this course. Please sign up here:

Andrew Ng

1,745,050 subscribers

132,285 views • 1 year ago • via X (Twitter)

Education Science & Technology
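
The course description above contrasts self-attention with masked self-attention. As a rough illustration of that one difference, here is a minimal sketch in plain Python (not PyTorch, and not taken from the course). The function names and the score matrix are made up for this example. It assumes the query-key scores have already been computed, and shows only how a causal mask changes the softmax step.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(scores, causal=False):
    """Turn a square matrix of query-key scores into attention weights.

    scores[i][j] is the score of query token i against key token j.
    With causal=True, token i may only attend to tokens j <= i.
    """
    weights = []
    for i, row in enumerate(scores):
        if causal:
            # Masked positions get -inf, so exp() turns them into exactly 0.
            row = [s if j <= i else float("-inf") for j, s in enumerate(row)]
        weights.append(softmax(row))
    return weights

# Hypothetical scores for a 3-token sequence (all equal, so the mask's
# effect is easy to see).
scores = [[1.0, 1.0, 1.0],
          [1.0, 1.0, 1.0],
          [1.0, 1.0, 1.0]]

full = attention_weights(scores)                 # every token sees every token
masked = attention_weights(scores, causal=True)  # token i sees tokens 0..i only
```

In `full`, each row spreads its weight evenly over all three tokens. In `masked`, the first token can only attend to itself and the second splits its weight between the first two tokens, which is the behavior a decoder needs when generating text left to right.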

8 Comments

Tenkaizen • 1 year ago

Always diving deeper, I see

RedDeer.Games IR • 2 years ago

🌈✨ Transform playtime into learning time with The Smurfs on Nintendo Switch! 🎓🎮 Perfect for curious minds. Buy now: The Smurfs: Learn and Play #Smurfs #NintendoSwitch

offwiththegridpartdeux • 1 year ago

this is great. also, today i learned my neural net associates the ceo of statquest with stephen dorff as deacon frost in the 1998 movie blade

jay_13 • 1 year ago

Thank you so much @AndrewYNg and @joshuastarmer

Data & Analytics • 1 year ago

@AndrewYNg, this course sounds like a fantastic opportunity to deepen our understanding of LLM transformers! Mastering the attention mechanism is essential for any aspiring developer. Happy learning! 🌟 #TechEducation

Malik Hasnain • 1 year ago

Looks fascinating! Breaking down attention mechanisms with code sounds like a great way to truly grasp the concept. Definitely checking it out!

M IQBAL • 1 year ago

Context modeling is more than just a feature in transformers; it serves as the foundation for modern AI scalability

Glitch Elvis • 1 year ago

Keeping track of attention now? Sounds captivating!

Related Videos

Last week, we launched "Attention in Transformers: Concepts and Code in PyTorch" instructed by Joshua Starmer! In this course, you'll: ✅ Learn how the attention mechanism in LLMs helps convert base token embeddings into rich context-aware embeddings. ✅ Understand the Query, Key, and Value matrices, what they are for, how to produce them, and how to use them in attention. ✅ Learn the difference between self-attention, masked self-attention, and cross-attention, and how multi-head attention scales the algorithm. 🔗 Enroll for free:

DeepLearning.AI

36,832 views • 1 year ago

Announcing How Transformer LLMs Work, created with Jay Alammar and Maarten Grootendorst, co-authors of the beautifully illustrated book, “Hands-On Large Language Models.”

This course offers a deep dive into the inner workings of the transformer architecture that powers large language models (LLMs). The transformer architecture revolutionized generative AI; in fact, the "GPT" in ChatGPT stands for "Generative Pre-Trained Transformer." Originally introduced in the Google Brain team's groundbreaking 2017 paper "Attention Is All You Need," by Vaswani and others, transformers were a highly scalable model for machine translation tasks. Variants of this architecture now power today’s LLMs such as those from OpenAI, Google, Meta, Cohere, Anthropic and DeepSeek.

In this course, you’ll learn in detail how LLMs process text. You'll also work through code examples that illustrate the transformer's individual components.

In detail, you’ll learn:
- How the representation of language has evolved, from Bag-of-Words to Word2Vec embeddings to the transformer architecture that captures a word's meanings taking into account the context of other words in the input.
- How inputs are broken down into tokens before they are sent to the language model.
- The details of a transformer's main stages: tokenization and embedding, the stack of transformer blocks, and the language model head.
- The inner workings of the transformer block, including attention, which calculates relevance scores, and the feedforward layer, which incorporates stored information learned in training.
- How cached calculations make transformers faster.
- Some of the most recent ideas in the latest models, such as Mixture-of-Experts (MoE), which uses multiple sub-models and a router on each layer to improve the quality of LLMs.

By the end of this course, you’ll have a deep understanding of how LLMs actually process text and be able to read through papers describing the latest models and understand the details. Gaining this intuition will improve your approach to building LLM applications. Please sign up here:

Andrew Ng

259,920 views • 1 year ago

LLMs can make sense of retrieved context because of how transformers work. In one of the lessons from the Retrieval Augmented Generation (RAG) course, we unpack how LLMs process augmented prompts using token embeddings, positional vectors, and multi-head attention. Understanding these internals helps you design more reliable and efficient RAG systems. Watch the breakdown and keep learning how to build production-ready RAG systems in this course, taught by Zain:

DeepLearning.AI

11,500 views • 1 year ago

building large language models from scratch by Sebastian Raschka was a great chance for me to sit down and study again all the LLM basics > token and positional embeddings > self-attention and what QKV is about > causal & multi-head attention studying llms and how they work can seem overwhelming at first. but once you taste how good it feels to learn these things intuitively there's no going back. I shared the resources and my notes on my repo and I hope it's a motivation if you want to start as well or recap.

ℏεsam

49,292 views • 1 year ago

New course: Transformers in Practice.

You'll get a practical view of how transformer-based LLMs work, so you can reason about their behavior, diagnose problems like slow inference, and make smarter decisions about deployment. This course is built in partnership with AMD and taught by Sharon Zhou.

You'll see how transformers generate text one token at a time, how the model decides which earlier words matter most when predicting the next one, and how techniques like quantization speed up inference on GPUs. This is not a video-only course; interactive visualizations throughout let you play with these concepts and build intuition that sticks.

Skills you'll gain:
- Understand why LLMs hallucinate, and how RAG and chain-of-thought shape what they generate
- Look inside the model to see how attention and layers combine to predict the next token
- Diagnose inference bottlenecks and learn the techniques that speed up transformers on GPUs

Join and understand what's really happening inside your LLMs:

Andrew Ng

120,728 views • 2 months ago

The next chapter about transformers is up on YouTube, digging into the attention mechanism: The model works with vectors representing tokens (think words), and this is the mechanism that allows those vectors to take in meaning from context.

Grant Sanderson

810,356 views • 2 years ago

Met a guy making $1.6 million/year as an LLM engineer. I asked him how he learned LLMs from scratch. He sent me the exact video that got him in. A 1 hour course on how LLMs actually work. He shows how transformers inside LLMs like ChatGPT & Claude are actually built. I watched it last night. Halfway through, I realized LLM architecture is way simpler than they make it look. Bookmark this and read the article below. • 00:00 - LLM foundations • 04:21 - LLM tokenization • 05:43 - LLMs vector embeddings • 22:16 - attention mechanism of LLM • 43:42 - LLM multi head attention

Roan

78,783 views • 7 days ago

New video! The attention mechanism is well known for its use in Transformers. But where does it come from? Its origins lie in fixing a strange problem of RNNs. Watch the video to learn about it!

Vivek Verma

59,476 views • 2 years ago

[Self-Attention] by Hand ✍️

Self-attention is what enables LLMs to understand context. How does it work? This exercise demonstrates how to calculate a 6-3 attention head by hand. Note that if we have two instances of this, we get 6-6 attention (i.e., multi-head attention, n=2).

-- 𝗚𝗼𝗮𝗹 --
Transform [6D Features 🟧] to [3D Attention Weighted Features 🟦]

-- 𝗪𝗮𝗹𝗸𝘁𝗵𝗿𝗼𝘂𝗴𝗵 --

[1] Given
↳ A set of 4 feature vectors (6-D): x1, x2, x3, x4

[2] Query, Key, Value
↳ Multiply the features x with the linear transformation matrices WQ, WK, and WV to obtain query vectors (q1, q2, q3, q4), key vectors (k1, k2, k3, k4), and value vectors (v1, v2, v3, v4).
↳ "Self" refers to the fact that both queries and keys are derived from the same set of features.

[3] 🟪 Prepare for MatMul
↳ Copy the query vectors
↳ Copy the transpose of the key vectors

[4] 🟪 MatMul
↳ Multiply K^T and Q
↳ This is equivalent to taking the dot product between every pair of query and key vectors.
↳ The purpose is to use the dot product as an estimate of the "matching score" between every query-key pair.
↳ This estimate makes sense because the dot product is the numerator of the cosine similarity between two vectors.

[5] 🟨 Scale
↳ Divide each element by the square root of dk, the dimension of the key vectors (dk=3).
↳ The purpose is to normalize the impact of dk on the matching scores, even if we scale dk to 32, 64, or 128.
↳ To simplify hand calculation, we approximate [ □/sqrt(3) ] with [ floor(□/2) ].

[6] 🟩 Softmax: e^x
↳ Raise e to the power of the number in each cell
↳ To simplify hand calculation, we approximate e^□ with 3^□.

[7] 🟩 Softmax: ∑
↳ Sum across each column

[8] 🟩 Softmax: 1 / sum
↳ For each column, divide each element by the column sum
↳ The purpose is to normalize each column so that the numbers sum to 1. In other words, each column is a probability distribution of attention, and we have four of them.
↳ The result is the Attention Weight Matrix (A) (yellow)

[9] 🟦 MatMul
↳ Multiply the value vectors (Vs) with the Attention Weight Matrix (A)
↳ The results are the attention weighted features Zs.
↳ They are fed to the position-wise feed forward network in the next layer.

Tom Yeh

101,010 views • 2 years ago
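
The nine steps in the by-hand walkthrough above can be sketched in plain Python. This is an illustrative sketch, not the author's worksheet. The feature values and the three projection matrices are invented for the example, the helper names are hypothetical, and it uses the exact `sqrt(dk)` and `e^x` rather than the hand-calculation shortcuts (`floor(□/2)` and `3^□`). It follows the post's layout, in which each token is a column, so the softmax runs down columns.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]

def softmax_columns(m):
    """Apply softmax down each column so every column sums to 1."""
    cols = []
    for col in transpose(m):
        peak = max(col)                        # subtract the max for stability
        exps = [math.exp(v - peak) for v in col]
        total = sum(exps)
        cols.append([e / total for e in exps])
    return transpose(cols)

def self_attention(x, w_q, w_k, w_v):
    """x holds one feature vector per column; returns (A, Z)."""
    q = matmul(w_q, x)                         # [2] queries, one per column
    k = matmul(w_k, x)                         #     keys
    v = matmul(w_v, x)                         #     values
    scores = matmul(transpose(k), q)           # [3]-[4] scores[i][j] = k_i . q_j
    d_k = len(k)                               # key dimension (3 here)
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]  # [5]
    a = softmax_columns(scaled)                # [6]-[8] attention weights
    z = matmul(v, a)                           # [9] weighted sums of values
    return a, z

# Made-up inputs: 4 feature vectors (6-D) stored as the columns of a 6x4 matrix.
x = [[1, 0, 1, 0],
     [0, 2, 0, 1],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [2, 0, 0, 1],
     [0, 1, 1, 0]]
# Made-up 3x6 projection matrices (6-D features -> 3-D q, k, v).
w_q = [[1, 0, 0, 1, 0, 0], [0, 1, 0, 0, 1, 0], [0, 0, 1, 0, 0, 1]]
w_k = [[0, 1, 0, 0, 0, 1], [1, 0, 0, 0, 1, 0], [0, 0, 1, 1, 0, 0]]
w_v = [[1, 1, 0, 0, 0, 0], [0, 0, 1, 1, 0, 0], [0, 0, 0, 0, 1, 1]]

a, z = self_attention(x, w_q, w_k, w_v)
```

Here `a` is the 4x4 attention weight matrix, with each column a probability distribution, and `z` is the 3x4 matrix of attention-weighted features, one 3-D column per input token. Many libraries store tokens as rows instead, in which case the same computation is written as `softmax(Q K^T / sqrt(dk)) V` with the softmax taken across rows.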

New short course: Build Long-Context AI Apps with Jamba.

Learn about state space models (SSMs), which have emerged as an alternative to transformers! Specifically, Jamba is a hybrid transformer-Mamba architecture that combines strengths of the transformer with ideas from SSMs. This course is built with AI21 Labs and taught by Chen Wang and Chen Almagor.

The transformer architecture is computationally expensive when handling very long input contexts. But there's an alternative called Mamba, a selective state space model that can process very long contexts with a much lower computational cost. However, researchers found that the pure Mamba architecture underperforms in understanding the context, and gives lower-quality responses. To overcome this, AI21 developed the Jamba model, which combines Mamba's computational efficiency with the transformer's attention mechanism to help with the output quality.

In this course, you’ll learn about how state space models, and Jamba, work. You’ll also learn how to prompt Jamba, use it to process long documents, and build long-context RAG apps.

- Learn how Jamba combines transformer and state space model architectures to achieve high performance and quality
- Use the AI21 SDK, with an example of prompting over a large 200k-token annual financial report of Nvidia
- Use Jamba for tool-calling, with hands-on examples from calling simple arithmetic calculations to a function that returns quarterly company financial reports.
- Learn how training for long context is done, and the metrics used for its evaluation
- Create a RAG app using the AI21 Conversational RAG tool and build your own RAG pipeline that uses Jamba and LangChain.

By the end of this course, you'll learn how to build applications that can handle context as long as an entire book. Please sign up here:

Andrew Ng

77,792 views • 1 year ago

The code of GSPN #CVPR2025 is released! We proposed a new sqrt(N) complexity attention mechanism, which enables efficient high resolution image generation. We can generate 8k images with 42x speed up compared to self-attention in StableDiffusionXL! Code: Paper:

Xiaolong Wang

354,887 views • 1 year ago

This is about the attention to detail and the amount of work that this Chiefs team puts in #PMSLive #ChiefsKingdom

Pat McAfee

508,505 views • 1 year ago

“We don’t stop. Get that clear in your mind. We don’t stop…”

An old one from Ange Postecoglou’s time at Celtic…but a good one!

Attention. Intensity. Intent.

It’s always the same. The best coaches coach attention, intensity, and intent. In this case, Coach Ange is directing player attention to the intensity of action…in a tone that signifies a high intent.

It’s always the same… It’s always the same…

My days are spent helping coaches to help players engage, learn, and perform by creating sessions and using behaviours that direct attention, optimise intensity, and heighten intent.

Engagement. Learning. Performance.

It’s the same in the workplace. I get to work with incredibly ambitious people in corporate environments who want to optimise their performance moments by regularly finding their High Performance Mindset through attention, intensity, intent.

In control. In charge.

How about you? Are you in control and in charge of yourself during your performance moments? Are you able to direct your attention appropriately? Are you able to manage your intensity and experience of intent? It’s relevant for all domains, all challenges, all tasks, all sports, all jobs…

Daniel Abrahams

34,460 views • 11 months ago

Zhilin at GTC: Introducing Attention Residuals

Learning selective memory, rather than mechanically accumulating everything, is the beauty of attention. Many of you have probably read Attention Is All You Need, the 2017 Transformer paper that brought “human-like” attention into the model’s field of view. From that point on, models no longer simply read everything in a mechanical way. Instead, they began to develop a sense of what matters more and what matters less across the text, choosing to retain the more important information.

Recently, Kimi applied this idea of attention to the temporal dimension, then rotated it 90 degrees into the model’s depth dimension. This allows the model to have attention not only over time, but also throughout the process of information transmission across layers, giving it a more intelligent way to understand and process information.

Kimi.ai

115,162 views • 4 months ago

Kalman Filtering + Deep Attention Models = LOVE ❤️ #CoRL '23 paper: we identify a link between Kalman filters and neural attention mechanisms. Idea: the Kalman-gain is a form of attention! No need to tweak parameters. Check website and code: ASU Robotics

Heni Ben Amor

19,054 views • 2 years ago

6 Main Modes of Attention + explanations

Can you notice when you are spontaneously in each of these types of attention throughout the week? Most think attention mode 2 is the only way to be concentrated, that is, sustaining engagement with desired phenomena for long periods of time. But actually you can learn to maintain each mode of attention and concentrate in different ways. Knowing this radically opens up one’s potential in meditation practice and access to many more states of mind and insights. 🤯

This animation visually illustrates different ways in which attention can behave, but note that these different modes are relevant across the senses (not just with seeing)!

Roger This

12,801 views • 1 year ago

Attention is a lookup. Each token builds a query, compares it against every key in the sequence, and pulls value vectors weighted by the match. Stack that 96 layers deep and you get a frontier model. Video covers the full pipeline: Q/K/V, attention scores, encoder blocks.

tetsuo

105,518 views • 1 month ago
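
The "attention is a lookup" description above can be made concrete with a small sketch. A dictionary returns exactly one value for an exact key match; attention instead returns a blend of all values, weighted by how well the query matches each key. The function name, the vectors, and the scaling by `sqrt(d_k)` (the usual convention, not stated in the post) are assumptions made for this illustration.

```python
import math

def soft_lookup(query, keys, values):
    """Blend `values`, weighted by how well `query` matches each key."""
    d_k = len(query)
    # Dot-product match score between the query and every key.
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
              for key in keys]
    # Softmax turns the scores into weights that sum to 1.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the value vectors, one output dimension at a time.
    dim = len(values[0])
    blended = [sum(w * v[d] for w, v in zip(weights, values))
               for d in range(dim)]
    return weights, blended

# Made-up keys and values for a 3-token sequence.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

# This query points along the first axis, so it matches the first and
# third keys equally well and the second key poorly.
weights, blended = soft_lookup([4.0, 0.0], keys, values)
```

In a transformer layer, every token issues its own query, so this lookup runs once per token (in practice as a single matrix multiplication), and the blended vectors become the inputs to the rest of the block.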

🐻 “I came to get some attention” 🐨 “I will give you a lot of attention” and then they start the most random dialogue ever because namjoon learnt how to understand taehyung’s language and how to make him happy 🥺

sophie

60,022 views • 10 months ago

Over 4 years into our journey bridging Convolutions and Transformers, we introduce Generalized Neighborhood Attention—Multi-dimensional Sparse Attention at the Speed of Light: A collaboration with the best minds in AI and HPC. 🐝🟩🟧 Georgia Tech Computing NVIDIA

Humphrey Shi

29,163 views • 1 year ago