Last week, we launched "Attention in Transformers: Concepts and Code in PyTorch" instructed by Joshua Starmer! In this course, you'll:
✅ Learn how the attention mechanism in LLMs helps convert base token embeddings into rich context-aware embeddings.
✅ Understand the Query, Key, and Value matrices, what they are for,...

DeepLearning.AI

334,589 subscribers

36,832 views • 1 year ago • via X (Twitter)

Education, Science & Technology

5 comments

ryan yang • 1 year ago

@joshuastarmer attentions key for AI success. many struggle with clear strategies and legacy tech. we help bridge that gap.

CodeRabbit • 1 year ago

AI-first pull request reviewer with context-aware feedback, line-by-line code suggestions, and real-time chat.

Abhivendra Singh • 1 year ago

@joshuastarmer The launch of "Attention in Transformers" is a pivotal moment for anyone diving into deep learning. Understanding attention mechanisms is essential; it's where AI begins to truly understand context.

L8NTLABS • 1 year ago

@joshuastarmer Attention mechanisms in LLMs are a game changer, been experimenting with them in my own projects. The way they help convert base token embeddings into rich context-aware embeddings is pure magic. Definitely going to check out this course, thanks for sharing @DeepLearningAI

Jarsal_Firahel • 1 year ago

@joshuastarmer You mean THE Josh Starmer from StatQuest ?! StaaatQuest 🎵

Related videos

New short course: Attention in Transformers: Concepts and Code in PyTorch.

Last week we released a course on how LLM transformers work. This week, go deeper and learn about the technical ideas behind the attention mechanism, and see how to code it in PyTorch. This course is built with Joshua Starmer, Founder and CEO of StatQuest.

The attention mechanism was a breakthrough that led to transformers, the architecture powering large language models like ChatGPT. Transformers, introduced in the 2017 paper "Attention Is All You Need" by Vaswani and others, took off because of their highly scalable design.

In this course, you’ll learn how the attention mechanism, a key element of transformer-based LLMs, works and implement it in PyTorch. You'll develop deep intuition about building reliable, functional, and scalable AI applications.

What you will do:
- Understand the evolution of the attention mechanism, a key breakthrough that led to transformers.
- Learn the relationships between word embeddings, positional embeddings, and attention.
- Learn about the Query, Key, and Value matrices, and how to produce and use them in attention.
- Walk through the math required to calculate self-attention and masked self-attention to learn why and how they work.
- Understand the difference between self-attention and masked self-attention and how one is used in the encoder to build context-aware embeddings and the other is used in the decoder for generative outputs.
- Learn the details of the encoder-decoder architecture, cross-attention, and multi-head attention and how they are all incorporated into a transformer.
- Use PyTorch to code a class that implements self-attention, masked self-attention, and multi-head attention.

There are lots of exciting technical details in this course. Please sign up here:
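The difference between self-attention and masked self-attention mentioned in the post above can be sketched in a few lines. This is a minimal illustration in plain Python rather than the course's PyTorch code; the score matrix is made up, the function names are hypothetical, and rows are assumed to correspond to queries.

```python
import math

def softmax(row):
    # Subtract the row maximum for numerical stability.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(scores, causal=False):
    """Turn raw query-key scores into attention weights, one row per query.

    With causal=True, position i may only attend to positions j <= i,
    which is the masking used for generative (decoder-style) attention.
    """
    weights = []
    for i, row in enumerate(scores):
        if causal:
            # Future positions get -inf so softmax assigns them zero weight.
            row = [s if j <= i else float("-inf") for j, s in enumerate(row)]
        weights.append(softmax(row))
    return weights

# Hypothetical 3x3 score matrix (assumption, for illustration only).
scores = [
    [1.0, 2.0, 3.0],
    [1.0, 1.0, 1.0],
    [0.5, 0.1, 0.2],
]

full = attention_weights(scores)                 # every token sees every token
masked = attention_weights(scores, causal=True)  # tokens see only the past

for row in masked:
    print([round(w, 3) for w in row])
```

In the masked version the first token can only attend to itself, while the last token still attends to the whole sequence, so its row matches the unmasked result.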

Andrew Ng

132,285 views • 1 year ago

LLMs can make sense of retrieved context because of how transformers work. In one of the lessons from the Retrieval Augmented Generation (RAG) course, we unpack how LLMs process augmented prompts using token embeddings, positional vectors, and multi-head attention. Understanding these internals helps you design more reliable and efficient RAG systems. Watch the breakdown and keep learning how to build production-ready RAG systems in this course, taught by Zain:

DeepLearning.AI

11,500 views • 1 year ago

building large language models from scratch by Sebastian Raschka was a great chance for me to sit down and study again all the LLM basics
> token and positional embeddings
> self-attention and what QKV is about
> causal & multi-head attention

studying llms and how they work can seem overwhelming at first. but once you taste how good it feels to learn these things intuitively there's no going back.

I shared the resources and my notes on my repo and I hope it's a motivation if you want to start as well or recap.

ℏεsam

49,292 views • 1 year ago

Sparse attention (MoBA/NSA) trains faster & beats full attention in key tasks. But we’ve had no idea how they truly work…until now. 🔍 We reverse-engineered them to uncover: - Novel attention patterns - Hidden "attention sinks" - Better performance - And more A 🧵… ~1/8~

Tilde

59,480 views • 1 year ago

[Self-Attention] by Hand ✍️

Self-attention is what enables LLMs to understand context. How does it work? This exercise demonstrates how to calculate a 6-3 attention head by hand. Note that if we have two instances of this, we get 6-6 attention (i.e., multi-head attention, n=2).

-- 𝗚𝗼𝗮𝗹 --
Transform [6D Features 🟧] to [3D Attention Weighted Features 🟦]

-- 𝗪𝗮𝗹𝗸𝘁𝗵𝗿𝗼𝘂𝗴𝗵 --

[1] Given
↳ A set of 4 feature vectors (6-D): x1, x2, x3, x4

[2] Query, Key, Value
↳ Multiply the feature vectors x by the linear transformation matrices WQ, WK, and WV to obtain query vectors (q1, q2, q3, q4), key vectors (k1, k2, k3, k4), and value vectors (v1, v2, v3, v4).
↳ "Self" refers to the fact that both queries and keys are derived from the same set of features.

[3] 🟪 Prepare for MatMul
↳ Copy query vectors
↳ Copy the transpose of key vectors

[4] 🟪 MatMul
↳ Multiply K^T and Q
↳ This is equivalent to taking the dot product between every pair of query and key vectors.
↳ The purpose is to use the dot product as an estimate of the "matching score" between every query-key pair.
↳ This estimate makes sense because the dot product is the numerator of the cosine similarity between two vectors.

[5] 🟨 Scale
↳ Divide each element by the square root of dk, which is the dimension of the key vectors (dk=3).
↳ The purpose is to normalize the impact of dk on matching scores, even if we scale dk to 32, 64, or 128.
↳ To simplify hand calculation, we approximate [ □/sqrt(3) ] with [ floor(□/2) ].

[6] 🟩 Softmax: e^x
↳ Raise e to the power of the number in each cell
↳ To simplify hand calculation, we approximate e^□ with 3^□.

[7] 🟩 Softmax: ∑
↳ Sum across each column

[8] 🟩 Softmax: 1 / sum
↳ For each column, divide each element by the column sum
↳ The purpose is to normalize each column so that the numbers sum to 1. In other words, each column is a probability distribution of attention, and we have four of them.
↳ The result is the Attention Weight Matrix (A) (yellow)

[9] 🟦 MatMul
↳ Multiply the value vectors (Vs) with the Attention Weight Matrix (A)
↳ The results are the attention weighted features Zs.
↳ They are fed to the position-wise feed forward network in the next layer.
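The walkthrough above can be followed in code. The sketch below is an assumption-laden illustration in plain Python: the feature vectors and weight matrices are made-up values, not the numbers from the exercise, and it uses exact division by sqrt(dk) and exact e^x instead of the hand-calculation shortcuts. It also stores one token per row, so the softmax runs across rows where the exercise sums across columns; the two layouts are transposes of each other.

```python
import math

def matmul(a, b):
    # Plain nested-list matrix product: (n x m) times (m x p) -> (n x p).
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(r) for r in zip(*m)]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v):
    q = matmul(x, w_q)                 # step 2: queries, one row per token
    k = matmul(x, w_k)                 # step 2: keys
    v = matmul(x, w_v)                 # step 2: values
    d_k = len(k[0])
    scores = matmul(q, transpose(k))   # steps 3-4: every query-key dot product
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]  # step 5
    weights = [softmax(row) for row in scaled]                      # steps 6-8
    z = matmul(weights, v)             # step 9: attention weighted features
    return weights, z

# Hypothetical inputs: 4 feature vectors (6-D) and 6x3 projection matrices.
x = [
    [1, 0, 1, 0, 0, 0],
    [0, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 0],
    [1, 1, 0, 0, 0, 1],
]
w_q = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
w_k = [[0, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
w_v = [[1, 1, 0], [0, 1, 1], [1, 0, 1], [0, 0, 1], [1, 0, 0], [0, 1, 0]]

weights, z = self_attention(x, w_q, w_k, w_v)
for row in z:
    print([round(val, 3) for val in row])
```

Each of the 4 output rows is a 3-D vector, matching the "6D features to 3D attention weighted features" goal stated in the exercise.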

Tom Yeh

101,010 views • 2 years ago

Met a guy making $1.6 million/year as an LLM engineer. I asked him how he learned LLMs from scratch. He sent me the exact video that got him in. A 1 hour course on how LLMs actually work. He shows how transformers inside LLMs like ChatGPT & Claude are actually built. I watched it last night. Halfway through, I realized LLM architecture is way simpler than they make it look. Bookmark this and read the article below.
• 00:00 - LLM foundations
• 04:21 - LLM tokenization
• 05:43 - LLMs vector embeddings
• 22:16 - attention mechanism of LLM
• 43:42 - LLM multi head attention

Roan

78,709 views • 7 days ago

New course: Transformers in Practice. You'll get a practical view of how transformer-based LLMs work, so you can reason about their behavior, diagnose problems like slow inference, and make smarter decisions about deployment.

This course is built in partnership with AMD and taught by Sharon Zhou. You'll see how transformers generate text one token at a time, how the model decides which earlier words matter most when predicting the next one, and how techniques like quantization speed up inference on GPUs. This is not a video-only course; interactive visualizations throughout let you play with these concepts and build intuition that sticks.

Skills you'll gain:
- Understand why LLMs hallucinate, and how RAG and chain-of-thought shape what they generate
- Look inside the model to see how attention and layers combine to predict the next token
- Diagnose inference bottlenecks and learn the techniques that speed up transformers on GPUs

Join and understand what's really happening inside your LLMs:

Andrew Ng

120,728 views • 2 months ago

Attention is a lookup. Each token builds a query, compares it against every key in the sequence, and pulls value vectors weighted by the match. Stack that 96 layers deep and you get a frontier model. Video covers the full pipeline: Q/K/V, attention scores, encoder blocks.
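The "lookup" framing can be made concrete for a single query. The snippet below is a rough sketch with made-up keys and values and a hypothetical helper name; it omits the usual scaling by the square root of the key dimension to keep the arithmetic easy to follow.

```python
import math

def soft_lookup(query, keys, values):
    """Compare one query against every key and blend values by match strength."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, out

# Hypothetical keys and values (assumption, for illustration only).
keys = [[10.0, 0.0], [0.0, 10.0]]
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

# A query that matches the first key strongly behaves almost like a dict lookup.
sharp_weights, sharp_out = soft_lookup([1.0, 0.0], keys, values)

# A query that matches both keys equally returns an even blend of both values.
blend_weights, blend_out = soft_lookup([1.0, 1.0], keys, values)

print(sharp_out)
print(blend_out)
```

Unlike a dictionary, the lookup never fails on a missing key: every value contributes in proportion to how well its key matches the query.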

tetsuo

105,518 views • 1 month ago

What’s going on in Iran, how did we get here, and why you should pay attention. In collaboration with The Free Press

Elica Le Bon الیکا‌ ل بن

246,477 views • 6 months ago

6⃣ Main Different Modes of Attention + explanations Can you notice when you are spontaneously in each of these types of attention throughout the week? Most think attention mode 2. is the only way to be concentrated = sustaining engagement with desired phenomena for long periods of time. But actually you can learn to maintain each mode of attention and concentrate in different ways - knowing this radically opens up one’s potential in meditation practice and access to many more states of mind and insights. 🤯 ¡This animation visually illustrates different ways in which attention can behave, but note that these different modes are relevant across the senses (not just with seeing)!

Roger This

12,801 views • 1 year ago

Where it all began... $CLOUT started his journey from the bottom and defeated all odds to become the most rich and famous meme in the world. Pay attention to our animated series, to learn how $CLOUT rose to fame!

$CLOUT ON SOL

2,407,357 views • 2 years ago

“We don’t stop. Get that clear in your mind. We don’t stop…” An old one from Ange Postecoglou’s time at Celtic…but a good one!

Attention
Intensity
Intent

It’s always the same. The best coaches coach attention, intensity, and intent. In this case, Coach Ange is directing player attention to the intensity of action…in a tone that signifies a high intent. It’s always the same… It’s always the same…

My days are spent helping coaches to help players engage, learn, and perform by creating sessions and using behaviours that direct attention, optimise intensity, and heighten intent.

Engagement
Learning
Performance

It’s the same in the workplace. I get to work with incredibly ambitious people in corporate environments who want to optimise their performance moments by regularly finding their High Performance Mindset through attention, intensity, intent.

In control
In charge

How about you? Are you in control and in charge of yourself during your performance moments? Are you able to direct your attention appropriately? Are you able to manage your intensity and experience of intent? It’s relevant for all domains, all challenges, all tasks, all sports, all jobs…

Daniel Abrahams

34,460 views • 11 months ago

Last night, I sat next to a man who desperately wanted attention. How did he try to get that attention? By claiming the reason for the wildfires in LA is that California firefighters aren’t white enough.

Congresswoman Jasmine Crockett

2,993,316 views • 1 year ago

How many people have to be murdered in broad daylight for the Liberals and NDP to pay attention⁉️ 🤯🤯🤯 #CallAnElectionNOW

Michelle Leahy Ferreri

34,272 views • 1 year ago

Over 4 years into our journey bridging Convolutions and Transformers, we introduce Generalized Neighborhood Attention—Multi-dimensional Sparse Attention at the Speed of Light: A collaboration with the best minds in AI and HPC. 🐝🟩🟧 Georgia Tech Computing NVIDIA

Humphrey Shi

29,163 views • 1 year ago

Self-Attention by hand ✍️ Excel ~ I designed this exercise for students to practice the QKV math. I also created a medium and a large version to show how the attention matrix grows quadratically as the sequence gets longer. 👇Join the 'AI Math' community. Download xlsx.
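The quadratic growth mentioned in the post follows from the attention matrix holding one score per query-key pair. Here is a small sketch of that arithmetic; the sequence lengths are arbitrary examples, not the sizes of the spreadsheet versions, and the function name is hypothetical.

```python
def attention_matrix_entries(seq_len):
    # One score for every (query, key) pair in a single attention head.
    return seq_len * seq_len

# Arbitrary example lengths (assumption): each doubling quadruples the matrix.
for n in (4, 8, 16, 32):
    print(f"sequence length {n:>2}: {attention_matrix_entries(n):>4} entries")
```

Doubling the sequence length multiplies the number of scores by four, which is why long contexts are expensive for full attention.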

Tom Yeh

125,658 views • 1 year ago