Writing a CUDA kernel requires a shift in mental model. Instead of one fast processor, you manage thousands of tiny threads. Here is the code and the logic explained for Matrix Multiplication.
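The shift in mental model can be sketched without a GPU. What follows is a CPU-side simulation in Python of the usual naive mapping, where each thread computes exactly one element of the output matrix. It is not the code from the video, and it is not real CUDA. The names `matmul_thread`, `launch`, and `BLOCK` are hypothetical and chosen for illustration; only the index arithmetic (`blockIdx * blockDim + threadIdx`) mirrors what a CUDA kernel would do.

```python
# CPU-side sketch of the naive "one thread per output element" matmul mapping.
# Assumption: this only imitates a CUDA launch with nested loops; on a GPU the
# calls to matmul_thread would run in parallel rather than one after another.

BLOCK = 2  # simulated threads per block along each axis (tiny, for illustration)


def matmul_thread(block_idx, thread_idx, A, B, C, M, N, K):
    """Work done by one simulated thread: compute a single element of C."""
    # Same index arithmetic a CUDA kernel would use, e.g.
    # row = blockIdx.y * blockDim.y + threadIdx.y
    row = block_idx[1] * BLOCK + thread_idx[1]
    col = block_idx[0] * BLOCK + thread_idx[0]

    # Bounds guard: the grid is rounded up to whole blocks, so some
    # threads land outside the matrix and must do nothing.
    if row >= M or col >= N:
        return

    # Dot product of row `row` of A with column `col` of B.
    acc = 0.0
    for k in range(K):
        acc += A[row][k] * B[k][col]
    C[row][col] = acc


def launch(A, B):
    """Simulate launching a 2D grid of blocks that covers the M x N output."""
    M, K = len(A), len(A[0])
    N = len(B[0])
    C = [[0.0] * N for _ in range(M)]

    # Ceiling division so the grid covers every output element.
    grid_x = (N + BLOCK - 1) // BLOCK
    grid_y = (M + BLOCK - 1) // BLOCK

    for by in range(grid_y):
        for bx in range(grid_x):
            for ty in range(BLOCK):
                for tx in range(BLOCK):
                    matmul_thread((bx, by), (tx, ty), A, B, C, M, N, K)
    return C
```

Calling `launch(A, B)` with two nested lists returns their product as a new nested list. The point of the sketch is that no thread loops over the whole output: each one derives its own `(row, col)` from its block and thread indices and writes only that element.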

Ashutosh Maheshwari

35,305 subscribers

189,009 views • 7 months ago • via X (Twitter)

Related Videos

Fast matrix multiplication on GPUs has traditionally meant wrestling with threads, shared memory, and low-level hardware details. This webinar explores how NVIDIA’s CUDA Tile model—and its Julia port, cuTile.jl—makes high-performance GPU programming more accessible. Join Dr. Andy Terrel of NVIDIA and Dr. Tim Besard of JuliaHub to see real examples across linear algebra, AI inference, and HPC. Register here - #JuliaLang #GPUProgramming #CUDA #HPC #AIInfrastructure

JuliaHub

10,255 views • 2 months ago

My dear software developers (and anyone who’s interested in the future of code search): I have crawled through the depths of hell to bring you one of the more important foundational pieces of programming: fast, the most accurate, index-free, and correct code search. Here is a real-time code search on leaked claude code sources, linux kernel 100k files, and chromium repo 500k files

Dmitriy Kovalenko

190,044 views • 3 months ago

timelapse attempt #2 >day 42 of unemployment >writing the naive cuda flashattention kernel >private sidequest progress >starting a blogpost >still haven't book the housing for asia in 3days

alexine 🏴‍☠️

558,960 views • 7 months ago

How many of the big ideas of the past 15 years of AI are downstream of hardware constraints? The big hardware story over that period is that logic has become way cheaper than data transfer. Stacking huge numbers of matrix multiplies was perfect for this hardware regime, because matrix multiplication is logic-intensive but requires less data transfer. And so we got matmul-heavy deep learning. It's interesting to think about what AI would look like in a world where these costs didn't diverge so much.

Dwarkesh Patel

59,458 views • 10 days ago

The Matrix (1999) had one of the smartest mystery-driven releases ever. The marketing barely explained anything, trailers asked a single question, and pushed curiosity instead of answers.

cinesthetic.

20,244 views • 6 months ago

📁Mo Gawdat, former Google X executive, says AI is no longer just writing code, it is correcting human mathematics. After 56 years using the same matrix multiplication method, AI realized the approach was flawed. It did not optimize software. It invented new math. The result was a 23% performance boost and the removal of hundreds of millions of dollars in costs and energy use.

Jon Hernandez

136,394 views • 6 months ago

Introducing The AI CUDA Engineer: An agentic AI system that automates the production of highly optimized CUDA kernels. The AI CUDA Engineer can produce highly optimized CUDA kernels, reaching 10-100x speedup over common machine learning operations in PyTorch. Our system is also able to produce highly optimized CUDA kernels that are much faster than existing CUDA kernels commonly used in production. We believe that fundamentally, AI systems can and should be as resource-efficient as the human brain, and that the best path to achieve this efficiency is to use AI to make AI more efficient! We are excited to publish our paper, The AI CUDA Engineer: Agentic CUDA Kernel Discovery, Optimization and Composition. We also release a dataset of over 17,000 verified CUDA kernels produced by The AI CUDA Engineer. Paper: Kernel Archive Webpage: HuggingFace Dataset: The AI CUDA Engineer utilizes evolutionary LLM-driven code optimization to autonomously improve the runtime of machine learning operations. Our system is not only able to convert PyTorch code into CUDA kernels, but through the use of evolution, it can also optimize the runtime performance of CUDA kernels, fuse multiple operations, and even discover novel solutions for writing efficient CUDA operations by learning from past innovations! We believe The AI CUDA Engineer opens a new era of AI-driven acceleration of AI and automated inference time optimization. We (Robert Lange, Aaditya Prasad 🇺🇸, Suuun, Maxence Faldor, Yujin Tang, hardmaru) are excited to continue Sakana AI's mission of leveraging AI to improve AI.

Sakana AI

1,149,339 views • 1 year ago

This is my favorite clip of the new Elon pod. He opens up saying xAI struggles with memory usage/bandwidth and CUDA kernel optimization (matmul, attention, MoE, etc). If you are good at kernel or performance engineering in general, you should apply. Steer the world in a better direction.

Elliot Arledge

163,922 views • 5 months ago

PyTorch core engineer at Meta turned CUDA kernel writing into a sport in 13 minutes - better than $1500 GPU programming bootcamps. profile the kernel -> find the bottleneck -> rewrite -> benchmark -> merge the winning code into PyTorch. That loop is how the open community now beats hand-tuned vendor kernels. GPU MODE community + KernelBot competition + winning kernel merged into the framework - that's the stack. Watch it, then steal the loop below.

h100envy

35,094 views • 4 days ago

Cursor CEO Michael Truell on the future of writing code: “Our goal with Cursor is to invent a new type of programming.” “It looks like a world where you have a representation of the logic of your software that does look more like English.” “You can imagine kind of an evolution of programming language towards pseudocode. You have written down the logic of the software, and you can edit that at a high level.” “It won't be the impenetrable millions of lines of code, it'll instead be something that's much terser and easier to understand and easier to navigate.” Source: Michael Truell (CEO Cursor) with Lenny Rachitsky on Lenny's Podcast

a16z

812,581 views • 8 months ago

Cursor CEO Michael Truell on the future of writing code: "Our goal with Cursor is to invent a new type of programming." "It looks like a world where you have a representation of the logic of your software that does look more like English." "You can imagine kind of an evolution of programming language towards pseudocode. You have written down the logic of the software, and you can edit that at a high level." "It won't be the impenetrable millions of lines of code, it'll instead be something that's much terser and easier to understand and easier to navigate." Michael Truell with Lenny Rachitsky on Lenny's Podcast

a16z

1,011,754 views • 18 days ago

The largest advancement of the CUDA platform since its creation in 2006 is here 👀 Introducing CUDA Tile, a tile-based programming model that provides the ability to write algorithms at a higher level and abstract away the details of specialized hardware, such as tensor cores. Read the technical blog 👉

NVIDIA AI Developer

244,885 views • 7 months ago

Luminal is creating PyTorch for Production – an ML compiler that generates blazingly fast CUDA kernels and makes deploying to production one line of code. Congrats on the launch, Jake Stevens, Joe Fioti, and Matthew Gunton!

Y Combinator

98,496 views • 11 months ago

Problem-solving is one of the most important skills that you need as a web dev, but most videos don't cover it. I wanted to show you what troubleshooting is like with a Frontend Mentor challenge, instead of writing perfect code the first time around.

Jess Chan | Coder Coder

10,871 views • 11 months ago

Neo in The Matrix (1999) becoming The One is one of the hardest aura shifts ever put on screen. The second Neo starts seeing the code and casually stops bullets, the entire movie suddenly starts moving at his pace instead.

cinesthetic.

229,774 views • 1 month ago

Stop writing utility classes. Utility classes are a sign that you are writing procedural code. You should avoid them. Instead, write a real object model. For example, an email is not a String... It's an Email. Gautier - 🤘

Gautier 💙

19,509 views • 1 year ago

At Anthropic, AI is writing code, even designing the next versions of itself, Dario Amodei says. This loop is closing fast, and the speed of progress is both exciting and a little unsettling. It's accelerating.

Chubby♨️

20,504 views • 5 months ago

A new mechanism to manage starvation instead of ending it, and Israeli helicopters open fire on thousands of starving people crushed in the chaos. This is how the first day of allowed aid entry into Gaza looked like..

Euro-Med Monitor

29,933 views • 1 year ago

i wanted to start a consistency challenge, then I came across the Sui Ghana Content Challenge and here I am. 20 days of dropping content: Videos Threads etc… A bit late today, but this is Day 1 I explained Web3 and why Sui. See you in the next one. #SuiContentChallenge

Creative Bee🐝

14,814 views • 9 months ago