timelapse #89 (12.5 hrs):
- got single gpu nvfp4 gemm @ 5.2 PFLOPS working reliably (sm100)
- solved ampere/hopper gemm kernel from scratch issues
- split kernel optimization chapter into:
- gemv, softmax, layernorm, topK, gemm (fp32 only cuda cores)
- gemm (tf32, fp16, bf16, fp8, fp4)
- cutting...

Elliot Arledge

39,426 subscribers

60,596 views • 9 months ago • via X (Twitter)

Related videos

♊️ : They're going to open GEMM HOUSE 🏠 soon, a GMMTV café. Honestly, they made it for me, since it's named GEMM 🤣 Please stop by 💖 GEMINIFOURTH VN FANMEET #GeminiFourthFMinVietnam #Gemini_NT #เจมีไนน์ gemini_nt

Geminint_Family

26,842 views • 15 days ago

happy bday to this gemm who else is still up chasing or being chased

nourahae yoohoo

141,500 views • 1 year ago

timelapse #83 (22 hrs):
- it was very easy to dive super deep into anything i needed to (this is what i focused on today because not all days are like this)
- finding the grok code fast 1 + grok 4 for deep thinking and verification combo to be super useful in cursor. speed was solid
- hard to imagine myself spending many more mental clock cycles in a 24 hr period
- had to pull out qwen3-next’s gated deltanet + linear attention from bleeding edge hf transformers to begin implementing a multi-gpu fp8 trainer from scratch. this is so damn bleeding edge and i underestimated how much effort this has and will require lol
- lots of diet coke and oats
- shipped the template which the core chapters of my book will be built on:
- all im missing now is flash attention 1/2 mastery (fa2 tmrw), intuition on making topk faster (for arbitrary row length), what i should and shouldnt teach in cutlass/cute, hopper/blackwell gemm kernel mastery (down to fp4). shoutout to Pranjal for making this easier for me. his blog post is amazing
- caught up w/ Mati Roy
- im feeling great mentally but not so good physically as im writing this and about to pass out

Elliot Arledge

2,255,154 views • 9 months ago

timelapse #90 (8 hrs):
- lost 2 days of work after 8xB200 node randomly turned off and disk deleted (i think i can bring it back in an hour or so)
- getting easy to kick off nvfp4 gemm templates for book
- switching between cursor instances like a madman
- my definition of focus is when you are ruthlessly killing the AMAZING ideas popping up in your head that are truly distracting you and focus on what matters instead (managing ideally 2 things at once)
- further architecting on distributed chapter since im including a basic example of pipeline parallelism
- did a space on hyperengineering
- 2 meetings with editors
- catchup with a fr8 participant

Elliot Arledge

18,186 views • 9 months ago

PyTorch core engineer at Meta turned CUDA kernel writing into a sport in 13 minutes - better than $1500 GPU programming bootcamps. profile the kernel -> find the bottleneck -> rewrite -> benchmark -> merge the winning code into PyTorch. That loop is how the open community now beats hand-tuned vendor kernels. GPU MODE community + KernelBot competition + winning kernel merged into the framework - that's the stack. Watch it, then steal the loop below.

h100envy

35,148 views • 4 days ago

Sanctions Hit Linux Kernel, Russian Programmers Banned. Biden's Executive Order 14071 forbids Russians from working with or using GPL'd software made in the USA. And that includes the Linux Kernel.

The Lunduke Journal

19,412 views • 1 year ago

timelapse attempt #2
>day 42 of unemployment
>writing the naive cuda flashattention kernel
>private sidequest progress
>starting a blogpost
>still haven't booked the housing for asia in 3 days

alexine 🏴‍☠️

558,960 views • 7 months ago

timelapse #72 (7 hrs):
- back in Canada and seriously couldn’t think of taking a break (im having so much fun all day just dumping my heart into making my work the highest quality)
- set up new raspberry 5 to get the consistent timelapses going (and run some simple background tasks)
- very deep cuda book working session (using zed IDE)
- figured out how im going to articulate the hardest kernel optimizations to my readers
- more research into evolution of Nvidia tensor cores over the years and what they compile down to for each architecture
- steak dinner w/ family
- went for ice cream with a friend
- spaces w/ Adrian Dittmann

Elliot Arledge

34,964 views • 10 months ago

This is my favorite clip of the new Elon pod. He opens up saying xAI struggles with memory usage/bandwidth and CUDA kernel optimization (matmul, attention, MoE, etc). If you are good at kernel or performance engineering in general, you should apply. Steer the world in a better direction.

Elliot Arledge

163,922 views • 5 months ago

Two days ago, Deepseek surprised everyone with an "undefined-behavior" PTX optimization speeding up particular ML workloads on a Hopper NVIDIA GPU Kernel. Let's reverse engineer the hack, implement it ourselves, and benchmark the speedup on an H100.

LaurieWired

228,591 views • 1 year ago

# FreeRTOS Kernel Teardown!! Working on porting the FreeRTOS kernel from scratch on a new target. Got the scheduler to kick off and switch between two tasks. It later runs into a crash... I am debugging that at the moment. I am going to dive into the details of the FreeRTOS Kernel implementation. The scheduler is a fancy version of what we hand coded as part of the "Cortex-M CPU - Programming a Round Robin Scheduler". It is available as a YouTube playlist - The FreeRTOS teardown and exploration will be included in the Firmware Engineering Bundle of courses: Use REBOOT50 at the checkout to get 50% off.

Piyush Itankar

14,420 views • 1 year ago

Luminal fuses entire models into a single GPU kernel, automatically. Let's talk about why this matters for inference at the speed of light:

Joe Fioti

24,348 views • 5 months ago

Teaching Microsoft employees how to implement a portable sleep: - C - absolutely from scratch - not even libc - the only dependency is The Kernel - some inline assembly magic included

Valentin Ignatev

149,077 views • 6 months ago

Introducing The AI CUDA Engineer: An agentic AI system that automates the production of highly optimized CUDA kernels. The AI CUDA Engineer can produce highly optimized CUDA kernels, reaching 10-100x speedup over common machine learning operations in PyTorch. Our system is also able to produce highly optimized CUDA kernels that are much faster than existing CUDA kernels commonly used in production. We believe that fundamentally, AI systems can and should be as resource-efficient as the human brain, and that the best path to achieve this efficiency is to use AI to make AI more efficient! We are excited to publish our paper, The AI CUDA Engineer: Agentic CUDA Kernel Discovery, Optimization and Composition. We also release a dataset of over 17,000 verified CUDA kernels produced by The AI CUDA Engineer. Paper: Kernel Archive Webpage: HuggingFace Dataset: The AI CUDA Engineer utilizes evolutionary LLM-driven code optimization to autonomously improve the runtime of machine learning operations. Our system is not only able to convert PyTorch code into CUDA kernels, but through the use of evolution, it can also optimize the runtime performance of CUDA kernels, fuse multiple operations, and even discover novel solutions for writing efficient CUDA operations by learning from past innovations! We believe The AI CUDA Engineer opens a new era of AI-driven acceleration of AI and automated inference time optimization. We (Robert Lange, Aaditya Prasad 🇺🇸, Suuun, Maxence Faldor, Yujin Tang, hardmaru) are excited to continue Sakana AI's mission of leveraging AI to improve AI.

Sakana AI

1,149,339 views • 1 year ago

Do you feel the awakening? It’s happening all over the world. Tonight in LA. Tomorrow in DC. For Charlie. For Jesus. For Revival. “Very truly I tell you, unless a kernel of wheat falls to the ground and dies, it remains only a single seed. But if it dies, it produces many seeds.” John 12:24

Sean Feucht

37,619 views • 9 months ago

I personally like to understand things from scratch. Libraries and Frameworks are hard for me to live with (until I understand how they work internally). FreeRTOS is very commonly used in the Embedded World as the go-to OS. I decided to understand how the internals work and am currently recording my understanding as part of a course. I focus on Porting FreeRTOS Kernel to a new target from scratch. More details here - Here is a Preview Lecture - including the Kernel files and resolving errors as we run into them...
Full list of lectures recorded till now (I'll add more in a week):
-- What and Why FreeRTOS
-- Demo walkthrough - End goal and Tools
-- Codespace Setup
-- Programmers model and Essentials of ARM M CPU
-- Demo - Booting the CPU
-- Jumping from Assembly code to function in C files
-- Fixing the problem of function calls in C
-- Getting the FreeRTOS source code and documentation
-- Starting to Integrate the FreeRTOS-Kernel
-- Finding and fixing the portmacro.h errors
-- Finding and adding the FreeRTOSConfig.h
-- Enabling heap for dynamic memory allocation
-- Setting the Scheduling Rate
-- Getting the Kernel to compile successfully
more lectures being recorded...

Piyush Itankar

10,781 views • 1 year ago

'At times they've helped me get out my bed in the morning! I love working with them every single day' ❤️ Dundee boss Steven Pressley after victory in the Dundee derby made it three wins in a row for his side. #BBCFootball

BBC Sport Scotland

114,937 views • 6 months ago

Flash your Kali Linux NetHunter kernel straight from the NetHunter app! The preview of 2024.3 version is now in NetHunter Store 📱📡🌽 OffSec Mobile Hacker Re4son Kernel

yesimxev

14,119 views • 2 years ago

"So yeah… You’re stuck with me. Not just this morning. Not just in this bed. But in every day I’ve got left. If you’ll have me." NEW SFW! GOOD MORNING EVERYONE! I Hope you wake up on the good side of the bed~ Have a good day! mwah mwah~ FULL AUDIO:

ENVY

10,739 views • 1 year ago

At Modal we've built every layer of the AI infra stack from scratch — from filesystems and networking to our own async queues and multi-cloud GPU orchestration. I sat down with Arjun Narayan from Amplify Partners to go into depth on all of this, including the fun ways the Linux kernel has tried to stop us along the way:

Akshat Bubna

45,281 views • 10 months ago