Introducing The AI CUDA Engineer: An agentic AI system that automates the production of highly optimized CUDA kernels.

The AI CUDA Engineer can produce highly optimized CUDA kernels, reaching 10-100x speedup over common machine learning operations in PyTorch. Our system is also able to produce highly optimized CUDA kernels that are much faster than existing CUDA kernels commonly used in production.

We believe that fundamentally, AI systems can and should be as resource-efficient as the human brain, and that the best path to achieve this efficiency is to use AI to make AI more efficient!

We are excited to publish our paper, The AI CUDA Engineer: Agentic CUDA Kernel Discovery, Optimization and Composition. We also release a dataset of over 17,000 verified CUDA kernels produced by The AI CUDA Engineer.

Paper:
Kernel Archive Webpage:
HuggingFace Dataset:

The AI CUDA Engineer utilizes evolutionary LLM-driven code optimization to autonomously improve the runtime of machine learning operations. Our system is not only able to convert PyTorch code into CUDA kernels, but through the use of evolution, it can also optimize the runtime performance of CUDA kernels, fuse multiple operations, and even discover novel solutions for writing efficient CUDA operations by learning from past innovations!

We believe The AI CUDA Engineer opens a new era of AI-driven acceleration of AI and automated inference time optimization.

We (Robert Lange, Aaditya Prasad 🇺🇸, Suuun, Maxence Faldor, Yujin Tang, hardmaru) are excited to continue Sakana AI's mission of leveraging AI to improve AI.
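As a rough illustration of the evolutionary loop the post describes, here is a minimal Python sketch. It is not The AI CUDA Engineer's implementation: the `Candidate` type, the `propose` function (a random stand-in for an LLM proposal), the `cost` field (a stand-in for measured runtime), and all parameter values are hypothetical. The one idea it tries to show is that a candidate should be checked against a trusted reference before its speed is allowed to count.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    # Hypothetical stand-in for one generated kernel.
    cost: float         # stand-in for measured runtime (lower is better)
    output_bias: float  # 0.0 means the output matches the reference


def reference(xs: List[float]) -> float:
    # Trusted baseline, playing the role of the original PyTorch operation.
    return sum(xs)


def run(c: Candidate, xs: List[float]) -> float:
    # "Executes" the candidate; a nonzero bias models a numerically wrong kernel.
    return sum(xs) + c.output_bias


def is_verified(c: Candidate, xs: List[float], tol: float = 1e-6) -> bool:
    # Correctness check against the reference on the given inputs.
    return abs(run(c, xs) - reference(xs)) <= tol


def propose(parent: Candidate, rng: random.Random) -> Candidate:
    # Stand-in for an LLM proposal (assumption): usually a small runtime change,
    # occasionally a much "faster" variant that silently breaks the output.
    if rng.random() < 0.3:
        return Candidate(cost=parent.cost * 0.1, output_bias=1.0)
    return Candidate(cost=parent.cost * rng.uniform(0.7, 1.1),
                     output_bias=parent.output_bias)


def evolve(generations: int = 20, children: int = 8,
           keep: int = 3, seed: int = 0) -> Candidate:
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(64)]
    population = [Candidate(cost=1.0, output_bias=0.0)]
    for _ in range(generations):
        offspring = [propose(rng.choice(population), rng) for _ in range(children)]
        # Correctness gate: unverified candidates are dropped before ranking by cost.
        verified = [c for c in population + offspring if is_verified(c, xs)]
        population = sorted(verified, key=lambda c: c.cost)[:keep]
    return population[0]


if __name__ == "__main__":
    best = evolve()
    print(f"best verified cost: {best.cost:.4f}")
```

Parents stay in the pool each generation, so the best verified candidate never gets worse, and a wrong candidate with a tiny `cost` never survives selection. A real system would need a much stronger check than a single random input.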

Sakana AI

72,744 subscribers

1,149,339 views • 1 year ago • via X (Twitter)

Science & Technology · Education

10 comments

vittorio • 1 year ago

japan bros are back again

Globant • 1 year ago

🚀 Agentic AI Systems are changing what AI can do by having the power to act independently. Unlike traditional AI, which needs constant human supervision, these systems can operate more autonomously. Learn how this shift is leading to smarter solutions that can transform industries.➡️ #TechTrends2025

ludwig • 1 year ago

I’m going to sleep if I wake up to this having 1M+ views I will read the paper tomorrow morning else pls give me a vibe check chat

Bing Xu • 1 year ago

I quickly take a look of their report on phone, there are a few misleading parts: 1. Torch C++ code is not CUDA kernel, it is calling CUDNN under hood. 2. The highlighted example Conv3D GroupNorm, conv code is not generated at all. The speedup doesn’t make sense if numerical is wrong. 3. It claims wmma can be faster than PyTorch (CUBLAS), is definitely wrong. Probably benchmark error.

main • 1 year ago

isn't there clearly something wrong with level_1->15_Matmul_for_lower_triangular_matrices? claimed 152.9x speedup for the kernel on the left over the code on the right. really?

Viraat • 1 year ago

Hey - wondering if you all are only working with large enterprises right now. If not, we’d love to chat! This would be extremely useful to us - we’re building low-bit models to run efficiently on Jetsons. Generating optimized CUDA code for these would be a game-changer!

aizk ✡️ • 1 year ago

A slow 14 seconds in AI developments

Dan Mac • 1 year ago

guys seriously I can't take it anymore need to slow down

Kristof • 1 year ago

Japan is back

Dr Futuro - e/acc • 1 year ago

Wow, Japan is back! 🇯🇵

Related videos

🎉 Stoked to share The AI CUDA Engineer 👷 - our end-to-end approach for automating the design and optimization of CUDA Kernels using agentic systems. Blog 📰: Paper 📜: WebUI 📈: Dataset 💽: Awesome team work done with Aaditya Prasad 🇺🇸, Suuun, Maxence Faldor, Yujin Tang, hardmaru 🤗

Robert Lange

42,174 views • 1 year ago

Luminal is creating PyTorch for Production: an ML compiler that generates blazingly fast CUDA kernels and makes deploying to production one line of code. Congrats on the launch, Jake Stevens, Joe Fioti, and Matthew Gunton!

Y Combinator

98,496 views • 11 months ago

AI Coding Agent for Hardware Optimized Code Diana AI hardware is still constrained by software. However, with reasoning models like Deepseek R1 or OpenAI o1 and o3, AI could generate hardware-optimized code that rivals—or surpasses—human CUDA code.

Y Combinator

60,089 views • 1 year ago

🎉 "This is the 20th anniversary of CUDA. We have been working on this architecture for 20 years ... to now have built up hundreds of millions of GPUs and computing systems around the world that run CUDA."

NVIDIA HPC Developer

27,642 views • 3 months ago

Mark my words… The future of AI will not be written in CUDA! [I love this video generated by Higgsfield AI 🧩 for the @amd ROCm meetup this week]

Jeff Tatarchuk

12,132 views • 1 year ago

We are teaming up with @Databricks to supercharge AI workflows. ⚡ #CUDA will soon be at the heart of Databricks' Data Intelligence Platform, starting with AI-accelerated Photon, delivering improved speed and efficiency for customers’ data warehousing and analytics workflows.

NVIDIA AI Developer

22,927 views • 1 year ago

in the lectures below, i hold your hand through low-level LLM systems engineering. it includes everything up to TODAY! 1) pytorch tensors 2) large matmul on cpu vs gpu 3) JAX (and why xAI uses it instead of pytorch) 4) raw cuda kernels and global threading indexing 5) triton design philosophy and softmax example 6) HIP kernels 7) mapping out the ENTIRE ecosystem + differences between CUDA and ROCm/HIP (BLAS, FFT, DNN) 8) cutlass and cute-dsl 9) pretraining, finetuning, rl, unsloth, axolotl, megatron-lm, deepspeed, nanogpt, nanochat 10) training vs inference, inference serving problems, throughput vs latency vs concurrency scaling, vllm, sglang, tensorrt-llm, tensorrt, llama.cpp, exllamav2, exllamav3, benchmark comparisons 11) projects/companies using llms to generate SOTA cuda/triton kernels 12) luminal inference 13) mojo/modular/max

Elliot Arledge

57,855 views • 8 months ago

#NVIDIAGTC kicks off this year with some kudos to CUDA, the platform and model that has become instrumental to the company’s ubiquity amid the AI boom, from CEO Jensen Huang.

TechCrunch

11,133 views • 3 months ago

Like, Love or Leave? The Rapid Transit 'Cuda! One of four promotional cars used by Chrysler. It only has 967 miles and was produced with a 440 SIX BARREL. The 'Cuda is serial number 100005.

Ultimate Muscle Car

39,167 views • 6 months ago

Jensen Huang said the AI factory industry "will be measured in trillions of dollars." "So far his track record on that is pretty good,” - Bernstein's Stacy Rasgon Agreed, remember Jensen's past predictions about AI, CUDA, etc. ?

The AI Investor

14,185 views • 1 year ago

I used to find writing CUDA code rather terrifying. But then I discovered a couple of tricks that actually make it quite accessible. In this video I introduce CUDA in a way that will be accessible to Python programmers, and I even show how to do it all in Colaboratory!

Jeremy Howard

210,642 views • 2 years ago

Cerebras just had the biggest IPO of the year. Founder Andrew Feldman says the 3 most important things he had to convince investors of while doing the roadshow were that demand for inference is going to 1,000,000x, the GPU isn't the only way to do compute, and that the CUDA moat is overstated. What he said: "Jensen said some time ago on Brad Gerstner's podcast that the demand for inference will grow by a 1,000,000x, and nobody believed him. And at the same time, you saw Sam Altman displaying real vision and going out and trying to lock up huge amounts of compute, memory, data centers, and power, because he saw it too." "[We tried] to share what that means — what exponential demand means. And that we're still so early, and yet the demand for AI compute is overwhelming." "The other thing is that there are lots of ways to do this. The GPU isn't the only way. You've got TPUs, Trainium, and us. There are lots of different ways to build a solution here." "And finally — the notion that CUDA is this grand lock-in is overplayed. Gemini 3, which is an excellent model, was trained on TPUs with no CUDA. The Anthropic models were trained on Trainium with no CUDA. Some of the best models, some of the most interesting things are being done without CUDA. And that lock-in might be overplayed." $CBRS

TBPN

36,474 views • 1 month ago

The largest advancement of the CUDA platform since its creation in 2006 is here 👀 Introducing CUDA Tile, a tile-based programming model that provides the ability to write algorithms at a higher level and abstract away the details of specialized hardware, such as tensor cores. Read the technical blog 👉

NVIDIA AI Developer

244,885 views • 6 months ago

Open is the AI foundation. “Beyond CUDA to me is the democratization of compute... ROCm gave us a seamless training experience.” PhD student Neha Prakriya is building smarter AI with AMD GPUs + ROCm — no walled gardens, just open innovation. See more:

AI at AMD

46,551 views • 1 year ago

wow.. This new AI motion capture plugin in Unreal Engine can use two cameras to capture both character rigging and facial expressions in real time 🤯 it’s free, powered by NVIDIA CUDA

el.cine

90,973 views • 1 year ago

Siemens and NVIDIA are accelerating the next wave of automation with physical AI. 🤝 Watch the full #CES2026 demo to see how Siemens is integrating NVIDIA CUDA-X libraries and Omniverse into its EDA, CAE, and digital twin portfolio, bringing physical AI across the entire industrial lifecycle. 📺

NVIDIA Omniverse

19,602 views • 5 months ago

Nvidia announces the new RTX Spark, a new platform powered by the NX1 CPU, and shows off Spark laptops running 007 First Light and Forza 6. The CPU has 20 ARM-based cores and a Blackwell RTX GPU with 6144 CUDA Cores. This is the same core count as a 5070, but with 128GB of unified LPDDR5X RAM sitting in the same package as the CPU and GPU. The entire Nvidia software stack is available, particularly CUDA, vital for AI. Nvidia's new laptops will likely be ideal for running local LLMs because the unified memory means you can load models up to 120-180B parameters (quantized). These laptops are expected to ship later this year and could become strong competitors to high-end MacBooks and even Mac Studios for local AI workloads, thanks to CUDA support and unified memory. Price is unannounced.

Grummz

30,573 views • 1 month ago

The CUDA moat is real, but probably not for long, says CEO of AI infrastructure platform Modal Erik Bernhardsson. He says he's bullish on alternative accelerators over the 2-3 year timeline, even though there's currently zero demand from his customers for TPUs, etc. "The cost today of rewriting your software to run on those stacks is very high... But the cost is going to go down." "You're going to have software that basically lets you take CUDA-compatible stuff and run it on alternative accelerators."

TBPN

35,189 views • 1 month ago

Chamath: US AI Startups Can Learn A Lot from DeepSeek "This is a case where necessity was the mother of invention." On E213, Chamath Palihapitiya explained how, even if the $6M number is inaccurate, DeepSeek still had some impressive breakthroughs: GRPO > PPO for reinforcement learning "These guys were like, 'Well, how am I going to do this reinforcement learning thing?' They invented a totally different algorithm." "It uses a lot less computer memory and it's highly performant." PTX > CUDA to build the model "And then the second thing that was crazy is everybody is used to building models and compiling through CUDA, which is NVIDIA's proprietary language." "And these guys worked totally around CUDA and they did something called PTX, which goes right to the bare metal." What US AI startups can learn from this: "We, meaning the West, with all the money that we've had, didn't come up with these ideas." "And I think part of why we didn't come up is not that we're not smart enough to do it, but we weren't forced to because the constraints didn't exist." "And so I just wonder how we make sure we learn this principle, meaning, when the AI company wakes up and rolls out of bed and some VC gives them $200M, maybe that's not the right answer for a Series A or a Seed." "And maybe the right answer is $2M so that they do these DeepSeek-like innovations."

The All-In Podcast

108,815 views • 1 year ago

Can LLMs invent better ways to train LLMs? At Sakana AI, we’re pioneering AI-driven methods to automate AI research and discovery. We’re excited to release DiscoPOP: a new SOTA preference optimization algorithm that was discovered and written by an LLM! Our method leverages LLMs to propose and implement new preference optimization algorithms. We then train models with those algorithms and evaluate their performance, providing feedback to the LLM. By repeating this process for multiple generations in an evolutionary loop, the LLM discovers many highly-performant and novel preference optimization objectives! Paper: GitHub: Model: We proudly collaborated with the University of Oxford (Foerster Lab for AI Research (now part of BOLD)) and Cambridge University (Mihaela van der Schaar) on this groundbreaking project. Looking ahead, we envision a future where AI-driven research reduces the need for extensive human intervention and computational resources. This will accelerate scientific discoveries and innovation, pushing the boundaries of what AI can achieve.

Sakana AI

555,859 views • 2 years ago