Introducing The AI CUDA Engineer: An agentic AI system that automates the production of highly optimized CUDA kernels. The AI CUDA Engineer can produce highly optimized CUDA kernels, reaching 10-100x speedup over common machine learning operations in PyTorch. Our system is also able to produce highly optimized CUDA kernels...

1,149,274 views • 1 year ago • via X (Twitter)

10 Comments

vittorio • 1 year ago

japan bros are back again

Globant • 1 year ago

🚀 Agentic AI Systems are changing what AI can do by having the power to act independently. Unlike traditional AI, which needs constant human supervision, these systems can operate more autonomously. Learn how this shift is leading to smarter solutions that can transform industries.➡️ #TechTrends2025

ludwig • 1 year ago

I’m going to sleep if I wake up to this having 1M+ views I will read the paper tomorrow morning else pls give me a vibe check chat

Bing Xu • 1 year ago

I quickly take a look of their report on phone, there are a few misleading parts: 1. Torch C++ code is not CUDA kernel, it is calling CUDNN under hood. 2. The highlighted example Conv3D GroupNorm, conv code is not generated at all. The speedup doesn’t make sense if numerical is wrong. 3. It claims wmma can be faster than PyTorch (CUBLAS), is definitely wrong. Probably benchmark error.

main • 1 year ago

isn't there clearly something wrong with level_1->15_Matmul_for_lower_triangular_matrices? claimed 152.9x speedup for the kernel on the left over the code on the right. really?

Viraat • 1 year ago

Hey - wondering if you all are only working with large enterprises right now. If not, we’d love to chat! This would be extremely useful to us - we’re building low-bit models to run efficiently on Jetsons. Generating optimized CUDA code for these would be a game-changer!

aizk ✡️'s profile picture
aizk ✡️1 year ago

A slow 14 seconds in AI developments

Dan Mac • 1 year ago

guys seriously I can't take it anymore need to slow down

Kristof • 1 year ago

Japan is back

Dr Futuro - e/acc • 1 year ago

Wow, Japan is back! 🇯🇵
