Introducing The AI CUDA Engineer: An agentic AI system that automates the production of highly optimized CUDA kernels. The AI CUDA Engineer can produce highly optimized CUDA kernels, reaching 10-100x speedup over common machine learning operations in PyTorch. Our system is also able to produce highly optimized CUDA kernels...
1,149,274 views • 1 year ago • via X (Twitter)
10 Comments

japan bros are back again

🚀 Agentic AI Systems are changing what AI can do by having the power to act independently. Unlike traditional AI, which needs constant human supervision, these systems can operate more autonomously. Learn how this shift is leading to smarter solutions that can transform industries.➡️ #TechTrends2025

I’m going to sleep. If I wake up to this having 1M+ views, I will read the paper tomorrow morning; else pls give me a vibe check, chat

I took a quick look at their report on my phone, and there are a few misleading parts: 1. The Torch C++ code is not a CUDA kernel; it is calling cuDNN under the hood. 2. In the highlighted Conv3D GroupNorm example, the conv code is not generated at all. The speedup doesn’t make sense if the numerical result is wrong. 3. It claims WMMA can be faster than PyTorch (cuBLAS), which is definitely wrong. Probably a benchmark error.

isn't there clearly something wrong with level_1->15_Matmul_for_lower_triangular_matrices? claimed 152.9x speedup for the kernel on the left over the code on the right. really?
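
Both comments above come down to the same check: a speedup number only means something if the candidate's output matches the reference. Here is a minimal sketch of that idea in plain Python rather than CUDA. All function names are hypothetical and made up for illustration; this is not the harness used in the report, just one way to refuse to time a kernel whose numerics are wrong.

```python
import time


def matmul_reference(a, b):
    """Dense O(n^3) product of two square matrices; the trusted baseline."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def matmul_lower_triangular(a, b):
    """Candidate: product of two lower-triangular matrices, skipping known zeros."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):  # entries above the diagonal stay zero
            # a[i][k] is zero for k > i and b[k][j] is zero for k < j
            c[i][j] = sum(a[i][k] * b[k][j] for k in range(j, i + 1))
    return c


def max_abs_diff(x, y):
    """Largest elementwise absolute difference between two matrices."""
    return max(abs(p - q) for rx, ry in zip(x, y) for p, q in zip(rx, ry))


def checked_speedup(candidate, reference, args, tol=1e-9, repeats=5):
    """Return reference_time / candidate_time, but only if outputs agree."""
    err = max_abs_diff(reference(*args), candidate(*args))
    if err > tol:
        raise ValueError(f"candidate output differs from reference by {err}")

    def best_time(fn):
        times = []
        for _ in range(repeats):
            start = time.perf_counter()
            fn(*args)
            times.append(time.perf_counter() - start)
        return max(min(times), 1e-12)  # guard against a zero reading

    return best_time(reference) / best_time(candidate)
```

A "kernel" that returns early or writes nothing would look extremely fast under a timing-only benchmark; with a check like this it raises `ValueError` before any timing is reported.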

Hey - wondering if you all are only working with large enterprises right now. If not, we’d love to chat! This would be extremely useful to us - we’re building low-bit models to run efficiently on Jetsons. Generating optimized CUDA code for these would be a game-changer!

A slow 14 seconds in AI developments

guys seriously I can't take it anymore need to slow down

Japan is back

Wow, Japan is back! 🇯🇵


