
Aadi Kulshrestha
@MankyDankyBanky • 3,429 subscribers
19 | Incoming @NVIDIA | SWE @Roblox | Prev @Shopify | Computer Science @ University of Waterloo
Videos

I trained a 12M-parameter LLM on my own ML framework, using a Rust backend and CUDA kernels for flash attention, AdamW, and more. I also wrote the full transformer architecture and the BPE tokenizer from scratch.

The framework features:
- Custom CUDA kernels (Flash Attention, fused LayerNorm, fused GELU) for 3x higher throughput
- Automatic WebGPU fallback for non-NVIDIA devices
- TypeScript API with a Rust compute backend
- One npm install to get started, with prebuilt binaries for every platform

Try out the model for yourself:

Built with Reese Chong. Check out the repos and blog if you want to learn more.

Shoutout to Modal for the compute credits that let me train on 2 A100 GPUs without going broke. cc sunny madra Gavin
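For readers unfamiliar with what a from-scratch BPE tokenizer involves, here is a minimal sketch of byte-level BPE training in TypeScript. It is not the code from this project: the function names (`countPairs`, `mergePair`, `trainBpe`) are hypothetical, and it assumes the input has already been converted to UTF-8 byte values (0..255). Each step finds the most frequent adjacent pair of token ids and replaces it with a new id.

```typescript
// Illustrative sketch only; names are hypothetical and not taken from the framework above.

// Count how often each adjacent pair of token ids occurs (key is "a,b").
function countPairs(ids: number[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (let i = 0; i + 1 < ids.length; i++) {
    const key = ids[i] + "," + ids[i + 1];
    counts[key] = (counts[key] || 0) + 1;
  }
  return counts;
}

// Replace each non-overlapping occurrence of (a, b) with newId, scanning left to right.
function mergePair(ids: number[], a: number, b: number, newId: number): number[] {
  const out: number[] = [];
  let i = 0;
  while (i < ids.length) {
    if (i + 1 < ids.length && ids[i] === a && ids[i + 1] === b) {
      out.push(newId);
      i += 2;
    } else {
      out.push(ids[i]);
      i += 1;
    }
  }
  return out;
}

// Learn up to numMerges merges, assuming the input is a list of byte values (0..255).
// New token ids start at 256. Each merge is recorded as [left, right, newId].
function trainBpe(
  bytes: number[],
  numMerges: number
): { ids: number[]; merges: Array<[number, number, number]> } {
  let ids = bytes.slice();
  const merges: Array<[number, number, number]> = [];
  for (let step = 0; step < numMerges; step++) {
    const counts = countPairs(ids);
    let best = "";
    let bestCount = 1; // stop once no pair occurs at least twice
    for (const key of Object.keys(counts)) {
      if (counts[key] > bestCount) {
        best = key;
        bestCount = counts[key];
      }
    }
    if (best === "") break;
    const parts = best.split(",");
    const a = Number(parts[0]);
    const b = Number(parts[1]);
    const newId = 256 + step;
    ids = mergePair(ids, a, b, newId);
    merges.push([a, b, newId]);
  }
  return { ids, merges };
}
```

Encoding new text would replay the recorded merges in the same order. A production tokenizer would typically also pre-split text into chunks and use a faster pair-counting structure, both omitted here for brevity.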
Aadi Kulshrestha • 808,841 views • 1 month ago