
Elliot Arledge
@elliotarledge • 37,907 subscribers
KernelBench-Hard • i made the 12-hour CUDA course on FreeCodeCamp • "the timelapse guy" • books i wrote: https://t.co/2hzezpoDwQ • @shipfr8 alum

timelapse #147 (15.5 hrs)
- woke up to MiniMax (official) M3 launch including my kernelbench-hard (lowest score at below 30% which emphasizes the hardness)
- did a space with the minimax and together ai folks
- burned 1.1B tokens
- got nanogpt at nvfp4 training stability to match bf16. this is a prereq for another problem im trying to solve
- got my timelapse workflow nailed with a solid html page lol
- losing patience with anthropic rate limits
Elliot Arledge • 251,378 views • 1 day ago

timelapse #83 (22 hrs):
- it was very easy to dive super deep into anything i needed to (this is what i focused on today because not all days are like this)
- finding the combo of grok code fast 1 + grok 4 (for deep thinking and verification) to be super useful in cursor. speed was solid
- hard to imagine myself spending many more mental clock cycles in a 24 hr period
- had to pull out qwen3-next's gated deltanet + linear attention from bleeding edge hf transformers to begin implementing a multi-gpu fp8 trainer from scratch. this is so damn bleeding edge and i underestimated how much effort this has taken and will take lol
- lots of diet coke and oats
- shipped the template which the core chapters of my book will be built on:
- all im missing now is flash attention 1/2 mastery (fa2 tmrw), intuition on making topk faster (for arbitrary row length), what i should and shouldnt teach in cutlass/cute, hopper/blackwell gemm kernel mastery (down to fp4). shoutout to Pranjal for making this easier for me. his blog post is amazing
- caught up w/ Mati Roy
- im feeling great mentally but not so good physically as im writing this and about to pass out
Elliot Arledge • 2,255,154 views • 8 months ago

timelapse #85 (27.5 hrs):
- currently cant rely on any other coding models except grok code fast 1 + grok 4 fast (for complex reasoning grok 4 fast is 20 cents for 1M tokens)
- wrote qwen3-next trainer entirely from scratch to make it more manageable
- each piece completely done by grok-code-fast-1 in cursor as it seems to handle this task pretty well without the grok 4 fast reasoning
- take on smaller problems and complete them quickly (makes it easier with 400 toks/sec over the api)
- got distributed fp8 qwen3-next trainer running at 0.8 seconds per step on 8xH100s (still need to finish checkpoint loading logic)
- perfect timing as the fp8 version of qwen3-next drops as im writing this
- ill be in LA in 2 days (will visit SF midway through as well)
- 12.5% margarita
- steak dinner with family
- gained intuition on FlashAttention in very long context settings
- caught up w/ Kearm h/eng and Arnie Ramesh
Elliot Arledge • 283,820 views • 8 months ago

This is my favorite clip of the new Elon pod. He opens up saying xAI struggles with memory usage/bandwidth and CUDA kernel optimization (matmul, attention, MoE, etc). If you are good at kernel or performance engineering in general, you should apply. Steer the world in a better direction.
Elliot Arledge • 158,971 views • 4 months ago

timelapse #21 (12 hrs):
- leetcode practice for xAI coding test
- updated mnist-cuda (find repo in pinned) by adding a new CUDA training script w/ an extra hidden layer and a feature to show generalization
- notes on LLM training datasets, architectures, training config, etc
Elliot Arledge • 216,069 views • 1 year ago

timelapse #86 (15 hrs):
- got my first OOM on 8xB200 node
- defaulting back to grok-code-fast-1, the fastest reliable coding model with by far the most intuitive instruction following, combined with grok 4 fast reasoning to plan before i let grok code work its magic
- drank 2 large tim hortons iced capps, loaded myself w/ creatine, daily nootropics
- tried out gpt-5-codex but it simply doesnt match the speed i require when i go deep into one thing at a time sequentially
- got caught watching youtube videos in the middle, need to make sure i block any and all content that could get in my way
- caught up on all book revisions so getting super ahead with other chapters
- developed an overnight addiction to switching color themes in cursor
- did some pair programming w/ Kearm h/eng using Tuple on free trial
- applying for O-1
Elliot Arledge • 102,911 views • 8 months ago

in the lectures below, i hold your hand through low-level LLM systems engineering. it includes everything up to TODAY!
1) pytorch tensors
2) large matmul on cpu vs gpu
3) JAX (and why xAI uses it instead of pytorch)
4) raw cuda kernels and global thread indexing
5) triton design philosophy and softmax example
6) HIP kernels
7) mapping out the ENTIRE ecosystem + differences between CUDA and ROCm/HIP (BLAS, FFT, DNN)
8) cutlass and cute-dsl
9) pretraining, finetuning, rl, unsloth, axolotl, megatron-lm, deepspeed, nanogpt, nanochat
10) training vs inference, inference serving problems, throughput vs latency vs concurrency scaling, vllm, sglang, tensorrt-llm, tensorrt, llama.cpp, exllamav2, exllamav3, benchmark comparisons
11) projects/companies using llms to generate SOTA cuda/triton kernels
12) luminal inference
13) mojo/modular/max
Elliot Arledge • 57,855 views • 7 months ago
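
Item 4 in the lecture list above covers global thread indexing in raw CUDA kernels. As a rough illustration only, the sketch below emulates a 1D kernel launch in plain Python so the index math is visible without a GPU. The function names (`global_thread_index`, `launch_vector_add`) and the block size are hypothetical and are not taken from the lectures.

```python
def global_thread_index(block_idx, block_dim, thread_idx):
    # Mirrors the usual 1D CUDA expression:
    # blockIdx.x * blockDim.x + threadIdx.x
    return block_idx * block_dim + thread_idx


def launch_vector_add(a, b, block_dim=4):
    # Sequentially emulates a 1D grid; on a GPU these iterations run in parallel.
    n = len(a)
    out = [0] * n
    # Round up so the last, partially filled block is still launched.
    grid_dim = (n + block_dim - 1) // block_dim
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            i = global_thread_index(block_idx, block_dim, thread_idx)
            # Bounds check: threads past the end of the array do nothing.
            if i < n:
                out[i] = a[i] + b[i]
    return out


result = launch_vector_add([1, 2, 3, 4, 5], [10, 20, 30, 40, 50])
```

With 5 elements and a block size of 4, two blocks are launched and the bounds check masks off the last three threads of the second block.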

timelapse #137 (16.5 hrs)
- first day back from the mountains
- got amazon prime, ordered and got soldering kit, esp32, other stuff
- ripping apart cursor
- had too much caffeine
- desk is messy and thats a good thing
- looking at upgrading to 128gb macbook pro
- almost couldnt sleep due to chinese model releases yesterday
Elliot Arledge • 26,955 views • 4 months ago

got k2 thinking working on vllm, completely maxxing out vram on an 8xH100 node i spun up. had to quantize the kv cache to fp8 and decrease seq len to 1024 or else it would OOM. this is not sped up (~3.0 toks/sec). this took about 2 hrs of tweaking serving settings. livestreaming this rn
Elliot Arledge • 39,295 views • 6 months ago
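
The post above shrinks the KV cache two ways: storing it in fp8 and capping sequence length at 1024. A back-of-the-envelope sketch of why both help, assuming a standard attention layout where every layer caches one key and one value vector per KV head per token. The model shape below is hypothetical and is not K2 Thinking's actual config; models with compressed or latent attention cache less than this formula suggests.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_elem):
    # Factor of 2 covers keys and values.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical shape, chosen only for illustration.
shape = dict(num_layers=60, num_kv_heads=8, head_dim=128)

# bf16 stores 2 bytes per element, fp8 stores 1.
bf16_bytes = kv_cache_bytes(**shape, seq_len=1024, batch_size=1, bytes_per_elem=2)
fp8_bytes = kv_cache_bytes(**shape, seq_len=1024, batch_size=1, bytes_per_elem=1)

print(f"bf16: {bf16_bytes / 2**20:.0f} MiB, fp8: {fp8_bytes / 2**20:.0f} MiB")
```

Cache size is linear in both sequence length and bytes per element, so halving either one halves the cache for a given batch.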

timelapse #35 (17 hrs):
- trained a theoretical reasoning chatML expert for later MoE merge
- broke apart multi-head latent attention
- sorted through HQQ, HQQ+, int8 optimizer quantization
- touched up on kv-cache optimization w/ paged attention
- planned for new years party in SF
- we cursed a qwen tokenizer then fixed it
- kicked off a merge for instructs, math, coders, and generalist 1.5B models into MoE
- learned how to only change chat templates without touching weights
Elliot Arledge • 68,843 views • 1 year ago

timelapse #72 (7 hrs):
- back in Canada and seriously couldn't think of taking a break (im having so much fun all day just dumping my heart into making my work the highest quality)
- set up new raspberry pi 5 to get the consistent timelapses going (and run some simple background tasks)
- very deep cuda book working session (using zed IDE)
- figured out how im going to articulate the hardest kernel optimizations to my readers
- more research into evolution of Nvidia tensor cores over the years and what they compile down to for each architecture
- steak dinner w/ family
- went for ice cream with a friend
- spaces w/ Adrian Dittmann
Elliot Arledge • 34,964 views • 8 months ago