
Elliot Arledge
@elliotarledge • 37,907 subscribers
KernelBench-Hard • i made the 12-hour CUDA course on FreeCodeCamp • "the timelapse guy" • books i wrote: https://t.co/2hzezpoDwQ • @shipfr8 alum

timelapse #147 (15.5 hrs)
- woke up to MiniMax (official) M3 launch including my kernelbench-hard (lowest score at below 30% which emphasizes the hardness)
- did a space with the minimax and together ai folks
- burned 1.1B tokens
- got nanogpt at nvfp4 training stability to match bf16. this is a prereq for another problem im trying to solve
- got my timelapse workflow nailed with a solid html page lol
- losing patience with anthropic rate limits
Elliot Arledge • 251,378 views • 1 day ago

Co-founder of Cerebras explains the simplified design of their WSE compared to classical GPUs made by NVIDIA.
Elliot Arledge • 174,837 views • 12 days ago

timelapse #83 (22 hrs):
- it was very easy to dive super deep into anything i needed to (this is what i focused on today because not all days are like this)
- finding the grok code fast 1 + grok 4 for deep thinking and verification combo to be super useful in cursor. speed was solid
- hard to imagine myself spending many more mental clock cycles in a 24 hr period
- had to pull out qwen3-next’s gated deltanet + linear attention from bleeding edge hf transformers to begin implementing a multi-gpu fp8 trainer from scratch. this is so damn bleeding edge and i underestimated how much effort this has taken and will require lol
- lots of diet coke and oats
- shipped the template which the core chapters of my book will be built on:
- all im missing now is flash attention 1/2 mastery (fa2 tmrw), intuition on making topk faster (for arbitrary row length), what i should and shouldnt teach in cutlass/cute, hopper/blackwell gemm kernel mastery (down to fp4). shoutout to Pranjal for making this easier for me. his blog post is amazing
- caught up w/ Mati Roy
- im feeling great mentally but not so good physically as im writing this and about to pass out
Elliot Arledge • 2,255,154 views • 8 months ago

timelapse #85 (27.5 hrs):
- currently cant rely on any other coding models except grok code fast 1 + grok 4 fast (for complex reasoning grok 4 fast is 20 cents for 1M tokens)
- wrote qwen3-next trainer entirely from scratch to make it more manageable
- each piece completely done by grok-code-fast-1 in cursor as it seems to handle this task pretty well without the grok 4 fast reasoning
- take on smaller problems and complete them quickly (makes it easier with 400 toks/sec over the api)
- got distributed fp8 qwen3-next trainer running at 0.8 seconds per step on 8xH100s (still need to finish checkpoint loading logic)
- perfect timing as the fp8 version of qwen3-next drops as im writing this
- ill be in LA in 2 days (will visit SF mid way through as well)
- 12.5% margarita
- steak dinner with family
- gained intuition on FlashAttention in very long context settings
- caught up w/ Kearm h/eng and Arnie Ramesh
Elliot Arledge • 283,820 views • 8 months ago

This is my favorite clip of the new Elon pod. He opens up saying xAI struggles with memory usage/bandwidth and CUDA kernel optimization (matmul, attention, MoE, etc). If you are good at kernel or performance engineering in general, you should apply. Steer the world in a better direction.
Elliot Arledge • 158,971 views • 4 months ago

timelapse #21 (12 hrs):
- leetcode practice for xAI coding test
- updated mnist-cuda (find repo in pinned) by adding a new CUDA training script w/ an extra hidden layer and a feature to show generalization
- notes on LLM training datasets, architectures, training config, etc
Elliot Arledge • 216,069 views • 1 year ago

timelapse #86 (15 hrs):
- got my first OOM on 8xB200 node
- defaulting back to grok-code-fast-1, the fastest reliable coding model with by far the most intuitive instruction following, combined with grok 4 fast reasoning to plan before i let grok code work its magic
- drank 2 large tim hortons iced capps, loaded myself w/ creatine, daily nootropics
- tried out gpt-5-codex but it simply doesnt match the speed i require when i go deep into one thing at a time sequentially
- got caught watching youtube videos in the middle, need to make sure i block any and all content that could get in my way
- caught up on all book revisions so getting super ahead with other chapters
- developed an overnight addiction to switching color themes in cursor
- did some pair programming w/ Kearm h/eng using Tuple on free trial
- applying for O-1
Elliot Arledge • 102,911 views • 8 months ago

in the lectures below, i hold your hand through low-level LLM systems engineering. it includes everything up to TODAY!
1) pytorch tensors
2) large matmul on cpu vs gpu
3) JAX (and why xAI uses it instead of pytorch)
4) raw cuda kernels and global thread indexing
5) triton design philosophy and softmax example
6) HIP kernels
7) mapping out the ENTIRE ecosystem + differences between CUDA and ROCm/HIP (BLAS, FFT, DNN)
8) cutlass and cute-dsl
9) pretraining, finetuning, rl, unsloth, axolotl, megatron-lm, deepspeed, nanogpt, nanochat
10) training vs inference, inference serving problems, throughput vs latency vs concurrency scaling, vllm, sglang, tensorrt-llm, tensorrt, llama.cpp, exllamav2, exllamav3, benchmark comparisons
11) projects/companies using llms to generate SOTA cuda/triton kernels
12) luminal inference
13) mojo/modular/max
Elliot Arledge • 57,855 views • 7 months ago

Qwen3.6 35B (3B active) with DFlash (c=1) running at 164 toks/sec decode on creative writing.
Elliot Arledge • 13,439 views • 1 month ago

timelapse #137 (16.5 hrs)
- first day back from the mountains
- got amazon prime, ordered and got soldering kit, esp32, other stuff
- ripping apart cursor
- had too much caffeine
- desk is messy and thats a good thing
- looking at upgrading to 128gb macbook pro
- almost couldnt sleep due to chinese model releases yesterday
Elliot Arledge • 26,955 views • 4 months ago

got k2 thinking working on vllm, completely maxxing out vram on the 8xH100 node i spun up. had to quantize the kv cache to fp8 and decrease seq len to 1024 or else it would OOM. this is not sped up (~3.0 toks/sec). this took about 2 hrs of tweaking serving settings. livestreaming this rn
Elliot Arledge • 39,295 views • 6 months ago

timelapse #35 (17 hrs):
- trained a theoretical reasoning chatML expert for later MoE merge
- broke apart multi-head latent attention
- sorted through HQQ, HQQ+, int8 optimizer quantization
- touched up on kv-cache optimization w/ paged attention
- planned for new years party in SF
- we cursed a qwen tokenizer then fixed it
- kicked off a merge for instructs, math, coders, and generalist 1.5B models into MoE
- learned how to only change chat templates without touching weights
Elliot Arledge • 68,843 views • 1 year ago

timelapse #72 (7 hrs):
- back in Canada and seriously couldn’t think of taking a break (im having so much fun all day just dumping my heart into making my work the highest quality)
- set up new raspberry pi 5 to get the consistent timelapses going (and run some simple background tasks)
- very deep cuda book working session (using zed IDE)
- figured out how im going to articulate the hardest kernel optimizations to my readers
- more research into evolution of Nvidia tensor cores over the years and what they compile down to for each architecture
- steak dinner w/ family
- went for ice cream with a friend
- spaces w/ Adrian Dittmann
Elliot Arledge • 34,964 views • 8 months ago

timelapse #100 (1438 hrs):
- wear headphones and watch til the end
Elliot Arledge • 26,558 views • 7 months ago