I implemented Google Research's TurboQuant as a CUDA-native compression engine on a Blackwell B200: 5x KV cache compression on Qwen 2.5-1.5B, near-lossless attention scores, generating live from compressed memory. 5 custom cuTile CUDA kernels featuring: - fused attention (with QJL corrections) - online softmax - on-chip cache decompression - pipelined TMA...

ani

6,794 subscribers

806,592 views • 2 months ago • via X (Twitter)

Science & Technology
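
The "online softmax" item in the post above refers to computing softmax-weighted sums in a single pass, which is what lets a fused attention kernel stream over keys and values tile by tile. The following is a minimal CPU sketch in Python, not the author's cuTile kernel; the function names and the scalar values are illustrative assumptions.

```python
import math


def online_softmax_weighted_sum(scores, values):
    """One pass over (score, value) pairs with a running max and normalizer."""
    running_max = float("-inf")
    normalizer = 0.0
    accumulator = 0.0
    for score, value in zip(scores, values):
        new_max = max(running_max, score)
        # Rescale everything accumulated under the previous max.
        correction = math.exp(running_max - new_max)
        weight = math.exp(score - new_max)
        normalizer = normalizer * correction + weight
        accumulator = accumulator * correction + weight * value
        running_max = new_max
    return accumulator / normalizer


def two_pass_reference(scores, values):
    """Conventional softmax: find the max first, then normalize."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return sum(e * v for e, v in zip(exps, values)) / total


if __name__ == "__main__":
    scores = [1.0, 3.0, -2.0, 0.5]
    values = [10.0, 20.0, 30.0, 40.0]
    print(online_softmax_weighted_sum(scores, values))
    print(two_pass_reference(scores, values))
```

Both functions should agree up to floating-point rounding; the single-pass form never needs the full score vector in memory at once.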

Related Videos

I trained a 12M parameter LLM on my own ML framework using a Rust backend and CUDA kernels for flash attention, AdamW, and more. Wrote the full transformer architecture, and BPE tokenizer from scratch. The framework features: - Custom CUDA kernels (Flash Attention, fused LayerNorm, fused GELU) for 3x increased throughput - Automatic WebGPU fallback for non-NVIDIA devices - TypeScript API with Rust compute backend - One npm install to get started, prebuilt binaries for every platform Try out the model for yourself: Built with Reese Chong. Check out the repos and blog if you want to learn more. Shoutout to Modal for the compute credits allowing me to train on 2 A100 GPUs without going broke cc sunny madra Gavin

Aadi Kulshrestha

810,145 views • 2 months ago

Sentra just killed Google Research's TurboQuant. SpectralQuant: 5.95× KV cache compression on Mistral 7B at +7.5% perplexity overhead. TurboQuant at the same compression: +22%. 3× less degradation. 15-second calibration. Once per model, then drop-in for any HuggingFace LLM, ViT, ESM, AlphaFold Evoformer, or VideoMAE. Check out the findings and how the mechanism works below. ↓

Ashwin Gopinath

59,026 views • 1 month ago

I added KV caching and INT8 KV quantization to our transformer inference, improving throughput by 35x. All of this was done from scratch in Rust + CUDA, on top of a homemade ML framework. On a 4-token prompt with 252 generated tokens: - Original: 0.76 tok/s - KV cache fp32: 27.21 tok/s - KV cache int8 (quantized): 27.29 tok/s Try it out yourself here: In practice: - KV caching gave us about a 35x end-to-end speedup - INT8 KV cache kept roughly the same speed as fp32 but cut KV cache memory by 3.78x FP32 cache used 4.5 MB in this run while the INT8 cache used only 1.19 MB This simple change to inference created a huge impact on performance. To learn more about the KV cache and other optimizations like this, check out the blog at

Reese Chong

52,588 views • 2 months ago
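
The INT8 KV cache described in the post above can be illustrated with symmetric per-vector quantization. This is a hedged Python sketch, not the Rust + CUDA code from the post; the function names, the one-scale-per-vector layout, and the `head_dim` of 64 are assumptions chosen for illustration.

```python
def quantize_int8(vec):
    """Symmetric per-vector quantization: one float scale plus int8 codes."""
    max_abs = max(abs(x) for x in vec)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(x / scale))) for x in vec]
    return scale, codes


def dequantize_int8(scale, codes):
    """Recover approximate floats from the stored scale and codes."""
    return [c * scale for c in codes]


def compression_ratio(head_dim, scale_bytes=4):
    """FP32 bytes divided by INT8 bytes, counting one scale per vector."""
    fp32_bytes = 4 * head_dim
    int8_bytes = head_dim + scale_bytes
    return fp32_bytes / int8_bytes


if __name__ == "__main__":
    scale, codes = quantize_int8([0.5, -1.27, 0.0, 1.0])
    print(codes, dequantize_int8(scale, codes))
    print(compression_ratio(64))
```

Storing a scale next to each vector keeps the ratio slightly below the ideal 4x, which is one plausible reason a measured figure such as 4.5 MB / 1.19 MB ≈ 3.78x lands under 4x; the post does not state its actual layout.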

YOUR PARENTS PAID FOR THE CUDA MOAT! The #1 contributor to the CUDA MOAT isn't the developers at NVIDIA but the millions of developers outside NVIDIA who invent new algorithms for CUDA, like Flash Attention. For most of them, it started with a GeForce gaming GPU. NVIDIA is the only company that has a reasonably good developer stack on consumer-grade GPUs. As people grow up beyond playing CSGO & League of Legends & Minecraft, they either become anime weeaboos or they start programming on their existing computer, which has a GeForce GPU

SemiAnalysis

25,230 views • 2 months ago

LTX-2.3 in 4.5s? A 5s 1080p video gen in a blink? FastVideo makes video gen feel almost interactive. - On ONE GPU - NVFP4 quantization + fused Blackwell kernels. - Yeah… a B200 Try this madness yourself

Wildminder

21,462 views • 3 months ago

i just beat Google DeepMind's turboquant introducing Shard. 10x KV cache compression on Llama-3.1-8B. zero quality loss - 10x @ 8K context, 11.2x @ 32K - NIAH recall 1.000 across 4K-32K - LongBench Δ ≈ 0 vs FP16 turboquant tops out at 4-6x at the same quality. we doubled it. read more: Kirri

Krish

154,488 views • 1 month ago

Stephen Jones, NVIDIA CUDA Architect, put it best. What excites him most about CUDA Tile is not just what it does on paper, but what developers will build with it in ways the team never imagined.

NVIDIA Newsroom

14,488 views • 6 months ago

You can run CUDA on a Mac ARM GPU, in the browser. It sounds ridiculous but it actually works. HipScript chains CUDA, to OpenCL, to Vulkan, to Tint (Google’s shader translator), to WASM and WebGPU. I got a plasma simulation running in just a few minutes, no NVIDIA GPU!

LaurieWired

160,223 views • 1 year ago

Introducing The AI CUDA Engineer: An agentic AI system that automates the production of highly optimized CUDA kernels. The AI CUDA Engineer can produce highly optimized CUDA kernels, reaching 10-100x speedup over common machine learning operations in PyTorch. Our system is also able to produce highly optimized CUDA kernels that are much faster than existing CUDA kernels commonly used in production. We believe that fundamentally, AI systems can and should be as resource-efficient as the human brain, and that the best path to achieve this efficiency is to use AI to make AI more efficient! We are excited to publish our paper, The AI CUDA Engineer: Agentic CUDA Kernel Discovery, Optimization and Composition. We also release a dataset of over 17,000 verified CUDA kernels produced by The AI CUDA Engineer. Paper: Kernel Archive Webpage: HuggingFace Dataset: The AI CUDA Engineer utilizes evolutionary LLM-driven code optimization to autonomously improve the runtime of machine learning operations. Our system is not only able to convert PyTorch code into CUDA kernels, but through the use of evolution, it can also optimize the runtime performance of CUDA kernels, fuse multiple operations, and even discover novel solutions for writing efficient CUDA operations by learning from past innovations! We believe The AI CUDA Engineer opens a new era of AI-driven acceleration of AI and automated inference time optimization. We (Robert Lange, Aaditya Prasad 🇺🇸, Suuun, Maxence Faldor, Yujin Tang, hardmaru) are excited to continue Sakana AI's mission of leveraging AI to improve AI.

Sakana AI

1,149,339 views • 1 year ago

Your local AI just got up to 5x more memory. Same model. Same device. Nearly zero accuracy loss. QVAC SDK 0.12.0 integrates TurboQuant - Google Research's latest memory optimisation algorithm. What is TurboQuant? The KV cache is the memory your model uses to track a conversation. As context grows, it fills up fast. 32K tokens. 64K. Game over. TurboQuant compresses it up to 5x with no accuracy loss. What does it unlock for you? Your app had a 16K token ceiling? It's now 96K. On the same device. Just update the QVAC SDK to get up to 5x more efficiency. No code changes. All from one SDK. The TurboQuant integration unlocks sovereign intelligence for more people, on more devices. Learn more →

QVAC

15,797,496 views • 23 days ago

The latest MLX has a CUDA back-end! To get started: pip install "mlx[cuda]" With the same codebase you can develop locally, run your model on Apple silicon, or in the cloud on Nvidia GPUs. MLX is designed around Apple silicon - which has a unified memory architecture. It uses the same design with CUDA. So there's no need to move arrays around from CPU memory to GPU memory. Note, this is early days - some operations are missing and performance is still being optimized. But it's already quite fast for Transformer training, text generation, and more! Here's a demo using mlx-lm to generate text with Llama 3 8B (bf16) on an A100:

Awni Hannun

42,761 views • 11 months ago

Nvidia announces the new RTX Spark, a new platform powered by the NX1 CPU, and shows off Spark laptops running 007 First Light and Forza 6. The CPU has 20 ARM-based cores and a Blackwell RTX GPU with 6144 CUDA Cores. This is the same core count as a 5070, but with 128GB of unified LPDDR5X memory sitting in the same package as the CPU and GPU. The entire Nvidia software stack is available, particularly CUDA, vital for AI. Nvidia's new laptops will likely be ideal for running local LLMs because the unified memory means you can load models up to 120-180B parameters (quantized). These laptops are expected to ship later this year and could become strong competitors to high-end MacBooks and even Mac Studios for local AI workloads, thanks to CUDA support and unified memory. Price is unannounced.

Grummz

30,573 views • 23 days ago

🎬 Higgsfield AI 🧩 is accelerating professional video production, enabling millions of creators to produce content at unprecedented speed and scale, powered by NVIDIA Blackwell. They utilized our accelerated computing platform for: ✅ 30% faster training on NVIDIA HGX B200 and B300 systems, including on Nebius ✅ Infrastructure reliability through NVIDIA NVSentinel and DCGM ✅ Advanced creative control and optimized performance with NVIDIA CUDA libraries and InfiniBand networking Learn more ➡️

NVIDIA AI Infrastructure

59,447 views • 1 month ago

Yesterday we announced that the QVAC SDK update unlocked up to 5x more context on your device thanks to TurboQuant. Today, we’ll go through how we got there. TurboQuant (Google Research, ICLR 2026) is a two-stage KV-cache compression algorithm. Stage 1 - PolarQuant: convert KV vectors from Cartesian (x, y, z...) to polar coordinates. Angles compress predictably down to 3-4 bits. Stage 2 - QJL: 1-bit Johnson-Lindenstrauss correction. Cleans up residual error. Total: ~4-5 bits per value. No retraining. No calibration. QVAC ported it to Vulkan inside qvac-fabric-llm.cpp. Currently, TurboQuant is supported only for AMD & NVIDIA GPUs, support for iOS, Android & Apple Silicon coming next. Full algorithm walkthrough + benchmarks + code examples →

QVAC

14,467,728 views • 22 days ago
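
The Cartesian-to-polar step described in the post above can be illustrated on a single coordinate pair. This is a toy Python sketch and not the published PolarQuant or TurboQuant algorithm: it keeps the radius at full precision, quantizes only the angle to a fixed number of bits, and omits the QJL correction stage entirely. The function names and the 4-bit default are assumptions.

```python
import math


def polar_quantize_pair(x, y, bits=4):
    """Convert (x, y) to polar form and store the angle as a `bits`-bit code."""
    radius = math.hypot(x, y)
    angle = math.atan2(y, x)  # in [-pi, pi]
    levels = 1 << bits
    step = 2 * math.pi / levels
    code = round(angle / step) % levels  # wrap negative angles into [0, levels)
    return radius, code


def polar_dequantize_pair(radius, code, bits=4):
    """Rebuild an approximate (x, y) from the radius and the angle code."""
    step = 2 * math.pi / (1 << bits)
    angle = code * step
    return radius * math.cos(angle), radius * math.sin(angle)


if __name__ == "__main__":
    radius, code = polar_quantize_pair(3.0, 4.0)
    print(radius, code)
    print(polar_dequantize_pair(radius, code))
```

Because only the angle is rounded, the reconstructed pair keeps its original length and the error is bounded by the radius times half the angular step.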

Google's Gemma 4 26B A4B QAT hits 25+ tokens/sec and 320+ tokens/sec prefill on 8 GB VRAM (RTX 4060) + 16 GB RAM using TurboQuant. Prefill just went from 200 → 320+ tok/s on the same 8GB card. 1.6x, no new hardware, no new quant, just a KV cache trick stacked on top of the Gemma 4 26B MoE setup from a few days ago. A few days ago I posted Gemma 4 26B A4B hitting 28 tok/s decode on 8GB VRAM using native MTP. Prefill was stuck around 200 tok/s. Fair callout by the community. So today I tested something I'd already been meaning to try: TheTom/llama-cpp-turboquant, the TurboQuant KV cache fork by Tom Turney (github link in the comments). Thanks to him, the fork just got resynced to mainline, so MTP + TurboQuant now run together cleanly (I didn't see any meaningful gains by using MTP with this setup, though you can try). The flags (No MTP): -m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf -cnv -c 64000 --cache-type-k q8_0 --cache-type-v turbo3 Results on the same RTX 4060 8GB, tested with a 27k token prompt at 64k context loaded: Prefill: 200 tok/s → 320+ tok/s Decode: stayed above 25 tok/s (without MTP) Why it works: TurboQuant uses Walsh-Hadamard rotation + polar quantization on the KV cache. Keys are sensitive to compression, values aren't much, so it splits the difference: K stays at q8_0, V drops to turbo3 (~3 bits). Bonus from the memory savings: same 8GB card can now stretch to 100-120k context with minimal decode penalty. It should now be snappier with any agent harness such as hermes agent without compromise on intelligence. If you're already running Gemma 4 on a small card, this stacks on top for free. Try --cache-type-k q8_0 --cache-type-v turbo3 on your setup and report back what your prefill/decode split looks like. unsloth model gguf and llama.cpp turboquant fork links in the comments. What's your prefill number before vs after?

Alok

117,304 views • 7 days ago
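
The Walsh-Hadamard rotation mentioned in the post above is an orthonormal transform that spreads a vector's energy across all coordinates before quantization. Below is a minimal Python sketch of the fast transform itself; it is not code from the llama.cpp fork, and whether that fork adds random sign flips or other steps is not something this sketch assumes. The function name `fwht` is illustrative.

```python
import math


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform; length must be a power of two."""
    n = len(vec)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(vec)
    half = 1
    while half < n:
        # Butterfly step: combine elements that are `half` apart.
        for start in range(0, n, 2 * half):
            for i in range(start, start + half):
                a, b = out[i], out[i + half]
                out[i], out[i + half] = a + b, a - b
        half *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]


if __name__ == "__main__":
    spiky = [4.0, 0.0, 0.0, 0.0]
    rotated = fwht(spiky)
    print(rotated)        # the single outlier is spread over all coordinates
    print(fwht(rotated))  # applying the transform twice recovers the input
```

With the 1/sqrt(n) scaling the transform preserves vector norms and is its own inverse, so a quantizer can rotate, compress, and rotate back without a separate inverse matrix.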

We’re thrilled to open-source TriAttention! 🚀 🦞 Deploy OpenClaw (32B LLM) on a single 24GB RTX 4090 locally 💻Full code open-source & vLLM-ready for one-click deployment ⚡️ 2.5× faster inference speed & 10.7× less KV cache memory usage TriAttention is a novel KV cache compression method built on rigorous trigonometric analysis in the Pre‑RoPE space for efficient LLM long reasoning. Github Repo: Paper Link: Homepage:

Yukang Chen

197,211 views • 2 months ago

in the lectures below, i hold your hand through low-level LLM systems engineering. it includes everything up to TODAY! 1) pytorch tensors 2) large matmul on cpu vs gpu 3) JAX (and why xAI uses it instead of pytorch) 4) raw cuda kernels and global threading indexing 5) triton design philosophy and softmax example 6) HIP kernels 7) mapping out the ENTIRE ecosystem + differences between CUDA and ROCm/HIP (BLAS, FFT, DNN) 8) cutlass and cute-dsl 9) pretraining, finetuning, rl, unsloth, axolotl, megatron-lm, deepspeed, nanogpt, nanochat 10) training vs inference, inference serving problems, throughput vs latency vs concurrency scaling, vllm, sglang, tensorrt-llm, tensorrt, llama.cpp, exllamav2, exllamav3, benchmark comparisons 11) projects/companies using llms to generate SOTA cuda/triton kernels 12) luminal inference 13) mojo/modular/max

Elliot Arledge

57,855 views • 8 months ago

🎉 Congrats to MiniMax (official) on releasing MiniMax M3! Frontier coding and agentic capabilities, native image and video input, computer use, and a 1M-token context window, all in a single open model. At the heart of M3 is MSA, a new sparse attention architecture: instead of attending densely over the full KV cache, each query scores 128-token KV blocks and runs attention only over the top blocks. That is what makes 1M-token context practical to serve. M3 runs in vLLM with day-0 support, verified on NVIDIA and AMD hardware: ✨ MSA sparse attention with dedicated prefill and decode kernels ✨ 1M-token context serving with prefix caching and chunked prefill ✨ BF16 and MXFP8 checkpoints, with MoE backends for both Hopper and Blackwell ✨ Native multimodal input (image + video) ✨ Tool calling, reasoning parsing, and thinking-mode control for agent workloads Day-0 support like this is a true team effort. Grateful to the teams at MiniMax (official), NVIDIA AI, AI at AMD, and Inferact, and to the vLLM community for making it happen. 🙏 Deep dive into the implementation, kernel work, and deployment recipes: 🔗

vLLM

39,777 views • 12 days ago
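
The block-selection idea described in the post above (score KV blocks per query, then attend only over the top blocks) can be sketched in plain Python. This is not MiniMax's MSA or a vLLM kernel: scoring a block by the query's dot product with the block's mean key is an assumption made here for illustration, as are the function name, the missing 1/sqrt(d) scaling, and the small block size used in the demo.

```python
import math


def block_sparse_attention(query, keys, values, block_size=128, top_blocks=2):
    """Score KV blocks, keep the best ones, and run softmax attention inside them."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    blocks = [range(start, min(start + block_size, len(keys)))
              for start in range(0, len(keys), block_size)]

    # Assumption: a block's score is the query dotted with the block's mean key.
    def block_score(block):
        mean_key = [sum(keys[i][d] for i in block) / len(block)
                    for d in range(len(query))]
        return dot(query, mean_key)

    ranked = sorted(range(len(blocks)),
                    key=lambda b: block_score(blocks[b]), reverse=True)
    kept = sorted(ranked[:top_blocks])
    token_ids = [i for b in kept for i in blocks[b]]

    # Dense softmax attention restricted to the selected tokens.
    scores = [dot(query, keys[i]) for i in token_ids]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    output = [sum(w * values[i][d] for w, i in zip(weights, token_ids)) / total
              for d in range(len(values[0]))]
    return kept, output


if __name__ == "__main__":
    keys = [[0.0, 1.0], [0.0, 1.0], [5.0, 0.0], [5.0, 0.0]]
    values = [[1.0], [1.0], [9.0], [9.0]]
    print(block_sparse_attention([1.0, 0.0], keys, values,
                                 block_size=2, top_blocks=1))
```

The cost of the attention step scales with the number of kept tokens rather than the full cache length, which is the property the post credits for making very long contexts practical to serve.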

CUDA MODE hackathon today! Here's Andrej Karpathy on the 🏖️ origin story of llm.c, and what it hints at for the fast, simple, llm-compiled future of custom software.

swyx

97,440 views • 1 year ago

What's inside a GPU? Modern GPUs are marvels of engineering, packing billions of transistors into a single chip. The NVIDIA RTX 3090 alone holds around 28 billion, while the latest Blackwell B200 GPU has already crossed 100 billion.

Massimo

40,811 views • 9 months ago