Video wird geladen...

Video konnte nicht geladen werden

Beim Laden dieses Videos ist ein Problem aufgetreten. Dies könnte an einem vorübergehenden Netzwerkproblem liegen oder das Video ist möglicherweise nicht verfügbar.

DFlash speculative decoding on Apple Silicon Qwen3.5-9B bf16 · M5 Max · greedy exact match ▸ 85 tok/s, 3.3× at 1024 tokens (runtime) ▸ ~70 tok/s, 2.6× in the video (terminal I/O overhead) ▸ 80 tok/s, 3.1× at 2048 tokens (runtime) Currently working on: → Long context (speedup degrades... show more

bstn 👁️

1,460 subscribers

36,942 Aufrufe • vor 3 Monaten •via X (Twitter)

Anya Rossi• Live Now

Private livecam show

0 Kommentare

Keine Kommentare verfügbar

Kommentare vom Original-Post werden hier angezeigt

Ähnliche Videos

DFlash v0.1.4 : custom Metal verify kernels for quantized Qwen3 hybrid models, plus significant peak memory reduction at long context. M5 Max 40-core GPU, 64GB, stock mlx_lm baseline: Qwen3.6-35B-A3B-4bit: ► @ 1024 · 138.3 → 300.3 tok/s (2.20x) ► @ 2048 · 135.6 → 246.4 tok/s (1.81x) ► @ 4096 · 134.5 → 208.4 tok/s (1.56x) ► @ 8192 · 133.2 → 177.4 tok/s (1.33x) Qwen3.5-27B-4bit: ► @ 1024 · 33.5 → 79.0 tok/s (2.37x) ► @ 2048 · 33.1 → 70.2 tok/s (2.12x) ► @ 4096 · 31.5 → 55.7 tok/s (1.77x) ► @ 8192 · 33.9 → 45.3 tok/s (1.34x) Working on making this usable for agentic workloads goal is to never drop below baseline at any context depth. LLM decode is memory-bandwidth bound. M5 Max runs at 614 GB/s, that's 1.5x more than M1-M4 Max (400-410 GB/s). Results will vary on lower bandwidth chips.

DFlash v0.1.4 : custom Metal verify kernels for quantized Qwen3 hybrid models, plus significant peak memory reduction at long context. M5 Max 40-core GPU, 64GB, stock mlx_lm baseline: Qwen3.6-35B-A3B-4bit: ► @ 1024 · 138.3 → 300.3 tok/s (2.20x) ► @ 2048 · 135.6 → 246.4 tok/s (1.81x) ► @ 4096 · 134.5 → 208.4 tok/s (1.56x) ► @ 8192 · 133.2 → 177.4 tok/s (1.33x) Qwen3.5-27B-4bit: ► @ 1024 · 33.5 → 79.0 tok/s (2.37x) ► @ 2048 · 33.1 → 70.2 tok/s (2.12x) ► @ 4096 · 31.5 → 55.7 tok/s (1.77x) ► @ 8192 · 33.9 → 45.3 tok/s (1.34x) Working on making this usable for agentic workloads goal is to never drop below baseline at any context depth. LLM decode is memory-bandwidth bound. M5 Max runs at 614 GB/s, that's 1.5x more than M1-M4 Max (400-410 GB/s). Results will vary on lower bandwidth chips.

bstn 👁️

23,120 Aufrufe • vor 3 Monaten

dflash-mlx v0.1.7 is out. Big adaptive-runtime update, still focused mostly on Qwen3.6 27B 4-bit. @ 2048 tokens, M5 Max, stock mlx_lm baseline: ► 1024: 33.26 → 98.05 tok/s (x2.95) ► 2048: 32.34 → 90.67 tok/s (x2.81) ► 4096: 30.58 → 93.55 tok/s (x3.06) ► 8192: 26.03 → 79.12 tok/s (x3.04) ► 16384: 21.50 → 60.77 tok/s (x2.78) Main change: adaptive verify got a lot smarter. Instead of blindly trying to verify large 16-token blocks all the time, DFlash now watches acceptance + tokens/cycle + real cycle cost. When the draft gets weaker, it drops to smaller 4-token blocks, then probes back up only when the recent cycles make sense. In practice: less wasted verify work, better long-context behavior, and much more useful metrics to understand what is happening. ► retuned adaptive verify for long-context / agentic decode ► richer metrics: tokens/cycle, adaptive block state, CopySpec counters ► /metrics now has real decode avg + logical/real/restored prefill rates ► AIME25 benchmark suite with exact integer scoring ► Qwen thinking default now follows tokenizer/request behavior ► GDN recurrent exactness fixes I also started running AIME25-style long generations. Even around 45k generated tokens, I was still seeing ~40 tok/s on 27B 4-bit. Over the next few days I’ll share more demos: AIME runs, real OpenCode game/project sessions, and full metrics along the way. Still optimizing hard for 27B 4-bit first, while working on custom kernels per Apple GPU generation so more machines can benefit.

dflash-mlx v0.1.7 is out. Big adaptive-runtime update, still focused mostly on Qwen3.6 27B 4-bit. @ 2048 tokens, M5 Max, stock mlx_lm baseline: ► 1024: 33.26 → 98.05 tok/s (x2.95) ► 2048: 32.34 → 90.67 tok/s (x2.81) ► 4096: 30.58 → 93.55 tok/s (x3.06) ► 8192: 26.03 → 79.12 tok/s (x3.04) ► 16384: 21.50 → 60.77 tok/s (x2.78) Main change: adaptive verify got a lot smarter. Instead of blindly trying to verify large 16-token blocks all the time, DFlash now watches acceptance + tokens/cycle + real cycle cost. When the draft gets weaker, it drops to smaller 4-token blocks, then probes back up only when the recent cycles make sense. In practice: less wasted verify work, better long-context behavior, and much more useful metrics to understand what is happening. ► retuned adaptive verify for long-context / agentic decode ► richer metrics: tokens/cycle, adaptive block state, CopySpec counters ► /metrics now has real decode avg + logical/real/restored prefill rates ► AIME25 benchmark suite with exact integer scoring ► Qwen thinking default now follows tokenizer/request behavior ► GDN recurrent exactness fixes I also started running AIME25-style long generations. Even around 45k generated tokens, I was still seeing ~40 tok/s on 27B 4-bit. Over the next few days I’ll share more demos: AIME runs, real OpenCode game/project sessions, and full metrics along the way. Still optimizing hard for 27B 4-bit first, while working on custom kernels per Apple GPU generation so more machines can benefit.

bstn 👁️

16,334 Aufrufe • vor 2 Monaten

I added KV caching and INT8 KV quantization to our transformer inference, improving throughput by 35x. All of this was done from scratch in Rust + CUDA, on top of a homemade ML framework. On a 4-token prompt with 252 generated tokens: - Original: 0.76 tok/s - KV cache fp32: 27.21 tok/s - KV cache int8 (quantized): 27.29 tok/s Try it out yourself here: In practice: - KV caching gave us about a 35x end-to-end speedup - INT8 KV cache kept roughly the same speed as fp32 but cut KV cache memory by 3.78x FP32 cache used 4.5 MB in this run while the INT8 cache used only 1.19 MB This simple change to inference created a huge impact on performance. To learn more about the KV cache and other optimizations like this, check out the blog at

I added KV caching and INT8 KV quantization to our transformer inference, improving throughput by 35x. All of this was done from scratch in Rust + CUDA, on top of a homemade ML framework. On a 4-token prompt with 252 generated tokens: - Original: 0.76 tok/s - KV cache fp32: 27.21 tok/s - KV cache int8 (quantized): 27.29 tok/s Try it out yourself here: In practice: - KV caching gave us about a 35x end-to-end speedup - INT8 KV cache kept roughly the same speed as fp32 but cut KV cache memory by 3.78x FP32 cache used 4.5 MB in this run while the INT8 cache used only 1.19 MB This simple change to inference created a huge impact on performance. To learn more about the KV cache and other optimizations like this, check out the blog at

Reese Chong

52,588 Aufrufe • vor 3 Monaten

DFlash makes Qwen 2.2x faster with no quality loss! We ran the same Qwen3.6-27B locally three ways on one RTX 6000: baseline, MTP, DFlash. The tasks only differ in one thing - how predictable the next word is: quicksort, describe a file in JSON, a logic puzzle, a sci-fi story. Outputs: Baseline: 44 tok/s · 1.00x MTP: 65 tok/s · 1.45x · 71% accepted DFlash: 98 tok/s · 2.20x · 30% accepted Baseline writes one token per step. MTP works inside the model itself and guesses 3 tokens ahead. DFlash is a separate small model that writes 15 tokens at once, and the big model only checks them. In JSON the same words repeat all the time, so most guesses were right: 152 tok/s, 3.4x speedup. In the story 9 guesses out of 10 were wrong. DFlash did all that extra work for nothing and became slower than baseline: 42 vs 44 tok/s. MTP guesses only 3 tokens, so a wrong guess costs very little: 46 tok/s and the win in that round. The output is identical in all three modes - DFlash is the pick for tasks with predictable output, like coding, and for chat and creative writing MTP works better. DFlash is now natively integrated into Atomic Chat on llama.cpp - speed up your Qwen models!

DFlash makes Qwen 2.2x faster with no quality loss! We ran the same Qwen3.6-27B locally three ways on one RTX 6000: baseline, MTP, DFlash. The tasks only differ in one thing - how predictable the next word is: quicksort, describe a file in JSON, a logic puzzle, a sci-fi story. Outputs: Baseline: 44 tok/s · 1.00x MTP: 65 tok/s · 1.45x · 71% accepted DFlash: 98 tok/s · 2.20x · 30% accepted Baseline writes one token per step. MTP works inside the model itself and guesses 3 tokens ahead. DFlash is a separate small model that writes 15 tokens at once, and the big model only checks them. In JSON the same words repeat all the time, so most guesses were right: 152 tok/s, 3.4x speedup. In the story 9 guesses out of 10 were wrong. DFlash did all that extra work for nothing and became slower than baseline: 42 vs 44 tok/s. MTP guesses only 3 tokens, so a wrong guess costs very little: 46 tok/s and the win in that round. The output is identical in all three modes - DFlash is the pick for tasks with predictable output, like coding, and for chat and creative writing MTP works better. DFlash is now natively integrated into Atomic Chat on llama.cpp - speed up your Qwen models!

atomic.chat

75,519 Aufrufe • vor 14 Tagen

Researchers found a way to make LLMs 8.5x faster! (without compromising accuracy) Speculative decoding is quite an effective way to address the single-token bottleneck in traditional LLM inference. A small "draft" model first generates the next several tokens, then the large model verifies all of them at once in a single forward pass. If a token at any position is wrong, you keep everything before it and restart from there. This never does worse than normal decoding. But current drafters in Speculative decoding still guess one token at a time. That makes the drafting step itself a bottleneck, capping real-world speedups at 2-3x. DFlash is a new technique that swaps the autoregressive drafter with a lightweight block diffusion model that guesses all tokens in one parallel shot. Drafting cost stays flat no matter how many tokens you speculate. On top of that, the drafter is conditioned on hidden features pulled from multiple layers of the target model and injected into every draft layer, so it makes significantly better guesses than a drafter working from scratch. In the side-by-side demo below, vanilla decoding runs at 48.5 tokens/sec. DFlash hits 415 tokens/sec on the same model, with zero quality loss. It's already integrated with vLLM, SGLang, and Transformers, with draft models on HuggingFace for several models like Qwen3, Qwen3.5, Llama 3.1, Kimi-K2.5, gpt-oss, and many more. I have shared the GitHub repo in the replies! KV caching is another must-know technique to boost LLM inference. I recently wrote an article about it. Read it below. 👉 Over to you: What use case are you working on that can benefit from this new technique?

Researchers found a way to make LLMs 8.5x faster! (without compromising accuracy) Speculative decoding is quite an effective way to address the single-token bottleneck in traditional LLM inference. A small "draft" model first generates the next several tokens, then the large model verifies all of them at once in a single forward pass. If a token at any position is wrong, you keep everything before it and restart from there. This never does worse than normal decoding. But current drafters in Speculative decoding still guess one token at a time. That makes the drafting step itself a bottleneck, capping real-world speedups at 2-3x. DFlash is a new technique that swaps the autoregressive drafter with a lightweight block diffusion model that guesses all tokens in one parallel shot. Drafting cost stays flat no matter how many tokens you speculate. On top of that, the drafter is conditioned on hidden features pulled from multiple layers of the target model and injected into every draft layer, so it makes significantly better guesses than a drafter working from scratch. In the side-by-side demo below, vanilla decoding runs at 48.5 tokens/sec. DFlash hits 415 tokens/sec on the same model, with zero quality loss. It's already integrated with vLLM, SGLang, and Transformers, with draft models on HuggingFace for several models like Qwen3, Qwen3.5, Llama 3.1, Kimi-K2.5, gpt-oss, and many more. I have shared the GitHub repo in the replies! KV caching is another must-know technique to boost LLM inference. I recently wrote an article about it. Read it below. 👉 Over to you: What use case are you working on that can benefit from this new technique?

Avi Chawla

157,390 Aufrufe • vor 2 Monaten

Hot off the press 🔥 AEON vLLM Ultimate → v0.24.0. Link below ⤵️ The ultimate container for AI on NVIDIA DGX Spark — vLLM compiled from source for GB10/sm_121a. 🔓 UNLOCKED (impossible until this release): ▸ DFlash + FP8 KV cache — 2× KV capacity, draft acceptance intact, measured on-device ▸ FP8 KV on vision models — the KV-bandwidth lever is back ▸ DFlash on FlashInfer & Triton backends (was FlashAttention-only) 🛠 Carried ahead of upstream (still-open PRs, merged + validated): ▸ Triton NVFP4 KV cache → up to 3× KV capacity ▸ DFlash sliding-window drafters + prefix-cache stability for DFlash ▸ High-concurrency DFlash — clean through c=64 🧰 Fixes stock v0.24.0 doesn't have: ▸ Gemma-4 tied embeddings (ModelOpt) — stock crashes at load, ours doesn't ▸ Tool calling out of the box (stock --no-deps builds 500) ▸ Unified-memory hardened — no silent KV overcommit on the 121 GB pool 📊 One image, the whole fleet: Gemma-4-26B-A4B · Qwen3.6-27B · Qwen3.6-35B-A3B (hybrid GDN) 536 tok/s @ c16 · 472 tok/s @ c12 · 262K context · image+video · tools + reasoning parsers · full CUDA-graph capture 🐳 docker pull

Hot off the press 🔥 AEON vLLM Ultimate → v0.24.0. Link below ⤵️ The ultimate container for AI on NVIDIA DGX Spark — vLLM compiled from source for GB10/sm_121a. 🔓 UNLOCKED (impossible until this release): ▸ DFlash + FP8 KV cache — 2× KV capacity, draft acceptance intact, measured on-device ▸ FP8 KV on vision models — the KV-bandwidth lever is back ▸ DFlash on FlashInfer & Triton backends (was FlashAttention-only) 🛠 Carried ahead of upstream (still-open PRs, merged + validated): ▸ Triton NVFP4 KV cache → up to 3× KV capacity ▸ DFlash sliding-window drafters + prefix-cache stability for DFlash ▸ High-concurrency DFlash — clean through c=64 🧰 Fixes stock v0.24.0 doesn't have: ▸ Gemma-4 tied embeddings (ModelOpt) — stock crashes at load, ours doesn't ▸ Tool calling out of the box (stock --no-deps builds 500) ▸ Unified-memory hardened — no silent KV overcommit on the 121 GB pool 📊 One image, the whole fleet: Gemma-4-26B-A4B · Qwen3.6-27B · Qwen3.6-35B-A3B (hybrid GDN) 536 tok/s @ c16 · 472 tok/s @ c12 · 262K context · image+video · tools + reasoning parsers · full CUDA-graph capture 🐳 docker pull

ÆON FORGE ✨

24,715 Aufrufe • vor 25 Tagen

nvidia's 3B mamba destroyed alibaba's 3B deltanet on the same RTX 3090. only 24 days between releases. same active parameters, same VRAM tier, completely different architectures. nemotron cascade 2: 187 tok/s. flat from 4K to 625K context. zero speed loss. flags: -ngl 99 -np 1. that's it. no context flags, no KV cache tricks. auto-allocates 625K. qwen 3.5 35B-A3B: 112 tok/s. flat from 4K to 262K context. zero speed loss. flags: -ngl 99 -np 1 -c 262144 --cache-type-k q8_0 --cache-type-v q8_0. needed KV cache quantization to fit 262K. both models held a flat line across every context level. both architectures are context-independent. but nvidia's mamba2 is 67% faster at generating tokens on the exact same hardware and needs fewer flags to get there. same node, same GPU, same everything. the only variable is the model. gold medal math olympiad winner running at 187 tokens per second on single RTX 3090 a card from 6 years ago. nvidia cooked.

nvidia's 3B mamba destroyed alibaba's 3B deltanet on the same RTX 3090. only 24 days between releases. same active parameters, same VRAM tier, completely different architectures. nemotron cascade 2: 187 tok/s. flat from 4K to 625K context. zero speed loss. flags: -ngl 99 -np 1. that's it. no context flags, no KV cache tricks. auto-allocates 625K. qwen 3.5 35B-A3B: 112 tok/s. flat from 4K to 262K context. zero speed loss. flags: -ngl 99 -np 1 -c 262144 --cache-type-k q8_0 --cache-type-v q8_0. needed KV cache quantization to fit 262K. both models held a flat line across every context level. both architectures are context-independent. but nvidia's mamba2 is 67% faster at generating tokens on the exact same hardware and needs fewer flags to get there. same node, same GPU, same everything. the only variable is the model. gold medal math olympiad winner running at 187 tokens per second on single RTX 3090 a card from 6 years ago. nvidia cooked.

Sudo su

186,642 Aufrufe • vor 4 Monaten

Most speculative decoding still drafts tokens one at a time. That's not parallel generation — it just hides the serial loop behind a smaller model. UC San Diego's z-lab just drew a clear line between the two. They released DFlash — a lightweight block diffusion model that drafts a whole block of tokens in a single forward pass, then lets the target model verify the block in parallel. Up to 15× higher throughput for gpt-oss-120b on NVIDIA Blackwell. No token-by-token drafting anywhere in the speculative path. Here's what's actually interesting: → The drafter is conditioned on the target model's own hidden features, injected into the Key/Value cache of every draft layer — so acceptance length scales with draft depth instead of diluting away → A 5-layer drafter replaces the 7B diffusion drafters that capped earlier methods near 3–4× → MATH-500 speedup: 6.08× vs. 1.81× for EAGLE-3 (4.86× average vs. 1.76×, Qwen3-8B, greedy) → Up to 15× higher throughput for gpt-oss-120b on NVIDIA Blackwell — at the same interactivity target → Lossless: the target still verifies every token, so output quality is preserved Full analysis: Paper: NVIDIA's metrics: Project: Model weights: Repo: NVIDIA AI NVIDIA AIDev

Most speculative decoding still drafts tokens one at a time. That's not parallel generation — it just hides the serial loop behind a smaller model. UC San Diego's z-lab just drew a clear line between the two. They released DFlash — a lightweight block diffusion model that drafts a whole block of tokens in a single forward pass, then lets the target model verify the block in parallel. Up to 15× higher throughput for gpt-oss-120b on NVIDIA Blackwell. No token-by-token drafting anywhere in the speculative path. Here's what's actually interesting: → The drafter is conditioned on the target model's own hidden features, injected into the Key/Value cache of every draft layer — so acceptance length scales with draft depth instead of diluting away → A 5-layer drafter replaces the 7B diffusion drafters that capped earlier methods near 3–4× → MATH-500 speedup: 6.08× vs. 1.81× for EAGLE-3 (4.86× average vs. 1.76×, Qwen3-8B, greedy) → Up to 15× higher throughput for gpt-oss-120b on NVIDIA Blackwell — at the same interactivity target → Lossless: the target still verifies every token, so output quality is preserved Full analysis: Paper: NVIDIA's metrics: Project: Model weights: Repo: NVIDIA AI NVIDIA AIDev

Marktechpost AI

23,328 Aufrufe • vor 1 Monat

first impressions of qwen 3.5 27B dense on a single RTX 3090. 35 tok/s. from 4K all the way to 300K+ context. no speed drop. hermes 4.3 started at 35 and degraded to 15 as context filled. qwen dense holds. MoE held 112 flat. 3x faster but only 3B of 35B active per token. architecture tradeoff. Q4_K_M on 16.7GB. native context 262K. pushed past training limit to 376K before VRAM ceiling on 24GB. tried q8 KV cache at 262K, speed collapsed to 11 tok/s. q4_0 KV is the sweet spot. flash attention mandatory. built in reasoning mode. the model thinks step by step before it answers. full chain of thought surviving Q4 quant. 1,799+ token thinking chains with self correction loops. on a single consumer GPU. gave it one prompt: "build a realtime particle galaxy simulation in one HTML file." 3,340 tokens. 95 seconds. one shot. ran on first load. full reasoning and coding in the video below. optimal config if you want to skip the hours of testing: llama-server -ngl 99 -c 262144 -fa on --cache-type-k q4_0 --cache-type-v q4_0 this is just the warmup. octopus invaders is next: 10 files, 3,400+ lines, zero steering. the prompt hermes quit at 22%. already more impressed than expected. full results coming soon.

first impressions of qwen 3.5 27B dense on a single RTX 3090. 35 tok/s. from 4K all the way to 300K+ context. no speed drop. hermes 4.3 started at 35 and degraded to 15 as context filled. qwen dense holds. MoE held 112 flat. 3x faster but only 3B of 35B active per token. architecture tradeoff. Q4_K_M on 16.7GB. native context 262K. pushed past training limit to 376K before VRAM ceiling on 24GB. tried q8 KV cache at 262K, speed collapsed to 11 tok/s. q4_0 KV is the sweet spot. flash attention mandatory. built in reasoning mode. the model thinks step by step before it answers. full chain of thought surviving Q4 quant. 1,799+ token thinking chains with self correction loops. on a single consumer GPU. gave it one prompt: "build a realtime particle galaxy simulation in one HTML file." 3,340 tokens. 95 seconds. one shot. ran on first load. full reasoning and coding in the video below. optimal config if you want to skip the hours of testing: llama-server -ngl 99 -c 262144 -fa on --cache-type-k q4_0 --cache-type-v q4_0 this is just the warmup. octopus invaders is next: 10 files, 3,400+ lines, zero steering. the prompt hermes quit at 22%. already more impressed than expected. full results coming soon.

Sudo su

120,788 Aufrufe • vor 4 Monaten

testing Qwen3.5-35B-A3B latest optimized version by UnslothAI on a single RTX 3090. one detailed prompt. zero handholding. watch a 3B model scaffold an entire multifile game project autonomously. the setup: > model: Qwen3.5-35B-A3B (80B total, only 3B active per token) > quant: UD-Q4_K_XL by Unsloth (MXFP4 layers removed in latest update) > speed: 112 tok/s generation, ~130 tok/s prefill > context: 262K tokens > flags: -ngl 99 -c 262144 -np 1 --cache-type-k q8_0 --cache-type-v q8_0 > engine: llama.cpp > agent: Claude Code talk to localhost:8080 (llama.cpp now has native Anthropic API endpoint. no LiteLLM needed) q8_0 KV cache cuts VRAM usage in half vs f16 at 262K. -np 1 is default but worth noting. parallel slots multiply KV cache and at 262K that's an instant OOM. the prompt was more detailed than this but you get the idea: build a space shooter with parallax backgrounds, particle systems, procedural audio, 4 enemy types, boss fights, power-up system, and ship upgrades. 8 JavaScript modules. no libraries. game's called Octopus Invaders. gameplay footage dropping next.

testing Qwen3.5-35B-A3B latest optimized version by UnslothAI on a single RTX 3090. one detailed prompt. zero handholding. watch a 3B model scaffold an entire multifile game project autonomously. the setup: > model: Qwen3.5-35B-A3B (80B total, only 3B active per token) > quant: UD-Q4_K_XL by Unsloth (MXFP4 layers removed in latest update) > speed: 112 tok/s generation, ~130 tok/s prefill > context: 262K tokens > flags: -ngl 99 -c 262144 -np 1 --cache-type-k q8_0 --cache-type-v q8_0 > engine: llama.cpp > agent: Claude Code talk to localhost:8080 (llama.cpp now has native Anthropic API endpoint. no LiteLLM needed) q8_0 KV cache cuts VRAM usage in half vs f16 at 262K. -np 1 is default but worth noting. parallel slots multiply KV cache and at 262K that's an instant OOM. the prompt was more detailed than this but you get the idea: build a space shooter with parallax backgrounds, particle systems, procedural audio, 4 enemy types, boss fights, power-up system, and ship upgrades. 8 JavaScript modules. no libraries. game's called Octopus Invaders. gameplay footage dropping next.

Sudo su

167,219 Aufrufe • vor 4 Monaten

Google's Gemma 4 26B A4B QAT hits 25+ tokens/sec and 320+ tokens/sec prefill on 8 GB VRAM (RTX 4060) + 16 GB RAM using TurboQuant Prefill just went from 200 → 320+ tok/s on the same 8GB card. 1.6x, no new hardware, no new quant, just a KV cache trick stacked on top of the Gemma 4 26B MoE setup from a few days ago. A few days ago I posted Gemma 4 26B A4B hitting 28 tok/s decode on 8GB VRAM using native MTP. prefill was stuck around 200 tok/s. fair callout by the community. So today I tested something I'd already been meaning to try: TheTom/llama-cpp-turboquant, the TurboQuant KV cache fork by Tom Turney (Tom Turney). (github link in the comments) thanks to him, the fork just got resynced to mainline, so MTP + TurboQuant now run together cleanly (I didnt see any meaningful gains by using MTP with this setup though but you can try). The flags (No MTP): -m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf -cnv -c 64000 --cache-type-k q8_0 --cache-type-v turbo3 Results on the same RTX 4060 8GB, tested with a 27k token prompt at 64k context loaded: Prefill: 200 tok/s → 320+ tok/s Decode: stayed above 25 tok/s (without MTP) Why it works: TurboQuant uses walsh hadamard rotation + polar quantization on the KV cache. keys are sensitive to compression, values aren't much, so it splits the difference: K stays at q8_0, V drops to turbo3 (~3 bits). bonus from the memory savings: same 8GB card can now stretch to 100-120k context with minimal decode penalty. It should now be snappier with any agent harness such as hermes agent without compromise on intelligence. If you're already running Gemma 4 on a small card, this stacks on top for free. Try --cache-type-k q8_0 --cache-type-v turbo3 on your setup and report back what your prefill/decode split looks like. unsloth model gguf and llama.cpp turboquant fork links in the comments. what's your prefill number before vs after?

Google's Gemma 4 26B A4B QAT hits 25+ tokens/sec and 320+ tokens/sec prefill on 8 GB VRAM (RTX 4060) + 16 GB RAM using TurboQuant Prefill just went from 200 → 320+ tok/s on the same 8GB card. 1.6x, no new hardware, no new quant, just a KV cache trick stacked on top of the Gemma 4 26B MoE setup from a few days ago. A few days ago I posted Gemma 4 26B A4B hitting 28 tok/s decode on 8GB VRAM using native MTP. prefill was stuck around 200 tok/s. fair callout by the community. So today I tested something I'd already been meaning to try: TheTom/llama-cpp-turboquant, the TurboQuant KV cache fork by Tom Turney (Tom Turney). (github link in the comments) thanks to him, the fork just got resynced to mainline, so MTP + TurboQuant now run together cleanly (I didnt see any meaningful gains by using MTP with this setup though but you can try). The flags (No MTP): -m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf -cnv -c 64000 --cache-type-k q8_0 --cache-type-v turbo3 Results on the same RTX 4060 8GB, tested with a 27k token prompt at 64k context loaded: Prefill: 200 tok/s → 320+ tok/s Decode: stayed above 25 tok/s (without MTP) Why it works: TurboQuant uses walsh hadamard rotation + polar quantization on the KV cache. keys are sensitive to compression, values aren't much, so it splits the difference: K stays at q8_0, V drops to turbo3 (~3 bits). bonus from the memory savings: same 8GB card can now stretch to 100-120k context with minimal decode penalty. It should now be snappier with any agent harness such as hermes agent without compromise on intelligence. If you're already running Gemma 4 on a small card, this stacks on top for free. Try --cache-type-k q8_0 --cache-type-v turbo3 on your setup and report back what your prefill/decode split looks like. unsloth model gguf and llama.cpp turboquant fork links in the comments. what's your prefill number before vs after?

Alok

119,821 Aufrufe • vor 1 Monat

$I nearly 2x'd the speed while only using +1GB VRAM with the new MTP update in llama.cpp 🤯 You need to add these flags to start using it: --spec-type draft-mtp \ --spec-draft-p-min 0.75 \ --spec-draft-n-max 2 My results with Qwen3.6 27B on a single RTX 5080 ↓ ⚪️ no flag (without mtp) → 54.3 tok/s with 13.26GB VRAM 🔵 --spec-draft-n-max 2 → 90.7 tok/s with 14.29GB VRAM 🔴 --spec-draft-n-max 2 --spec-draft-p-min 0.75 → 93.9 tok/s with 14.30GB VRAM 🟢 --spec-draft-n-max 6 --spec-draft-p-min 0.75 → 93.9 tok/s with 14.87GB VRAM Increasing to 6 draft tokens didn't help my setup for some reason. I made sure to test with a low context length to have enough headroom and eliminate risk of vram stress. From my understanding: 1) The speed gains are very task-dependent. You need to test across a wide range of tasks to get a realistic idea of the benefits 2) We’re already running heavily quantized GGUF models (Q3, Q4, Q6, etc.), so we already benefit from strong speed/performance thanks to the reduced size. That’s why some people are seeing little to no improvement compared to MLX or other quantized versions The progress over the past few days has been insane to say the least. However, MTP now consumes significantly more VRAM. Personally 16GB just isn't enough to use MTP and run it with a good context size. Time to upgrade lads, 24GB+ users are eating GOOD today 🔥 Full setup below ↓$

I nearly 2x'd the speed while only using +1GB VRAM with the new MTP update in llama.cpp 🤯 You need to add these flags to start using it: --spec-type draft-mtp \ --spec-draft-p-min 0.75 \ --spec-draft-n-max 2 My results with Qwen3.6 27B on a single RTX 5080 ↓ ⚪️ no flag (without mtp) → 54.3 tok/s with 13.26GB VRAM 🔵 --spec-draft-n-max 2 → 90.7 tok/s with 14.29GB VRAM 🔴 --spec-draft-n-max 2 --spec-draft-p-min 0.75 → 93.9 tok/s with 14.30GB VRAM 🟢 --spec-draft-n-max 6 --spec-draft-p-min 0.75 → 93.9 tok/s with 14.87GB VRAM Increasing to 6 draft tokens didn't help my setup for some reason. I made sure to test with a low context length to have enough headroom and eliminate risk of vram stress. From my understanding: 1) The speed gains are very task-dependent. You need to test across a wide range of tasks to get a realistic idea of the benefits 2) We’re already running heavily quantized GGUF models (Q3, Q4, Q6, etc.), so we already benefit from strong speed/performance thanks to the reduced size. That’s why some people are seeing little to no improvement compared to MLX or other quantized versions The progress over the past few days has been insane to say the least. However, MTP now consumes significantly more VRAM. Personally 16GB just isn't enough to use MTP and run it with a good context size. Time to upgrade lads, 24GB+ users are eating GOOD today 🔥 Full setup below ↓

left curve dev

29,946 Aufrufe • vor 2 Monaten

A single RTX 4090 (24 GB VRAM) can run the updated gemma 4 31B (dense) model with a 190,000 context window at 33 tokens/second. The VRAM barrier is dying. Google quietly updated Gemma 4, and Unsloth immediately compiled the new quants. I built llama.cpp from source on Ubuntu 22 to benchmark it. Google's stealth update 2 days ago enabled uniform Flash Attention 4 on Hopper to boost prefill and patched the chat template to improve tool calling. The agentic reasoning gains on the benchmark charts are massive: TB2 (Agents): +4.5% (to 25.8%) Tau2 (Telecom): +10.1% (to 62.7%) Running on Ubuntu 22, CUDA 13.0 with a single NVIDIA GeForce RTX 4090. Here is the exact step by step benchmarking process with a massive 28k tokens prompt and the commands I used to squeeze out maximum context without killing my throughput: # 1. The Baseline (Unquantized KV Cache) I started with full GPU offload (-ngl 99) and pushed the context to 40k. llama.cpp flags: ./build/bin/llama-server -m gemma-4-31B-it-qat-UD-Q4_K_XL.gguf -ngl 99 -c 40000 -fa on --port 8080 -v VRAM: 23.8 GB (maxed out on card) Throughput: Prefill: 2198.81 t/s | Decode: 35.77 t/s (with 28k tokens prompt) # 2. The CPU Split Trap I tried stretching to 80k context by offloading layers to the CPU (-ngl 52). llama.cpp flags: ./build/bin/llama-server -m gemma-4-31B-it-qat-UD-Q4_K_XL.gguf -c 80000 -ngl 52 -fa on --port 8080 -v Throughput: Prefill: 1212.73 t/s | Decode: 5 t/s (with 28k tokens prompt) # 3. The KV Quantization Breakthrough Instead of spilling layers to the CPU, I kept the model fully on card (-ngl 99) but enabled 8-bit KV cache quantization to free up VRAM. flags: ./build/bin/llama-server -m gemma-4-31B-it-qat-UD-Q4_K_XL.gguf -c 100000 --cache-type-k q8_0 --cache-type-v q8_0 -ngl 99 --port 8080 -v VRAM: 23.9 GB Throughput: Prefill: 2139.68 t/s | Decode: 32 t/s (with 28k tokens prompt) Result: 100k tokens of context on a single GPU with practically zero speed loss (and minimal intelligence loss). # 4. The Limit Test (Q4 KV Cache) To find the absolute breaking point, I dropped the KV cache to 4 bit (q4_0) and set -c 190000. flags: ./build/bin/llama-server -m gemma-4-31B-it-qat-UD-Q4_K_XL.gguf -c 190000 --cache-type-k q4_0 --cache-type-v q4_0 -ngl 99 --port 8080 -v VRAM: 23.8 GB Throughput: Prefill: 2206.66 t/s | Decode: 33 t/s (with 28k tokens prompt) (Note: Pushing it to 220k required dropping to -ngl 58 again, which immediately penalized decode down to 17 t/s). # The Tradeoff: For Max Reasoning: Keep your KV cache unquantized (f16). You get pristine reasoning but hit a strict 40k context ceiling. For Massive Document Retrieval: If you need to feed the model giant codebases, use --cache-type-k q4_0. Getting 190k context at 33 tokens/second on a consumer desktop with a 31b dense model is a cheat code. If you’re rocking a single 3090 or 4090 and slept on Gemma 4 earlier, this update is your cue to dust off the terminal. Hugging Face links to the Unsloth QAT quants are in the replies below.

A single RTX 4090 (24 GB VRAM) can run the updated gemma 4 31B (dense) model with a 190,000 context window at 33 tokens/second. The VRAM barrier is dying. Google quietly updated Gemma 4, and Unsloth immediately compiled the new quants. I built llama.cpp from source on Ubuntu 22 to benchmark it. Google's stealth update 2 days ago enabled uniform Flash Attention 4 on Hopper to boost prefill and patched the chat template to improve tool calling. The agentic reasoning gains on the benchmark charts are massive: TB2 (Agents): +4.5% (to 25.8%) Tau2 (Telecom): +10.1% (to 62.7%) Running on Ubuntu 22, CUDA 13.0 with a single NVIDIA GeForce RTX 4090. Here is the exact step by step benchmarking process with a massive 28k tokens prompt and the commands I used to squeeze out maximum context without killing my throughput: # 1. The Baseline (Unquantized KV Cache) I started with full GPU offload (-ngl 99) and pushed the context to 40k. llama.cpp flags: ./build/bin/llama-server -m gemma-4-31B-it-qat-UD-Q4_K_XL.gguf -ngl 99 -c 40000 -fa on --port 8080 -v VRAM: 23.8 GB (maxed out on card) Throughput: Prefill: 2198.81 t/s | Decode: 35.77 t/s (with 28k tokens prompt) # 2. The CPU Split Trap I tried stretching to 80k context by offloading layers to the CPU (-ngl 52). llama.cpp flags: ./build/bin/llama-server -m gemma-4-31B-it-qat-UD-Q4_K_XL.gguf -c 80000 -ngl 52 -fa on --port 8080 -v Throughput: Prefill: 1212.73 t/s | Decode: 5 t/s (with 28k tokens prompt) # 3. The KV Quantization Breakthrough Instead of spilling layers to the CPU, I kept the model fully on card (-ngl 99) but enabled 8-bit KV cache quantization to free up VRAM. flags: ./build/bin/llama-server -m gemma-4-31B-it-qat-UD-Q4_K_XL.gguf -c 100000 --cache-type-k q8_0 --cache-type-v q8_0 -ngl 99 --port 8080 -v VRAM: 23.9 GB Throughput: Prefill: 2139.68 t/s | Decode: 32 t/s (with 28k tokens prompt) Result: 100k tokens of context on a single GPU with practically zero speed loss (and minimal intelligence loss). # 4. The Limit Test (Q4 KV Cache) To find the absolute breaking point, I dropped the KV cache to 4 bit (q4_0) and set -c 190000. flags: ./build/bin/llama-server -m gemma-4-31B-it-qat-UD-Q4_K_XL.gguf -c 190000 --cache-type-k q4_0 --cache-type-v q4_0 -ngl 99 --port 8080 -v VRAM: 23.8 GB Throughput: Prefill: 2206.66 t/s | Decode: 33 t/s (with 28k tokens prompt) (Note: Pushing it to 220k required dropping to -ngl 58 again, which immediately penalized decode down to 17 t/s). # The Tradeoff: For Max Reasoning: Keep your KV cache unquantized (f16). You get pristine reasoning but hit a strict 40k context ceiling. For Massive Document Retrieval: If you need to feed the model giant codebases, use --cache-type-k q4_0. Getting 190k context at 33 tokens/second on a consumer desktop with a 31b dense model is a cheat code. If you’re rocking a single 3090 or 4090 and slept on Gemma 4 earlier, this update is your cue to dust off the terminal. Hugging Face links to the Unsloth QAT quants are in the replies below.

Alok

75,205 Aufrufe • vor 9 Tagen

A tricky LLM interview question: You're serving a reasoning model on vLLM, and it keeps running out of GPU memory on long traces. So you add KV cache compression and evict 90% of the cached tokens. VRAM usage stays as is and GPU still runs out of memory. Why? (answer below) Evicting 90% of the KV cache can free almost none of the memory it was using. This sounds counterintuitive, but it follows directly from how production servers store the cache today. The KV cache grows with every token a model generates. Each token appends its key and value vectors across every layer, and nothing is freed while generation continues. This is the dominant memory cost for reasoning models. If a 32K-token CoT caches ~32K tokens of KV vectors, a Qwen3-32B with 4-bit weights will run out-of-memory around 24K tokens on a 24GB GPU. One obvious solution is to keep the important tokens and drop the rest, since attention is sparse enough to allow it. But this does not solve the memory problem yet. The reason is paged attention, which is the memory manager behind vLLM and most production servers. Under the hood, it splits GPU memory into fixed physical blocks, each one holds the KV for about 16 tokens. This block returns to the allocator only when every slot inside it is empty. Since the eviction logic selects tokens by importance, and such tokens are scattered across blocks... ...so despite eviction, almost every block is left with at least some survivor tokens. For instance, if the logic evicts 14k of 16k tokens across 1,000 blocks, most likely every block will still have a token. This means the allocator frees almost nothing. Placing the new tokens into those freed slots is not ideal because it breaks the cache's layout. Say token 16,001 arrives, and it's placed in the slot the 40th token used to hold. The cache now reads position 38, then 16,001, then 41, so the cache is no longer in token order. Attention can still compute the right answer from that, but only if every slot now carries a separate note recording which position it actually holds. This introduces another bookkeeping cost that an in-order layout inherently avoids. So the cache is logically 90% smaller and still physically the same size. Many compression results miss this because they measure on pre-allocated contiguous tensors rather than a paged server. There's another problem. Eviction methods pick which tokens to keep by looking at the attention scores themselves (as expected). But fast attention kernels used in production, like FlashAttention, never save those scores. They compute attention in small pieces and throw the full score grid away as they go, which is also why they're fast. So the exact signal eviction methods need isn't available in memory. The workaround is to fall back to eager attention and build the full matrix, which gives up the speed FlashAttention was there to provide. NVIDIA published a method called TriAttention to solve both these problems. It never needs attention scores. Instead, it scores tokens from the geometry of the model's key and query vectors before RoPE is applied, where those vectors sit in stable clusters. For the memory problem, it runs a compaction pass every 128 decoded tokens. The surviving tokens slide forward to close the holes eviction creates, so whole blocks empty out and return to the allocator while the cache stays in token order. On long reasoning traces, the approach matches full-attention accuracy while decoding 2.5x faster and using 10.7x less KV memory. KV cache compression is a big infrastructure problem. The number that decides whether it works is the count of freed blocks, not the count of evicted tokens. You can find the NVIDIA write-up here: I wrote a first-principles breakdown of how the KV cache works. It walks through why the model stores keys and values at all, why the cache grows with every token, and a comparison of LLM generation speed with and without KV caching. Read it below.

A tricky LLM interview question: You're serving a reasoning model on vLLM, and it keeps running out of GPU memory on long traces. So you add KV cache compression and evict 90% of the cached tokens. VRAM usage stays as is and GPU still runs out of memory. Why? (answer below) Evicting 90% of the KV cache can free almost none of the memory it was using. This sounds counterintuitive, but it follows directly from how production servers store the cache today. The KV cache grows with every token a model generates. Each token appends its key and value vectors across every layer, and nothing is freed while generation continues. This is the dominant memory cost for reasoning models. If a 32K-token CoT caches ~32K tokens of KV vectors, a Qwen3-32B with 4-bit weights will run out-of-memory around 24K tokens on a 24GB GPU. One obvious solution is to keep the important tokens and drop the rest, since attention is sparse enough to allow it. But this does not solve the memory problem yet. The reason is paged attention, which is the memory manager behind vLLM and most production servers. Under the hood, it splits GPU memory into fixed physical blocks, each one holds the KV for about 16 tokens. This block returns to the allocator only when every slot inside it is empty. Since the eviction logic selects tokens by importance, and such tokens are scattered across blocks... ...so despite eviction, almost every block is left with at least some survivor tokens. For instance, if the logic evicts 14k of 16k tokens across 1,000 blocks, most likely every block will still have a token. This means the allocator frees almost nothing. Placing the new tokens into those freed slots is not ideal because it breaks the cache's layout. Say token 16,001 arrives, and it's placed in the slot the 40th token used to hold. The cache now reads position 38, then 16,001, then 41, so the cache is no longer in token order. Attention can still compute the right answer from that, but only if every slot now carries a separate note recording which position it actually holds. This introduces another bookkeeping cost that an in-order layout inherently avoids. So the cache is logically 90% smaller and still physically the same size. Many compression results miss this because they measure on pre-allocated contiguous tensors rather than a paged server. There's another problem. Eviction methods pick which tokens to keep by looking at the attention scores themselves (as expected). But fast attention kernels used in production, like FlashAttention, never save those scores. They compute attention in small pieces and throw the full score grid away as they go, which is also why they're fast. So the exact signal eviction methods need isn't available in memory. The workaround is to fall back to eager attention and build the full matrix, which gives up the speed FlashAttention was there to provide. NVIDIA published a method called TriAttention to solve both these problems. It never needs attention scores. Instead, it scores tokens from the geometry of the model's key and query vectors before RoPE is applied, where those vectors sit in stable clusters. For the memory problem, it runs a compaction pass every 128 decoded tokens. The surviving tokens slide forward to close the holes eviction creates, so whole blocks empty out and return to the allocator while the cache stays in token order. On long reasoning traces, the approach matches full-attention accuracy while decoding 2.5x faster and using 10.7x less KV memory. KV cache compression is a big infrastructure problem. The number that decides whether it works is the count of freed blocks, not the count of evicted tokens. You can find the NVIDIA write-up here: I wrote a first-principles breakdown of how the KV cache works. It walks through why the model stores keys and values at all, why the cache grows with every token, and a comparison of LLM generation speed with and without KV caching. Read it below.

Avi Chawla

269,149 Aufrufe • vor 1 Monat

update: qwen 3.6 27b dense q4 just one shotted octopus invaders game on a single 3090. hermes agent drove the whole thing, ~41 tok/s gen 21gb vram at full 262k context, thinking mode on. one prompt in and the canonical multi-file space shooter benchmark out, the same exact prompt i ran on qwen 3.5 27b dense back in march on the same card. 3.5 needed one external scope bug fix before the game would even load on first play. 3.6 needed nothing. 11 of 11 files written, 2411 lines of code, zero steering interventions, zero external fixes, playable on first load. 16 minutes 41 seconds wall clock from prompt to playable. consumer tier king on a single 3090 is locked tonight, and the silicon underneath my desk did not change between march and now. the open source ecosystem just moved the floor. watch it ship itself, the full 16 minutes 41 seconds sped to 3 minutes 45, no human touched the keyboard between the first prompt and the final frame.

update: qwen 3.6 27b dense q4 just one shotted octopus invaders game on a single 3090. hermes agent drove the whole thing, ~41 tok/s gen 21gb vram at full 262k context, thinking mode on. one prompt in and the canonical multi-file space shooter benchmark out, the same exact prompt i ran on qwen 3.5 27b dense back in march on the same card. 3.5 needed one external scope bug fix before the game would even load on first play. 3.6 needed nothing. 11 of 11 files written, 2411 lines of code, zero steering interventions, zero external fixes, playable on first load. 16 minutes 41 seconds wall clock from prompt to playable. consumer tier king on a single 3090 is locked tonight, and the silicon underneath my desk did not change between march and now. the open source ecosystem just moved the floor. watch it ship itself, the full 16 minutes 41 seconds sped to 3 minutes 45, no human touched the keyboard between the first prompt and the final frame.

Sudo su

123,876 Aufrufe • vor 2 Monaten

Solo dev reverse-engineered Google's billion-dollar algorithm in 7 days Google published the paper that crashed memory stocks worldwide. Then shipped zero code. Tom Turney read the math, opened his terminal, and built the whole thing with Claude - then made it faster than Google promised. Day 1-3: Core algorithms, 141 tests, Python prototype Day 3-5: C port into llama.cpp, Metal GPU kernels Day 5-7: Speed optimization from 739 to 2747 tok/s That's a 3.7x speedup through pure engineering: > fp32 → fp16 WHT > half4 vectorized butterfly ops > graph-side rotation > block-32 storage layout Then he added his own research on top: > Sparse V: skip 90% of value decompressions at long context > Asymmetric K/V: keep keys precise, compress values harder > Temporal decay: old tokens get lower precision automatically Result: 35B model running on a MacBook with 4.6x compressed cache. 613 GitHub stars in a week. Google still hasn't released their own code.

Solo dev reverse-engineered Google's billion-dollar algorithm in 7 days Google published the paper that crashed memory stocks worldwide. Then shipped zero code. Tom Turney read the math, opened his terminal, and built the whole thing with Claude - then made it faster than Google promised. Day 1-3: Core algorithms, 141 tests, Python prototype Day 3-5: C port into llama.cpp, Metal GPU kernels Day 5-7: Speed optimization from 739 to 2747 tok/s That's a 3.7x speedup through pure engineering: > fp32 → fp16 WHT > half4 vectorized butterfly ops > graph-side rotation > block-32 storage layout Then he added his own research on top: > Sparse V: skip 90% of value decompressions at long context > Asymmetric K/V: keep keys precise, compress values harder > Temporal decay: old tokens get lower precision automatically Result: 35B model running on a MacBook with 4.6x compressed cache. 613 GitHub stars in a week. Google still hasn't released their own code.

BuBBliK

1,660,689 Aufrufe • vor 3 Monaten

I had to test it myself to believe this unreal inference speed. 3,000 tokens/s for 1 user on standard datacenter GPUs. They leveraged a hidden efficiency gap in how GPUs generate tokens. Kog just achieved 3,000 tokens/s on 8× AMD MI300X GPUs and 2,100 on 8× NVIDIA H200 (FP16, no speculative decoding). Their tech preview is on a 2B model, and they show how their techniques will scale to large frontier MoE models at similar speeds. That's a huge number because normal low-batch GPU decoding for 2B to 8B models is usually closer to 100 to 300 tokens/s per request, so Kog is claiming something like a 10X to 30X jump in the speed one user actually feels. Their trick: they are getting the speed by treating LLM decoding as a memory streaming problem, not mainly a math problem. For 1 user at batch size 1, the GPU is not doing big, efficient matrix-matrix work like in training or large-batch serving; it is repeatedly pulling the model’s active weights from high-bandwidth memory for each new token, so speed depends on how smoothly those weights keep flowing. Normal inference stacks keep breaking that flow. They run many separate GPU programs for different parts of the model, move intermediate results through memory, wait at synchronization points, talk back to the CPU for scheduling or sampling, and then repeat this token after token. Kog’s answer is to co-design 3 things that are usually tuned separately: the runtime, the low-level GPU code, and the model architecture. The biggest engineering move is the monokernel, where the whole decode pass runs as 1 persistent GPU-resident program, including sampling, so the system does not keep stopping for kernel launches, CPU scheduling, and intermediate memory round trips. They also rebuilt synchronization, because their own measurements say grid sync was eating around 35% of token-generation time; instead of making every compute unit wait at a broad barrier, each unit waits only for the exact data it needs. On AMD MI300X, they also map memory access around the chiplet layout, because memory latency changes depending on which die makes the request. Then their Laneformer model uses Delayed Tensor Parallelism, which lets cross-GPU communication happen in the background instead of blocking every layer.

I had to test it myself to believe this unreal inference speed. 3,000 tokens/s for 1 user on standard datacenter GPUs. They leveraged a hidden efficiency gap in how GPUs generate tokens. Kog just achieved 3,000 tokens/s on 8× AMD MI300X GPUs and 2,100 on 8× NVIDIA H200 (FP16, no speculative decoding). Their tech preview is on a 2B model, and they show how their techniques will scale to large frontier MoE models at similar speeds. That's a huge number because normal low-batch GPU decoding for 2B to 8B models is usually closer to 100 to 300 tokens/s per request, so Kog is claiming something like a 10X to 30X jump in the speed one user actually feels. Their trick: they are getting the speed by treating LLM decoding as a memory streaming problem, not mainly a math problem. For 1 user at batch size 1, the GPU is not doing big, efficient matrix-matrix work like in training or large-batch serving; it is repeatedly pulling the model’s active weights from high-bandwidth memory for each new token, so speed depends on how smoothly those weights keep flowing. Normal inference stacks keep breaking that flow. They run many separate GPU programs for different parts of the model, move intermediate results through memory, wait at synchronization points, talk back to the CPU for scheduling or sampling, and then repeat this token after token. Kog’s answer is to co-design 3 things that are usually tuned separately: the runtime, the low-level GPU code, and the model architecture. The biggest engineering move is the monokernel, where the whole decode pass runs as 1 persistent GPU-resident program, including sampling, so the system does not keep stopping for kernel launches, CPU scheduling, and intermediate memory round trips. They also rebuilt synchronization, because their own measurements say grid sync was eating around 35% of token-generation time; instead of making every compute unit wait at a broad barrier, each unit waits only for the exact data it needs. On AMD MI300X, they also map memory access around the chiplet layout, because memory latency changes depending on which die makes the request. Then their Laneformer model uses Delayed Tensor Parallelism, which lets cross-GPU communication happen in the background instead of blocking every layer.

Rohan Paul

13,148 Aufrufe • vor 1 Monat

There is a strategy for Claude Max users that will 3-5X productivity and vibe coding learning curve. Which most vibe coders don't utilize! Most people using Claude Code just sit and watch the terminal while their prompt cooks. That’s dead time. If you’re on Claude Max, you’re paying for parallel capacity you’re probably not using fully. And tokens get wasted. Here’s a methodology to fix that and boost the way you use Claude forever: The Setup & Workflow 1. Add the repos you’re actively working on to your workspace from your IDE 2. Click “new terminal” and open 3-4 terminal instances, each pointed at a different project 3. Toggle between them on the right side of your terminal panel 4. When one terminal is cooking a prompt, look at the others 5. Use the time Claude is working to prompt the next project 6. Continuously rotate and ship across 3-4 projects like this 7. Use your tokens effectively, build more, learn faster The Result You ship 3-4 projects simultaneously. You burn through your token allocation on actual output instead of letting it sit unused. You learn faster because you’re getting more reps in the same amount of time, too. As long as you work on different projects at the same time like this, it will only be a boost in productivity, with no chance for code to get spaghetti. In the video, you can see where to find add workspace, run new terminal, and change between terminals:

There is a strategy for Claude Max users that will 3-5X productivity and vibe coding learning curve. Which most vibe coders don't utilize! Most people using Claude Code just sit and watch the terminal while their prompt cooks. That’s dead time. If you’re on Claude Max, you’re paying for parallel capacity you’re probably not using fully. And tokens get wasted. Here’s a methodology to fix that and boost the way you use Claude forever: The Setup & Workflow 1. Add the repos you’re actively working on to your workspace from your IDE 2. Click “new terminal” and open 3-4 terminal instances, each pointed at a different project 3. Toggle between them on the right side of your terminal panel 4. When one terminal is cooking a prompt, look at the others 5. Use the time Claude is working to prompt the next project 6. Continuously rotate and ship across 3-4 projects like this 7. Use your tokens effectively, build more, learn faster The Result You ship 3-4 projects simultaneously. You burn through your token allocation on actual output instead of letting it sit unused. You learn faster because you’re getting more reps in the same amount of time, too. As long as you work on different projects at the same time like this, it will only be a boost in productivity, with no chance for code to get spaghetti. In the video, you can see where to find add workspace, run new terminal, and change between terminals:

Meta Alchemist

82,000 Aufrufe • vor 6 Monaten

Alibaba just released a coding model that hits 82 percent on SWE-Bench Verified. That is the highest score ever published for an open-source model. The weights are free. The license is Apache 2.0. You can run it today. The model is Qwen 4 Coder 32B. Here is what 82 percent on SWE-Bench Verified actually means. SWE-Bench Verified tests whether an AI can autonomously resolve real bugs pulled from real production GitHub repositories. Not synthetic exercises. Real open-source projects that real teams depend on. A model gets a bug report, reads the code, writes a fix, and either passes the test suite or it does not. At 82 percent, Qwen 4 Coder 32B resolves 82 out of every 100 real production bugs it is given. Without a human guiding it. On code it has never seen before. For comparison: Qwen 4 Coder 32B: 82 percent SWE-Bench Verified. Open source. Apache 2.0. Claude Fable 5: 80.3 percent SWE-Bench Pro. $10 input / $50 output per million tokens. Currently suspended. GPT-5.6 Sol: Competitive on Terminal-Bench. $5 input / $30 output per million tokens. An open-weight model that you can download and run for free just beat both of them on the benchmark designed to measure real software engineering capability. Here is the architecture. Qwen 4 Coder 32B is a 32 billion parameter dense model. Not a Mixture-of-Experts. Every parameter is active on every request. This matters for inference: a dense 32B model runs on 22 gigabytes of VRAM, which fits on a single high-end consumer GPU or a MacBook Pro with 64GB of unified memory. The smaller variant, Qwen 4 Coder 4B, runs at approximately 135 tokens per second on an M5 Max and fits inside 8 gigabytes of RAM. For a model with usable coding capability, that is a new bar for what fits in a single laptop. The training methodology continued Alibaba's approach of reinforcement learning on verifiable coding tasks. The model gets rewarded when its code passes tests. It gets penalized when it fails. Over millions of training steps, the model learns to write code that actually runs rather than code that looks plausible. License: Apache 2.0. Full commercial use. No attribution requirement. No revenue threshold. No monthly active user ceiling. Weights: Hugging Face, available today. Runs on: vLLM, Ollama, SGLang, and any standard GGUF-compatible inference engine. Qwen 4 32B also runs at approximately 135 tokens per second on an M5 Max chip, setting a new bar for what a sub-8GB model can do on Apple Silicon. The open-source coding model just beat the best closed-source model in the world on the benchmark designed to test whether AI can actually do software engineering. The weights are free. The subscription is optional. Source: Autom8Labs AI Insight July 2026, State of Open Source LLMs June 2026, Kunal Ganglani blog June 2026.

Alibaba just released a coding model that hits 82 percent on SWE-Bench Verified. That is the highest score ever published for an open-source model. The weights are free. The license is Apache 2.0. You can run it today. The model is Qwen 4 Coder 32B. Here is what 82 percent on SWE-Bench Verified actually means. SWE-Bench Verified tests whether an AI can autonomously resolve real bugs pulled from real production GitHub repositories. Not synthetic exercises. Real open-source projects that real teams depend on. A model gets a bug report, reads the code, writes a fix, and either passes the test suite or it does not. At 82 percent, Qwen 4 Coder 32B resolves 82 out of every 100 real production bugs it is given. Without a human guiding it. On code it has never seen before. For comparison: Qwen 4 Coder 32B: 82 percent SWE-Bench Verified. Open source. Apache 2.0. Claude Fable 5: 80.3 percent SWE-Bench Pro. $10 input / $50 output per million tokens. Currently suspended. GPT-5.6 Sol: Competitive on Terminal-Bench. $5 input / $30 output per million tokens. An open-weight model that you can download and run for free just beat both of them on the benchmark designed to measure real software engineering capability. Here is the architecture. Qwen 4 Coder 32B is a 32 billion parameter dense model. Not a Mixture-of-Experts. Every parameter is active on every request. This matters for inference: a dense 32B model runs on 22 gigabytes of VRAM, which fits on a single high-end consumer GPU or a MacBook Pro with 64GB of unified memory. The smaller variant, Qwen 4 Coder 4B, runs at approximately 135 tokens per second on an M5 Max and fits inside 8 gigabytes of RAM. For a model with usable coding capability, that is a new bar for what fits in a single laptop. The training methodology continued Alibaba's approach of reinforcement learning on verifiable coding tasks. The model gets rewarded when its code passes tests. It gets penalized when it fails. Over millions of training steps, the model learns to write code that actually runs rather than code that looks plausible. License: Apache 2.0. Full commercial use. No attribution requirement. No revenue threshold. No monthly active user ceiling. Weights: Hugging Face, available today. Runs on: vLLM, Ollama, SGLang, and any standard GGUF-compatible inference engine. Qwen 4 32B also runs at approximately 135 tokens per second on an M5 Max chip, setting a new bar for what a sub-8GB model can do on Apple Silicon. The open-source coding model just beat the best closed-source model in the world on the benchmark designed to test whether AI can actually do software engineering. The weights are free. The subscription is optional. Source: Autom8Labs AI Insight July 2026, State of Open Source LLMs June 2026, Kunal Ganglani blog June 2026.

Harman

41,179 Aufrufe • vor 19 Tagen

Rene Haas just confirmed the Vera CPU thesis on yesterday’s Arm Q4 call. He didn’t mean to His framing: GPUs are reticle-limited. CPUs are not. The ratio shift is happening in core count, not chip count His exact words: “256 Vera CPU chips, 88 cores per chip, a 200-kilowatt liquid-cooled rack designed to sit in a data center adjacent to a Vera Rubin system” That is not a host CPU. That is a dedicated agentic orchestration Two days ago NVIDIA’s own engineers published the receipt. They traced a real 33-minute Claude Code session: 283 inference requests 58 main-agent turns coordinating 225 sub-agent invocations Context grew from 15K to 156K tokens before compaction dropped it to 20K Main agent alone processed ~3.5 million input tokens in the first 40 turns Anthropic’s own number: agentic systems consume up to 15x more tokens than chat. Coding agents sustain 95 to 98 percent prompt cache hit rates. Without caching, costs would be 6x higher This is what’s happening between GPU calls. File reads. Tool invocations. Sub-agent spawns. Compaction. KV cache management. None of it runs on the GPU That’s why 12,000 GPUs need 400,000 CPU cores. The 33-to-1 ratio isn’t a forecast. It’s a measurement NVIDIA states it in the blog directly: this won’t be resolved by adding more compute FLOPs and memory capacity Translation: the GPU-only path is exhausted. The agentic chapter requires a platform, not a chip Their seven-chip answer: Vera Rubin NVL72 —capacity and prefill Vera CPU — tool execution, KV cache offload Groq 3 LPX — SRAM-first decode, low-jitter generation NVLink 6, ConnectX-9, BlueField-4, Spectrum-X — fabric Result they claim: 400+ tokens per second per user on trillion-parameter MoE at 400K context. Vera spec: 88 Olympus cores, 176 threads, 1.8 TB/s NVLink-C2C, 1.2 TB/s LPDDR5X, 227 billion transistors. A 256-CPU rack delivers 45,056 threads and 400 TB of memory One detail nobody is talking about. The blog’s second author was previously Head of Agents at Groq. The third was previously at Groq Inc and Intel. NVIDIA didn’t license the LPX architecture. They absorbed the team that built it Haas isn’t pitching a competing thesis. He’s confirming this one from the other side of the table. Arm data center royalties doubled year-on-year. He expects them to double again Things feel slow right now because we’re between platforms. The speedup ships in H2 2026. The architectural argument is over. Deployment is the only variable left I cover this in The Quiet Architect and The Fourth Piece $arm $NVDA

Rene Haas just confirmed the Vera CPU thesis on yesterday’s Arm Q4 call. He didn’t mean to His framing: GPUs are reticle-limited. CPUs are not. The ratio shift is happening in core count, not chip count His exact words: “256 Vera CPU chips, 88 cores per chip, a 200-kilowatt liquid-cooled rack designed to sit in a data center adjacent to a Vera Rubin system” That is not a host CPU. That is a dedicated agentic orchestration Two days ago NVIDIA’s own engineers published the receipt. They traced a real 33-minute Claude Code session: 283 inference requests 58 main-agent turns coordinating 225 sub-agent invocations Context grew from 15K to 156K tokens before compaction dropped it to 20K Main agent alone processed ~3.5 million input tokens in the first 40 turns Anthropic’s own number: agentic systems consume up to 15x more tokens than chat. Coding agents sustain 95 to 98 percent prompt cache hit rates. Without caching, costs would be 6x higher This is what’s happening between GPU calls. File reads. Tool invocations. Sub-agent spawns. Compaction. KV cache management. None of it runs on the GPU That’s why 12,000 GPUs need 400,000 CPU cores. The 33-to-1 ratio isn’t a forecast. It’s a measurement NVIDIA states it in the blog directly: this won’t be resolved by adding more compute FLOPs and memory capacity Translation: the GPU-only path is exhausted. The agentic chapter requires a platform, not a chip Their seven-chip answer: Vera Rubin NVL72 —capacity and prefill Vera CPU — tool execution, KV cache offload Groq 3 LPX — SRAM-first decode, low-jitter generation NVLink 6, ConnectX-9, BlueField-4, Spectrum-X — fabric Result they claim: 400+ tokens per second per user on trillion-parameter MoE at 400K context. Vera spec: 88 Olympus cores, 176 threads, 1.8 TB/s NVLink-C2C, 1.2 TB/s LPDDR5X, 227 billion transistors. A 256-CPU rack delivers 45,056 threads and 400 TB of memory One detail nobody is talking about. The blog’s second author was previously Head of Agents at Groq. The third was previously at Groq Inc and Intel. NVIDIA didn’t license the LPX architecture. They absorbed the team that built it Haas isn’t pitching a competing thesis. He’s confirming this one from the other side of the table. Arm data center royalties doubled year-on-year. He expects them to double again Things feel slow right now because we’re between platforms. The speedup ships in H2 2026. The architectural argument is over. Deployment is the only variable left I cover this in The Quiet Architect and The Fourth Piece $arm $NVDA

Ben Pouladian

62,986 Aufrufe • vor 2 Monaten