DFlash v0.1.4: custom Metal verify kernels for quantized Qwen3 hybrid models, plus significant peak memory reduction at long context.

M5 Max 40-core GPU, 64GB, stock mlx_lm baseline:

Qwen3.6-35B-A3B-4bit:
► @ 1024 · 138.3 → 300.3 tok/s (2.20x)
► @ 2048 · 135.6 → 246.4 tok/s (1.81x)
► @ 4096 · 134.5 → 208.4 tok/s (1.56x)
► @ 8192 · 133.2 → 177.4 tok/s (1.33x)

Qwen3.5-27B-4bit:
► @ 1024 · 33.5 → 79.0 tok/s (2.37x)
► @ 2048 · 33.1 → 70.2 tok/s (2.12x)
► @ 4096 · 31.5 → 55.7 tok/s (1.77x)
► @ 8192 · 33.9 → 45.3 tok/s (1.34x)

Working on making this usable for agentic workloads; the goal is to never drop below baseline at any context depth.

LLM decode is memory-bandwidth bound. M5 Max runs at 614 GB/s, about 1.5x the bandwidth of M1-M4 Max (400-410 GB/s). Results will vary on lower-bandwidth chips.

bstn 👁️

1,321 subscribers

23,120 views • 3 months ago • via X (Twitter)
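
The memory-bandwidth argument in the post above can be turned into a rough ceiling: if every active weight is read once per decoded token, tok/s cannot exceed bandwidth divided by bytes read per token. The sketch below is a back-of-the-envelope estimate, not the author's method. The function name, the flat 4.0 bits per weight (quantization scales, KV cache reads, and activations are ignored), and the active-parameter counts of 27B and 3B inferred from the model names are all assumptions.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float,
                         active_params_b: float,
                         bits_per_weight: float) -> float:
    """Upper bound on decode tok/s when each active weight is read once per token.

    Assumption: only weight traffic is counted; KV cache, quantization
    scales, and activations are ignored, so real numbers land lower.
    """
    gb_read_per_token = active_params_b * bits_per_weight / 8.0
    return bandwidth_gb_s / gb_read_per_token


# Bandwidth figures quoted in the post.
M5_MAX_GB_S = 614.0
M4_MAX_GB_S = 410.0

# Assumed: dense 27B model at 4 bits per weight.
dense_27b_m5 = decode_ceiling_tok_s(M5_MAX_GB_S, 27.0, 4.0)
dense_27b_m4 = decode_ceiling_tok_s(M4_MAX_GB_S, 27.0, 4.0)

# Assumed: MoE with roughly 3B active parameters per token at 4 bits per weight.
moe_3b_active_m5 = decode_ceiling_tok_s(M5_MAX_GB_S, 3.0, 4.0)

print(f"27B dense, 614 GB/s: {dense_27b_m5:.1f} tok/s ceiling")
print(f"27B dense, 410 GB/s: {dense_27b_m4:.1f} tok/s ceiling")
print(f"3B active, 614 GB/s: {moe_3b_active_m5:.1f} tok/s ceiling")
```

Under these assumptions the dense 27B ceiling on 614 GB/s lands in the mid-40s tok/s, which sits above the 31-34 tok/s baselines quoted in the post. Speculative decoding can exceed this per-token ceiling because one pass over the weights verifies several drafted tokens at once.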

Related Videos

DFlash speculative decoding on Apple Silicon
Qwen3.5-9B bf16 · M5 Max · greedy exact match
▸ 85 tok/s, 3.3× at 1024 tokens (runtime)
▸ ~70 tok/s, 2.6× in the video (terminal I/O overhead)
▸ 80 tok/s, 3.1× at 2048 tokens (runtime)

Currently working on:
→ Long context (speedup degrades past 4K tokens, KV cache growth)
→ Int4 quantized models (27B class)

Built on MLX, no CUDA, single machine. Draft generates 16 tokens in parallel, target verifies in one forward pass. Will open source when ready.

bstn 👁️

36,942 views • 3 months ago
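
The draft-then-verify loop described in the post above ("greedy exact match") can be illustrated with a toy example. This is a minimal sketch of generic greedy speculative decoding, not DFlash's implementation: `target` and `draft` are hypothetical stand-in functions rather than real models, the draft here proposes tokens one by one instead of in parallel, and the per-position `target` calls stand in for what would be a single batched forward pass.

```python
from typing import Callable, List, Tuple

# A "model" here is any function mapping a token prefix to its greedy next token.
Model = Callable[[List[int]], int]


def target(prefix: List[int]) -> int:
    """Hypothetical target model: a deterministic toy function."""
    return (sum(prefix) * 31 + len(prefix)) % 50


def draft(prefix: List[int]) -> int:
    """Hypothetical draft model: agrees with the target except when len(prefix) % 5 == 0."""
    t = target(prefix)
    return t if len(prefix) % 5 else (t + 1) % 50


def speculative_step(prefix: List[int], draft_fn: Model, target_fn: Model,
                     block: int) -> List[int]:
    """Draft `block` tokens, then keep the longest prefix the target agrees with."""
    proposal: List[int] = []
    for _ in range(block):
        proposal.append(draft_fn(prefix + proposal))

    accepted: List[int] = []
    for tok in proposal:
        expected = target_fn(prefix + accepted)  # one batched pass in a real system
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch with the target's token
            return accepted
        accepted.append(tok)
    accepted.append(target_fn(prefix + accepted))  # all accepted: one extra token for free
    return accepted


def generate_speculative(prompt: List[int], n: int, block: int = 16) -> Tuple[List[int], int]:
    out, cycles = list(prompt), 0
    while len(out) - len(prompt) < n:
        out.extend(speculative_step(out, draft, target, block))
        cycles += 1
    return out[:len(prompt) + n], cycles


def generate_greedy(prompt: List[int], n: int) -> List[int]:
    out = list(prompt)
    for _ in range(n):
        out.append(target(out))
    return out


spec_tokens, spec_cycles = generate_speculative([1, 2, 3], 40)
greedy_tokens = generate_greedy([1, 2, 3], 40)
print(spec_tokens == greedy_tokens, spec_cycles)
```

Every emitted token is either confirmed or produced by the target, so the output matches plain greedy decoding while needing fewer target cycles than tokens generated.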

dflash-mlx v0.1.7 is out. Big adaptive-runtime update, still focused mostly on Qwen3.6 27B 4-bit.

@ 2048 tokens, M5 Max, stock mlx_lm baseline:
► 1024: 33.26 → 98.05 tok/s (x2.95)
► 2048: 32.34 → 90.67 tok/s (x2.81)
► 4096: 30.58 → 93.55 tok/s (x3.06)
► 8192: 26.03 → 79.12 tok/s (x3.04)
► 16384: 21.50 → 60.77 tok/s (x2.78)

Main change: adaptive verify got a lot smarter. Instead of blindly trying to verify large 16-token blocks all the time, DFlash now watches acceptance + tokens/cycle + real cycle cost. When the draft gets weaker, it drops to smaller 4-token blocks, then probes back up only when the recent cycles make sense.

In practice: less wasted verify work, better long-context behavior, and much more useful metrics to understand what is happening.

► retuned adaptive verify for long-context / agentic decode
► richer metrics: tokens/cycle, adaptive block state, CopySpec counters
► /metrics now has real decode avg + logical/real/restored prefill rates
► AIME25 benchmark suite with exact integer scoring
► Qwen thinking default now follows tokenizer/request behavior
► GDN recurrent exactness fixes

I also started running AIME25-style long generations. Even around 45k generated tokens, I was still seeing ~40 tok/s on 27B 4-bit.

Over the next few days I’ll share more demos: AIME runs, real OpenCode game/project sessions, and full metrics along the way. Still optimizing hard for 27B 4-bit first, while working on custom kernels per Apple GPU generation so more machines can benefit.

bstn 👁️

16,334 views • 2 months ago
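
The adaptive verify behavior described in the post above (drop from 16-token to 4-token blocks when the draft weakens, then probe back up) could look roughly like the controller below. This is a hypothetical sketch, not the dflash-mlx code: the class name, window length, and both thresholds are invented for illustration, and it tracks only the acceptance rate, whereas the post says the real runtime also weighs tokens/cycle and measured cycle cost.

```python
from collections import deque


class AdaptiveBlock:
    """Toy block-size controller driven by a rolling acceptance rate.

    All defaults are illustrative assumptions, not values from dflash-mlx.
    """

    def __init__(self, large: int = 16, small: int = 4, window: int = 8,
                 drop_below: float = 0.3, probe_above: float = 0.7) -> None:
        self.large, self.small = large, small
        self.block = large
        self.recent = deque(maxlen=window)  # acceptance rates of recent cycles
        self.drop_below, self.probe_above = drop_below, probe_above

    def update(self, accepted: int) -> int:
        """Record how many drafted tokens were accepted; return the next block size."""
        self.recent.append(accepted / self.block)
        rate = sum(self.recent) / len(self.recent)
        window_full = len(self.recent) == self.recent.maxlen

        if self.block == self.large and rate < self.drop_below:
            # Draft is weak: stop paying for large verify blocks.
            self.block = self.small
            self.recent.clear()
        elif self.block == self.small and window_full and rate > self.probe_above:
            # A full window of good cycles: probe the large block again.
            self.block = self.large
            self.recent.clear()
        return self.block


ctrl = AdaptiveBlock()
after_bad_cycle = ctrl.update(2)                      # 2 of 16 accepted
recovery = [ctrl.update(4) for _ in range(8)]         # eight fully accepted small blocks
print(after_bad_cycle, recovery)
```

Clearing the window on every switch keeps statistics from one block size from biasing decisions at the other, which is one simple way to make the probe back up depend only on recent cycles.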

The RTX 3090 is a 5-year-old GPU and it still runs a 27B model at 20 tok/s I tested Qwen3.5:27b across 3 generations of NVIDIA: 5090 → ~60 tok/s 4090 → ~40 tok/s 3090 → ~20 tok/s Perfectly linear scaling. Double the generation, double the speed.

stevibe

142,493 views • 4 months ago

"Why are you benchmarking DGX Spark? It's a training box."

Yeah. Low bandwidth, but 128GB of unified memory is just sitting there. Plenty of room to optimize.

DGX Spark + Qwen3.6 27B. Four backend/quant combos:
🔴 llama.cpp + UD_Q4_K_XL > 11.0 tok/s (baseline), TTFT 297ms
🟢 llama.cpp + DFlash > 20.4 tok/s (peaks at 97 tok/s), TTFT 320ms
🟡 vLLM FP8 + MTP > 13.1 tok/s, TTFT 540ms
🟣 vLLM NVFP4 + MTP > 24.2 tok/s, TTFT 376ms

NVFP4+MTP is the winner for me, rock stable around 24 tok/s, no wild swings. DFlash is the wildcard: massive peaks, but fluctuates a lot. FP8+MTP barely beats baseline, and it's FP8. Love my Spark.

stevibe

40,118 views • 2 months ago

Qwen3.5:9b reasoning head-to-head: Mac Studio M2 Ultra 64GB: 43.08 tok/s Mac Mini M4 16GB: 13.07 tok/s Qwen

stevibe

243,343 views • 5 months ago

GLM-5.1-478B-NVFP4 Running on: - 4x RTX Pro 6000 - Sglang - 370,000 max tokens (1.75x full context) - p10 27.7 | p90 45.6 tok/s decode (gen) - 1340 tok/s prefill I could get 2x decode if I limit to 64k context (100 tok/s) In this video it operates Figma (:

0xSero

74,772 views • 3 months ago

This is WILD ! 2 x 3090s 🚀 qwen3.6-35b-a3b-autoround on 3090 35B 32 parallel coding agents • Total: 25,943 tokens in 17s • Aggregate: 252275.4 tok/s peak · 1548.1 tok/s sustained • Per-stream: 20000.0 high / 1666.7 low / 9567.7 avg tok/s • TTFT 0.43s avg · E2E 17s avg

Tech2Wild

14,587 views • 26 days ago

Curious about ollama kimi-k2.6:cloud speed? 3 test runs: > 77.9 tok/s, TTFT 979ms > 114.3 tok/s, TTFT 788ms > 86.3 tok/s, TTFT 1117ms For comparison, OpenRouter stats: > Parasail: 14 tok/s > Moonshot AI: 27 tok/s > NovitaAI: 27 tok/s > Cloudflare: 71 tok/s Obvious caveat: cloud speeds fluctuate with load. Just sharing numbers for the curious.

stevibe

35,129 views • 3 months ago

Everyone's comparing the DGX Spark to a 5090 and calling it slow. I think that's the wrong comparison. I ran Qwen3.6 35B-A3B FP8 with the full 262K context window enabled (~96GB RAM) — something gaming GPUs can't really do. Results: 🟢No context: 51.3 tok/s, TTFT 110ms 🟣200K prefill: 34.6 tok/s, TTFT 85s (~2,341 tok/s prefill) Prefill is way faster than a Mac. And 35 tok/s deep into 200K context, on a model this strong, is genuinely usable. The Spark plays a different game.

stevibe

34,109 views • 2 months ago

LFM2-VL is done✅ M3 max stats: - Full precision (~250 tok/s) - 4 bit quant (~530 tok/s)

Prince Canuma

47,190 views • 11 months ago

How Fast is Gemma 4 on a MacBook Pro M4?

Benchmarking Google's new MoE (26B-A4B)
> Model size: 26.1 GiB
> Load time: ~4.2s

Comparing single request VS concurrent requests performance
> 32k total context, 4 parallel slots

single request behavior
> TTFT: 5.68s
> prompt: 3,701 tokens @ 652 tok/s
> decode: 40.08 tok/s

sequential (1 request at a time):
> avg duration: 20.5s
> p99: 22.1s
> throughput: 40.11 tok/s
> clean finishes: 100%

concurrent (4 parallel requests):
> aggregate throughput: 47.25 tok/s
> total system throughput: 262.27 tok/s
> avg duration: 65.1s
> p95 latency: 68.8s
> req/sec: 0.058

Head-to-Head: Sequential vs Concurrent

throughput:
> 40.11 tok/s → 47.25 tok/s (+17.8%)
> small gain despite 4x parallelism

latency per request:
> 20.5s → 65.1s (~3.2x slower)
> you pay heavily for concurrency

system throughput (true utilization):
> ~40 tok/s → 262 tok/s (~6.5x total output)
> this is where concurrency wins

tokens per second (decode ceiling):
> ~40 tok/s steady in both modes
> hardware-bound, not scheduler-bound

TTFT impact:
> ~5.7s baseline → buried under queueing in concurrent
> “headers waittime” becomes the bottleneck

What this actually means?
- You don’t get linear scaling from parallel slots
- You trade latency for total output
- Mac Unified Memory setup is clearly saturating
- Bandwidth + Scheduling overhead show up immediately

This is exactly why GPUs dominate here
Concurrency without killing latency

Ahmad

88,866 views • 4 months ago

Apple Silicon + Gemma 4 fans: this is for you. Pico AI Server now supports continuous batching with MLX-Swift. 43 tok/s on 1 stream. 26 tok/s per stream on 2 concurrent streams. That’s 52 tok/s total, a 21% throughput gain on a six-year-old MacBook Pro M1 Max!

Ronald Mannak

50,180 views • 3 months ago

Qwen3.5-35B with only 3B active parameters This MoE model runs FASTER than most 7B dense models. Tested on 3 generations of NVIDIA: - 5090: 137 tok/s - 4090: 112 tok/s - 3090: 78 tok/s The surprise? The 4090 <> 5090 gap is only 22%. With a 3B active MoE, even old GPUs fly.

stevibe

69,532 views • 4 months ago

Nemotron-3-Ultra running on 4x 6000s edits my latest demo video.. - 75 tok/s decode - 8x concurrency - 256k context - 899 tok/s prefill - 20k tok/s prefill cache - NVFP4 Setting it up to be my Hermes driver. It's good enough at most things and doesn't talk like a moron.

0xSero

15,674 views • 1 month ago

Gemma4 12B with Unsloth's Quant on DGX Spark Quants: - UD_Q4_K_XL - UD_Q5_K_XL - UD_Q6_K_XL - UD_Q8_K_XL Summary: - Q4: 25.21 tok/s, TTFT 168ms - Q5: 21.7 tok/s, TTFT 182ms - Q6: 17.68 tok/s, TTFT 193.95ms - Q8: 15.22 tok/s, TTFT 221ms

stevibe

18,639 views • 1 month ago

GLM 4.7 Flash is supported in mlx-lm 0.30.3 (h/t Ivan Fioravanti ᯅ) The 4-bit runs fast (43 tok/s generation, ~800 tok/s prefill) on a base M5 32GB laptop.

Awni Hannun

141,194 views • 6 months ago

llama.cpp with MTP support makes local models fast enough to use as daily drivers 🚀 Qwen3.6-27B dense generation (on A10G): From 25 tok/s → 45 tok/s (+78%). Two flags on llama-server: --spec-type draft-mtp --spec-draft-n-max 2

Victor M

171,818 views • 2 months ago

Demo for the sharded inference engine on: gpt-oss-120b on 4090's, max input context 95k, max output context infinite (tok/s degrades with context) queue of max 12 prompts concurrently, use it wisely

leyten

90,860 views • 1 month ago

Ollama was giving me 21 tok/s with Qwen3.6 35B (12 GB VRAM). Same model, same GPU → llama.cpp + -ncmoe 15 = 70 tok/s. It's not magic. It's a flag that Ollama doesn't expose. Exact command: llama-cli -m ~/models/Qwen3.6-35B-A3B-UD-IQ3_XXS.gguf -ngl 99 -ncmoe 15 -p "Hola" Real demo here 👇

OscarMartin

302,902 views • 2 months ago

2x RTX 3090s running 32 coding agents in parallel. Task: Build a trading bot (handle partial fills, rate limits, race conditions + explain reasoning). 🚀 qwen3.6-35b-a3b-autoround on 3090 35B • Total: 76,183 tokens in 64s • Aggregate: 3051.1 tok/s peak · 1183.8 tok/s sustained • Per-stream: 101.2 high / 80.5 low / 96.8 avg tok/s • TTFT 0.25s avg · E2E 55s avg

Tech2Wild

15,456 views • 26 days ago