
bstn 👁️
@bstnxbt • 1,443 subscribers
swe, ai, mlx
Videos

DFlash speculative decoding on Apple Silicon
Qwen3.5-9B bf16 · M5 Max · greedy exact match
▸ 85 tok/s, 3.3× at 1024 tokens (runtime)
▸ ~70 tok/s, 2.6× in the video (terminal I/O overhead)
▸ 80 tok/s, 3.1× at 2048 tokens (runtime)
Currently working on:
→ Long context (speedup degrades past 4K tokens, KV cache growth)
→ Int4 quantized models (27B class)
Built on MLX, no CUDA, single machine. Draft generates 16 tokens in parallel, target verifies in one forward pass. Will open source when ready.
bstn 👁️ 36,888 views • 2 months ago
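
For readers new to the idea, the "draft generates, target verifies, greedy exact match" loop can be sketched as below. This is a minimal illustration and not DFlash's code: `greedy_verify` and its arguments are hypothetical names, and it assumes the target's greedy (argmax) token at every draft position is already available from the single verify pass.

```python
from typing import List, Sequence


def greedy_verify(draft: Sequence[int], target_argmax: Sequence[int]) -> List[int]:
    """Return the tokens committed by one speculative cycle.

    draft:         tokens proposed by the draft model for this block.
    target_argmax: the target model's greedy token at each draft position,
                   plus one extra entry for the position after the block,
                   so len(target_argmax) == len(draft) + 1.
    """
    if len(target_argmax) != len(draft) + 1:
        raise ValueError("target_argmax must have one more entry than draft")

    committed: List[int] = []
    for proposed, expected in zip(draft, target_argmax):
        if proposed != expected:
            break  # first mismatch: everything after it is discarded
        committed.append(proposed)

    # The target's own token at the first mismatch (or the bonus token after
    # a fully accepted block) is always correct, so it is committed too.
    committed.append(target_argmax[len(committed)])
    return committed


# Two draft tokens match, the third does not: commit 5, 6 and the target's 9.
tokens = greedy_verify(draft=[5, 6, 7, 8], target_argmax=[5, 6, 9, 1, 2])
```

Because only tokens the target would have picked anyway are committed, the output is identical to plain greedy decoding; the speedup comes from committing several tokens per target forward pass.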

dflash-mlx v0.1.7 is out. Big adaptive-runtime update, still focused mostly on Qwen3.6 27B 4-bit.
@ 2048 tokens, M5 Max, stock mlx_lm baseline:
► 1024: 33.26 → 98.05 tok/s (x2.95)
► 2048: 32.34 → 90.67 tok/s (x2.81)
► 4096: 30.58 → 93.55 tok/s (x3.06)
► 8192: 26.03 → 79.12 tok/s (x3.04)
► 16384: 21.50 → 60.77 tok/s (x2.78)
Main change: adaptive verify got a lot smarter. Instead of blindly trying to verify large 16-token blocks all the time, DFlash now watches acceptance + tokens/cycle + real cycle cost. When the draft gets weaker, it drops to smaller 4-token blocks, then probes back up only when the recent cycles make sense. In practice: less wasted verify work, better long-context behavior, and much more useful metrics to understand what is happening.
► retuned adaptive verify for long-context / agentic decode
► richer metrics: tokens/cycle, adaptive block state, CopySpec counters
► /metrics now has real decode avg + logical/real/restored prefill rates
► AIME25 benchmark suite with exact integer scoring
► Qwen thinking default now follows tokenizer/request behavior
► GDN recurrent exactness fixes
I also started running AIME25-style long generations. Even around 45k generated tokens, I was still seeing ~40 tok/s on 27B 4-bit. Over the next few days I’ll share more demos: AIME runs, real OpenCode game/project sessions, and full metrics along the way. Still optimizing hard for 27B 4-bit first, while working on custom kernels per Apple GPU generation so more machines can benefit.
bstn 👁️ 16,334 views • 26 days ago
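
The adaptive verify behavior described in the v0.1.7 post (drop from 16-token to 4-token blocks when the draft weakens, move back up when recent cycles look good) could look roughly like the controller below. This is a simplified sketch and not the dflash-mlx implementation: the class name, the window size, and the `low`/`high` thresholds are all assumptions, and it tracks only the acceptance rate, whereas the post says DFlash also weighs tokens/cycle and real cycle cost.

```python
from collections import deque


class AdaptiveBlockSize:
    """Toy block-size controller driven by a rolling acceptance rate."""

    def __init__(self, large: int = 16, small: int = 4, window: int = 4,
                 low: float = 0.4, high: float = 0.8) -> None:
        self.large, self.small = large, small
        self.low, self.high = low, high      # hypothetical thresholds
        self.block = large                   # start optimistic
        self.recent = deque(maxlen=window)   # acceptance rates of recent cycles

    def update(self, accepted: int) -> int:
        """Record one verify cycle and return the block size for the next one."""
        self.recent.append(accepted / self.block)
        if len(self.recent) == self.recent.maxlen:
            rate = sum(self.recent) / len(self.recent)
            if self.block == self.large and rate < self.low:
                self.block = self.small      # draft is weak: stop wasting verify work
                self.recent.clear()
            elif self.block == self.small and rate > self.high:
                self.block = self.large      # draft recovered: go back to big blocks
                self.recent.clear()
        return self.block


ctl = AdaptiveBlockSize()
after_weak = [ctl.update(accepted=2) for _ in range(4)]    # 2 of 16 accepted
after_strong = [ctl.update(accepted=4) for _ in range(4)]  # 4 of 4 accepted
```

Clearing the window on every switch keeps rates measured at one block size from influencing decisions at the other; a real controller would likely add hysteresis and cost terms on top of this.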

DFlash v0.1.4: custom Metal verify kernels for quantized Qwen3 hybrid models, plus significant peak memory reduction at long context.
M5 Max 40-core GPU, 64GB, stock mlx_lm baseline:
Qwen3.6-35B-A3B-4bit:
► @ 1024 · 138.3 → 300.3 tok/s (2.20x)
► @ 2048 · 135.6 → 246.4 tok/s (1.81x)
► @ 4096 · 134.5 → 208.4 tok/s (1.56x)
► @ 8192 · 133.2 → 177.4 tok/s (1.33x)
Qwen3.5-27B-4bit:
► @ 1024 · 33.5 → 79.0 tok/s (2.37x)
► @ 2048 · 33.1 → 70.2 tok/s (2.12x)
► @ 4096 · 31.5 → 55.7 tok/s (1.77x)
► @ 8192 · 33.9 → 45.3 tok/s (1.34x)
Working on making this usable for agentic workloads; the goal is to never drop below baseline at any context depth.
LLM decode is memory-bandwidth bound. M5 Max runs at 614 GB/s, about 1.5x the bandwidth of M1-M4 Max (400-410 GB/s). Results will vary on lower-bandwidth chips.
bstn 👁️ 23,120 views • 1 month ago
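
To make the memory-bandwidth point concrete, here is a back-of-the-envelope estimate. It is only a rough sketch under stated assumptions: a dense model whose weights are all read once per generated token, exactly 4 bits per weight (real 4-bit formats also store scales, so they are somewhat larger), and no KV cache or activation traffic. The function name is hypothetical.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, params_billion: float,
                         bits_per_weight: float) -> float:
    """Upper bound on plain decode speed if every weight is read once per token."""
    weight_gb = params_billion * bits_per_weight / 8  # billions of params -> GB
    return bandwidth_gb_s / weight_gb


# Dense 27B model at 4 bits per weight is about 13.5 GB of weights.
m5_max = decode_ceiling_tok_s(614, 27, 4)      # ≈ 45.5 tok/s
older_max = decode_ceiling_tok_s(400, 27, 4)   # ≈ 29.6 tok/s
ratio = m5_max / older_max                     # same as 614 / 400
```

Under these assumptions the 33.5 tok/s baseline reported above for the 27B model is roughly three quarters of the M5 Max ceiling, and the same model on a 400 GB/s chip would top out near 30 tok/s before any speculative decoding gains. The estimate does not apply directly to the 35B-A3B model, since a mixture-of-experts model reads only its active experts per token.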