bstn 👁️

@bstnxbt • 1,460 subscribers

swe, ai, mlx

Videos

DFlash speculative decoding on Apple Silicon
Qwen3.5-9B bf16 · M5 Max · greedy exact match

▸ 85 tok/s, 3.3× at 1024 tokens (runtime)
▸ ~70 tok/s, 2.6× in the video (terminal I/O overhead)
▸ 80 tok/s, 3.1× at 2048 tokens (runtime)

Currently working on:
→ Long context (speedup degrades past 4K tokens, KV cache growth)
→ Int4 quantized models (27B class)

Built on MLX, no CUDA, single machine. The draft model generates 16 tokens in parallel, and the target model verifies them in one forward pass. Will open source when ready.
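For readers new to the idea, here is a minimal sketch of greedy exact-match speculative decoding in plain Python. It is not DFlash's code: `toy_target`, `toy_draft` and the function names are hypothetical stand-ins, the draft loop is sequential rather than parallel, and the per-position target calls stand in for what a real runtime gets from a single forward pass.

```python
from typing import Callable, List, Tuple

Token = int
NextFn = Callable[[List[Token]], Token]


def toy_target(ctx: List[Token]) -> Token:
    """Hypothetical stand-in for the target model's greedy next token."""
    return (ctx[-1] * 3 + len(ctx)) % 50


def toy_draft(ctx: List[Token]) -> Token:
    """Hypothetical draft: agrees with the target except at every 5th position."""
    t = toy_target(ctx)
    return t if len(ctx) % 5 else (t + 1) % 50


def greedy_decode(next_token: NextFn, prompt: List[Token], n_new: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(next_token(out))
    return out


def speculative_decode(target: NextFn, draft: NextFn, prompt: List[Token],
                       n_new: int, block: int = 16) -> Tuple[List[Token], int]:
    """Draft `block` tokens, keep the longest prefix the target agrees with."""
    out = list(prompt)
    limit = len(prompt) + n_new
    cycles = 0
    while len(out) < limit:
        # 1) Draft proposes a block of tokens.
        proposal: List[Token] = []
        for _ in range(block):
            proposal.append(draft(out + proposal))
        # 2) Verify. A real runtime scores all positions in one forward pass;
        #    here we call the toy target position by position.
        accepted: List[Token] = []
        for i in range(len(proposal) + 1):
            want = target(out + accepted)
            accepted.append(want)  # always the target's own choice
            if i == len(proposal) or proposal[i] != want:
                break  # first mismatch (correction) or end of block (bonus token)
        out.extend(accepted)
        cycles += 1
    return out[:limit], cycles


baseline = greedy_decode(toy_target, [1, 2, 3], 40)
spec, cycles = speculative_decode(toy_target, toy_draft, [1, 2, 3], 40)
print(spec == baseline, cycles)
```

Because every emitted token is the target's own greedy choice, the output matches plain greedy decoding exactly; the speedup comes only from needing fewer verify cycles than tokens.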

36,942 views • 3 months ago

DFlash v0.1.4: custom Metal verify kernels for quantized Qwen3 hybrid models, plus significant peak memory reduction at long context.

M5 Max 40-core GPU, 64GB, stock mlx_lm baseline:

Qwen3.6-35B-A3B-4bit:
► @ 1024 · 138.3 → 300.3 tok/s (2.20x)
► @ 2048 · 135.6 → 246.4 tok/s (1.81x)
► @ 4096 · 134.5 → 208.4 tok/s (1.56x)
► @ 8192 · 133.2 → 177.4 tok/s (1.33x)

Qwen3.5-27B-4bit:
► @ 1024 · 33.5 → 79.0 tok/s (2.37x)
► @ 2048 · 33.1 → 70.2 tok/s (2.12x)
► @ 4096 · 31.5 → 55.7 tok/s (1.77x)
► @ 8192 · 33.9 → 45.3 tok/s (1.34x)

Working on making this usable for agentic workloads; the goal is to never drop below baseline at any context depth.

LLM decode is memory-bandwidth bound. M5 Max runs at 614 GB/s, about 1.5x the bandwidth of M1-M4 Max (400-410 GB/s). Results will vary on lower-bandwidth chips.
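A back-of-envelope way to read the bandwidth point above, as a rough sketch: assume each decoded token streams every active weight from memory once, and ignore the KV cache, quantization scales and activations. The function name and these simplifications are illustrative assumptions, not part of DFlash or mlx_lm.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, active_params: float,
                         bits_per_weight: float) -> float:
    """Upper bound on tokens/s if every token reads all active weights once."""
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


# Dense 27B model at 4 bits per weight (about 13.5 GB of weights).
m5_max = decode_ceiling_tok_s(614, 27e9, 4)    # roughly 45 tok/s
older_max = decode_ceiling_tok_s(400, 27e9, 4)  # roughly 30 tok/s
print(round(m5_max, 1), round(older_max, 1))
```

Under these assumptions the measured 33.5 tok/s baseline for the 27B 4-bit model sits below the ~45 tok/s ceiling at 614 GB/s. Speculative decoding gets past the per-token ceiling because one verify pass reads the weights once for several accepted tokens.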

23,120 views • 3 months ago

dflash-mlx v0.1.7 is out. Big adaptive-runtime update, still focused mostly on Qwen3.6 27B 4-bit.

@ 2048 tokens, M5 Max, stock mlx_lm baseline:
► 1024: 33.26 → 98.05 tok/s (x2.95)
► 2048: 32.34 → 90.67 tok/s (x2.81)
► 4096: 30.58 → 93.55 tok/s (x3.06)
► 8192: 26.03 → 79.12 tok/s (x3.04)
► 16384: 21.50 → 60.77 tok/s (x2.78)

Main change: adaptive verify got a lot smarter. Instead of blindly trying to verify large 16-token blocks all the time, DFlash now watches acceptance + tokens/cycle + real cycle cost. When the draft gets weaker, it drops to smaller 4-token blocks, then probes back up only when the recent cycles make sense.

In practice: less wasted verify work, better long-context behavior, and much more useful metrics to understand what is happening.

► retuned adaptive verify for long-context / agentic decode
► richer metrics: tokens/cycle, adaptive block state, CopySpec counters
► /metrics now has real decode avg + logical/real/restored prefill rates
► AIME25 benchmark suite with exact integer scoring
► Qwen thinking default now follows tokenizer/request behavior
► GDN recurrent exactness fixes

I also started running AIME25-style long generations. Even around 45k generated tokens, I was still seeing ~40 tok/s on 27B 4-bit.

Over the next few days I’ll share more demos: AIME runs, real OpenCode game/project sessions, and full metrics along the way.

Still optimizing hard for 27B 4-bit first, while working on custom kernels per Apple GPU generation so more machines can benefit.
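To make the adaptive-verify idea concrete, here is a small hypothetical controller. It only tracks a moving average of the acceptance rate; the real runtime is described as also weighing tokens/cycle and measured cycle cost, which this sketch leaves out. The class name, window size and thresholds are made-up illustration values, not DFlash's.

```python
from collections import deque


class AdaptiveBlock:
    """Switch between a large and a small draft block based on recent acceptance."""

    def __init__(self, large: int = 16, small: int = 4, window: int = 8,
                 drop_below: float = 0.35, probe_above: float = 0.75) -> None:
        self.large = large
        self.small = small
        self.drop_below = drop_below    # assumed threshold to shrink the block
        self.probe_above = probe_above  # assumed threshold to probe back up
        self.block = large
        self.recent = deque(maxlen=window)

    def record(self, accepted: int) -> int:
        """Record accepted draft tokens for the last cycle; return the next block size."""
        self.recent.append(accepted / self.block)
        if len(self.recent) < self.recent.maxlen:
            return self.block  # not enough history yet
        rate = sum(self.recent) / len(self.recent)
        if self.block == self.large and rate < self.drop_below:
            self.block = self.small   # draft is weak: stop wasting verify work
            self.recent.clear()
        elif self.block == self.small and rate > self.probe_above:
            self.block = self.large   # draft recovered: probe the large block again
            self.recent.clear()
        return self.block


ctl = AdaptiveBlock(window=4)
after_weak = [ctl.record(2) for _ in range(4)]    # 2 of 16 accepted per cycle
after_strong = [ctl.record(4) for _ in range(4)]  # 4 of 4 accepted per cycle
print(after_weak[-1], after_strong[-1])
```

Clearing the history on every switch keeps the decision for the new block size from being skewed by rates measured at the old one.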

16,334 views • 2 months ago
