
stevibe
@stevibe • 22,337 subscribers
LLM. Local AI addict. Building @BenchLocalAI. Builds things nobody asked for. Benchmarks things for fun.

Opus 4.7 first-hour impressions. Ran the canvas tree growth test twice.
4.6: nailed the animation both times
4.7: static tree, no growth animation, twice
4.7's thinking is noticeably shorter and faster, though (I trimmed some of 4.6's thinking in the clip for pacing). Not the upgrade direction I expected on this one.
stevibe • 487,622 views • 1 month ago

Qwen3.6 35B-A3B dropped yesterday, so I ran it on 4 GPUs to see how it performs:
🟣 RTX 3090: 49.78 tok/s, TTFT 852ms
🟡 RTX 4090: 118.93 tok/s, TTFT 686ms
🟢 RTX 5090: 160.37 tok/s, TTFT 409ms
🔵 DGX Spark: 59.98 tok/s, TTFT 228ms
I went with ollama as the backend because, honestly, it's the easiest way for most people to get started. One command, model pulled, done.
I used Q4_K_M (24GB) across all four cards. The reason is that the 3090 and 4090 don't support NVFP4 (only the 5090 and DGX Spark can use it), so keeping the same quant everywhere felt like the fairest way to compare.
And yes, you can absolutely squeeze more performance out of every card with vLLM, SGLang, or TensorRT-LLM. But that's not what this test is about. This is just the out-of-the-box experience for folks who own a GPU and want to try the new model tonight.
stevibe • 388,832 views • 1 month ago
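
The post does not say how tok/s and TTFT were measured. As a minimal sketch of one possible approach, assuming the figures come from the timing fields in the final response object of Ollama's `/api/generate` endpoint (`eval_count`, `eval_duration`, `prompt_eval_duration`, with durations in nanoseconds): the helper name `generation_stats` and the sample values are hypothetical, and server-side prompt-eval time is only a proxy for TTFT as seen by a client.

```python
# Hypothetical helper, not taken from the post. Field names follow the final
# response object of Ollama's /api/generate; durations are in nanoseconds.
def generation_stats(final_chunk):
    eval_seconds = final_chunk["eval_duration"] / 1e9
    return {
        # Generated tokens divided by the time spent generating them.
        "tok_per_s": final_chunk["eval_count"] / eval_seconds,
        # Time spent processing the prompt; a server-side proxy for TTFT.
        "prompt_eval_ms": final_chunk["prompt_eval_duration"] / 1e6,
    }

# Made-up sample values, used only to show the arithmetic.
sample = {
    "eval_count": 512,
    "eval_duration": 4_000_000_000,
    "prompt_eval_duration": 350_000_000,
}
stats = generation_stats(sample)
print(f"{stats['tok_per_s']:.2f} tok/s, prompt eval {stats['prompt_eval_ms']:.0f} ms")
```

A client-side TTFT would also include network and model-load time, so it could differ from the prompt-eval figure.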

MiniMax M2.7 is 230B params. Can you actually run it at home? I tested Unsloth's UD-IQ3_XXS (80GB) on 4 different rigs:
🟠 4x RTX 4090 (96GB): 71.52 tok/s, TTFT 1045ms
🟢 4x RTX 5090 (128GB): 120.54 tok/s, TTFT 725ms
🟡 1x RTX PRO 6000 (96GB): 118.74 tok/s, TTFT 765ms
🟣 DGX Spark (128GB): 24.41 tok/s, TTFT 741ms
Backend: llama.cpp. Context: 32k. Max tokens: 4096.
I went with IQ3_XXS because it's the biggest quant that fits in 96GB VRAM while still leaving safe headroom for 32k context. Same quant across all four rigs, the fairest comparison I could run.
Now look at rough peak GPU power draw:
🟠 4x 4090 → 1,800W peak (450W × 4)
🟢 4x 5090 → 2,300W peak (575W × 4)
🟡 RTX PRO 6000 → 600W peak
🟣 DGX Spark → 240W peak (whole system)
The RTX PRO 6000 is the quiet winner. One card, 96GB, matching a 4x 5090 rig at roughly a quarter of the power and with zero multi-GPU headaches. Best tokens-per-watt by a wide margin.
DGX Spark is slow on generation but pulls the least power of any rig here, around 240W for the whole system. Prefill-friendly, memory-rich, wall-socket-friendly.
And yes, plenty of people cap their cards. Even then, 4x 4090 or 4x 5090 still pulls well over 1,200W from the GPUs alone.
stevibe • 190,022 views • 1 month ago
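
The tokens-per-watt claim in the post above can be checked with a few lines of arithmetic. This sketch uses only the numbers quoted in the post and assumes each rig runs at its stated peak draw while generating, which real workloads may not reach. Note that the DGX Spark figure is whole-system power while the others are GPU-only.

```python
# (generation tok/s, rough peak power in watts), copied from the post.
rigs = {
    "4x RTX 4090": (71.52, 1800),
    "4x RTX 5090": (120.54, 2300),
    "1x RTX PRO 6000": (118.74, 600),
    "DGX Spark": (24.41, 240),
}

def tokens_per_watt(tok_s, watts):
    # tok/s divided by watts is equivalent to tokens per joule.
    return tok_s / watts

efficiency = {name: tokens_per_watt(t, w) for name, (t, w) in rigs.items()}

# Rank rigs from most to least efficient.
ranked = sorted(efficiency.items(), key=lambda kv: kv[1], reverse=True)
for name, eff in ranked:
    print(f"{name}: {eff:.3f} tok/s per watt")
```

On these assumptions the single RTX PRO 6000 ranks first and the DGX Spark second, with both multi-GPU rigs behind.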

Step-3.7-Flash Q4_K_S on DGX Spark (GB10, 128GB):
> ~27 tok/s generation
> 198B sparse MoE, ~11B active
> 256K context, native vision
> Agentic / tool-calling / reasoning
> Apache 2.0
I added a mobile chat screen on the right showing what 27 tok/s actually feels like streaming on a phone.
stevibe • 19,372 views • 6 days ago

Qwen3.6 27B landed yesterday, so I ran it on 4 setups side-by-side to see how they stack up:
🔴 RTX 4090: 45.59 tok/s, TTFT 525ms
🟢 RTX 5090: 51.83 tok/s, TTFT 752ms
⚫️ M2 Ultra: 22.30 tok/s, TTFT 216ms
🟣 DGX Spark: 11.08 tok/s, TTFT 319ms
This is a standard test: no tuning, just the out-of-the-box experience.
For the NVIDIA cards I used llama.cpp with Unsloth's UD-Q4_K_XL quant. For the M2 Ultra I used MLX with Unsloth's UD-MLX-4bit quant, since MLX is the native path on Apple Silicon.
Please treat this as the baseline; you can definitely squeeze more out of every one of these with tuned settings.
stevibe • 102,503 views • 1 month ago

Finally got my hands on the big one. Qwen3.5-122B-A10B: 122 billion parameters. Too big for any single consumer GPU. So I rented 4 of each... and then one professional card to see if brute force even matters.
- 1x RTX PRO 6000 (96GB): 101.4 tok/s
- 4x 5090 (128GB): 87.0 tok/s
- 4x 4090 (96GB): 25.1 tok/s
- 4x 3090 (96GB): 20.8 tok/s
One single $8,500 card beat four RTX 5090s.
stevibe • 195,027 views • 2 months ago

Which LLMs actually love to think? I tested 7 models on 5 math problems and measured reasoning length.
The thinking winners: both Qwen3.5 models (27B and 35B A3B), massive overthinkers, up to 10k+ tokens on a single question.
Plot twists:
> Kimi K2.6 feels verbose but is actually one of the leanest
> Gemma 4 26B A4B solved 2 with ZERO thinking
stevibe • 96,938 views • 1 month ago

So we know Gemma 4 is good at tool calling, but what about web coding? I threw 4 UI screenshots at three Gemma 4 models and said "rebuild this": one shot, no hand-holding, just image in, code out.
Model lineup:
- E4B
- 26B A4B (MoE)
- 31B Dense
(skipped the E2B this round)
Let me know which one you think cooked the hardest.
stevibe • 124,613 views • 2 months ago

Yesterday we found out Qwen3.5-27B (dense) beats both the 35B and 122B MoE variants at UI replication from screenshots.
Today's question: what about Jackrong's Qwen3.5-27B-Claude-Opus-4.6-Reasoning-Distilled? A distilled 27B with Claude Opus 4.6 reasoning baked in.
So I put all 3 in the ring:
- Qwen3.5-27B
- Qwen3.5-27B-Claude-Opus-Distilled
- Claude Opus 4.6
Since the distilled model isn't multimodal, I gave all three the same detailed text prompts describing the UI components from yesterday's test.
stevibe • 125,494 views • 2 months ago

I'm obsessed with pushing small local models to their limits. Qwen3.5:0.8b doing real-time video captioning on a Mac Studio M2 Ultra, streaming descriptions as the video plays.
Under 1s per frame: 269 frames captured and described from a 3m49s video. Pause anywhere and read the captions; it describes every frame surprisingly well.
This model is barely 1GB. Local AI is moving absurdly fast.
stevibe • 106,159 views • 2 months ago
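
As a rough consistency check on the "under 1s per frame" figure in the post above, the sketch below divides the video length by the number of captioned frames. It assumes frames were sampled evenly across the whole clip, which the post does not state, so the result is an average time budget per frame and not a measured caption latency.

```python
# Numbers quoted in the post: a 3m49s video and 269 captioned frames.
minutes, seconds = 3, 49
frames = 269

duration_s = minutes * 60 + seconds       # total clip length in seconds
seconds_per_frame = duration_s / frames   # average budget per captioned frame

print(f"{duration_s} s / {frames} frames = {seconds_per_frame:.2f} s per frame")
```

To keep pace with playback, each caption would need to finish within that average budget, which is under one second here.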

Qwen3.6-27B on RTX 5090, 4 power limits tested:
🔴 400W → 66.58 t/s · baseline (but fluctuates a lot, with frequent dips)
🟢 450W → 69.79 t/s · +4.8% speed for +12.5% power (much more stable)
🟡 500W → 71.48 t/s · +2.4% speed for +11.1% power
🟣 575W → 72.64 t/s · +1.6% speed for +15.0% power
➡️ 400W → 575W: +44% power, +9% speed.
Conclusion: 450W is the real sweet spot. 400W looks great on average, but the t/s curve is jittery; 450W trades 50W for consistent throughput. Above that, you're just heating your room.
stevibe • 54,961 views • 1 month ago
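
The step-by-step percentages in the post above follow from the four measurements. As a small sketch that recomputes them, using only the numbers quoted in the post (the `pct_change` helper is just illustrative naming):

```python
# (power limit in watts, generation speed in t/s), copied from the post.
steps = [(400, 66.58), (450, 69.79), (500, 71.48), (575, 72.64)]

def pct_change(old, new):
    return (new - old) / old * 100

# Marginal gain of each step relative to the previous power limit.
marginal = []
for (w0, s0), (w1, s1) in zip(steps, steps[1:]):
    marginal.append((w1, pct_change(w0, w1), pct_change(s0, s1)))

for watts, d_power, d_speed in marginal:
    print(f"{watts}W: {d_speed:+.1f}% speed for {d_power:+.1f}% power")

# End-to-end change from the lowest to the highest limit.
total_power = pct_change(steps[0][0], steps[-1][0])
total_speed = pct_change(steps[0][1], steps[-1][1])
print(f"400W -> 575W: {total_power:+.0f}% power, {total_speed:+.0f}% speed")
```

Each step buys less speed per added watt than the one before, which is the diminishing-returns pattern behind the sweet-spot conclusion. The averages alone do not capture the jitter the post reports at 400W.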