
Ahmad
@TheAhmadOsman • 59,655 subscribers
ai, chips, systems engineering, infra & hardware · on a mission to build a frontier, infra-first AI Lab in the West · i mod GPUs on r/LocalLLaMA

INCREDIBLE SPEED running Claude Code w/ local models on my own GPUs at home
> SGLang serving MiniMax-M2.1
> on 8x RTX 3090s
> nvtop showing live GPU load
> Claude Code generating code + docs
> end-2-end on my AI cluster
MiniMax-M2.1 is my favorite model to run locally nowadays
Ahmad · 592,866 views • 4 months ago

running Qwen3.5 397B MoE (17B active/token) on 4x DGX Sparks in FP8 (~400GB)
> OpenCode driving
> agent exploring its own config
> probing all 4 Sparks (via ssh) + reporting thermals
> inspecting how vLLM is serving it
> collecting + analyzing its own stats
local AI is awesome
Ahmad · 121,691 views • 2 months ago
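A quick back-of-envelope check on the "~400GB" figure above. This is only a rough sketch: it counts weights alone at 1 byte per parameter for FP8 and ignores KV cache, activations, and any tensors kept at higher precision. The 128 GB of unified memory per DGX Spark is an assumption that is not stated in the post.

```python
# Weights-only memory estimate for a 397B-parameter model stored in FP8.
TOTAL_PARAMS = 397e9        # from the post: Qwen3.5 397B MoE
ACTIVE_PARAMS = 17e9        # from the post: 17B active per token
BYTES_PER_PARAM_FP8 = 1     # FP8 stores one byte per parameter

# ASSUMPTION (not in the post): each DGX Spark has 128 GB of unified memory.
SPARK_MEMORY_GB = 128
NUM_SPARKS = 4              # from the post

weights_gb = TOTAL_PARAMS * BYTES_PER_PARAM_FP8 / 1e9
cluster_gb = SPARK_MEMORY_GB * NUM_SPARKS
headroom_gb = cluster_gb - weights_gb          # left for KV cache, runtime, OS
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS  # share of weights used per token

print(f"weights:  {weights_gb:.0f} GB")
print(f"cluster:  {cluster_gb} GB across {NUM_SPARKS} nodes")
print(f"headroom: {headroom_gb:.0f} GB")
print(f"active per token: {active_fraction:.1%}")
```

Under those assumptions the weights take roughly 397 GB, which lines up with the "~400GB" in the post and leaves a bit over 100 GB across the four nodes for everything else.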

MiniMax M2.7 at home running on 4x DGX Sparks
vLLM serving full BF16 weights, 200k context
OpenCode having the model monitor its own hardware and report thermals, tokens/sec, TTFT, and other runtime stats in real time
What benchmarks / workflows / things do you wanna see next?
Ahmad · 91,625 views • 2 months ago

How Fast is Gemma 4 on a MacBook Pro M4?
Benchmarking Google's new MoE (26B-A4B)
> Model size: 26.1 GiB
> Load time: ~4.2s
Comparing single request VS
> concurrent requests performance
> 32k total context, 4 parallel slots
single request behavior
> TTFT: 5.68s
> prompt: 3,701 tokens @ 652 tok/s
> decode: 40.08 tok/s
sequential (1 request at a time):
> avg duration: 20.5s
> p99: 22.1s
> throughput: 40.11 tok/s
> clean finishes: 100%
concurrent (4 parallel requests):
> aggregate throughput: 47.25 tok/s
> total system throughput: 262.27 tok/s
> avg duration: 65.1s
> p95 latency: 68.8s
> req/sec: 0.058
Head-to-Head: Sequential vs Concurrent
throughput:
> 40.11 tok/s → 47.25 tok/s (+17.8%)
> small gain despite 4x parallelism
latency per request:
> 20.5s → 65.1s (~3.2x slower)
> you pay heavily for concurrency
system throughput (true utilization):
> ~40 tok/s → 262 tok/s (~6.5x total output)
> this is where concurrency wins
tokens per second (decode ceiling):
> ~40 tok/s steady in both modes
> hardware-bound, not scheduler-bound
TTFT impact:
> ~5.7s baseline → buried under queueing in concurrent
> “headers waittime” becomes the bottleneck
What does this actually mean?
- You don’t get linear scaling from parallel slots
- You trade latency for total output
- Mac Unified Memory setup is clearly saturating
- Bandwidth + Scheduling overhead show up immediately
This is exactly why GPUs dominate here
Concurrency without killing latency
Ahmad · 87,545 views • 2 months ago
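The head-to-head percentages above can be reproduced from the raw numbers in the post. A minimal sketch: the variable and helper names are made up for illustration, and only figures quoted in the post are used as inputs.

```python
# Reproduce the sequential-vs-concurrent comparison from the posted figures.
seq_throughput = 40.11      # tok/s, sequential (1 request at a time)
conc_throughput = 47.25     # tok/s, concurrent "aggregate throughput"
system_throughput = 262.27  # tok/s, concurrent "total system throughput"
seq_latency = 20.5          # s, sequential avg duration
conc_latency = 65.1         # s, concurrent avg duration


def pct_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


def ratio(old: float, new: float) -> float:
    """How many times larger new is than old."""
    return new / old


throughput_gain_pct = pct_change(seq_throughput, conc_throughput)
latency_slowdown = ratio(seq_latency, conc_latency)
system_gain = ratio(seq_throughput, system_throughput)

print(f"throughput gain:   {throughput_gain_pct:+.1f}%")
print(f"latency slowdown:  {latency_slowdown:.1f}x")
print(f"system throughput: {system_gain:.1f}x")
```

The results round to +17.8%, 3.2x, and 6.5x, matching the numbers quoted in the post.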

TEN HOURS LEFT TO WIN DGX Sparks & Mac Minis
> ur submission can be as simple as
> sending hello world msg while showing ur laptop
> or as complicated as a room full of hardware
> u have to use ur office chair as a laptop stand
reply with ur submissions so i don’t miss them
Ahmad · 26,838 views • 6 months ago