
Ahmad
@TheAhmadOsman • 59,655 subscribers
ai, chips, systems engineering, infra & hardware · on a mission to build a frontier, infra-first AI Lab in the West · i mod GPUs on r/LocalLLaMA

INCREDIBLE SPEED
running Claude Code w/ local models on my own GPUs at home
> SGLang serving MiniMax-M2.1
> on 8x RTX 3090s
> nvtop showing live GPU load
> Claude Code generating code + docs
> end-2-end on my AI cluster
MiniMax-M2.1 is my favorite model to run locally nowadays
Ahmad • 592,866 views • 4 months ago

running Qwen3.5 397B MoE (17B active/token) on 4x DGX Sparks in FP8 (~400GB)
> OpenCode driving
> agent exploring its own config
> probing all 4 Sparks (via ssh) + reporting thermals
> inspecting how vLLM is serving it
> collecting + analyzing its own stats
local AI is awesome
Ahmad • 121,691 views • 2 months ago
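A quick sanity check on the "~400GB" figure in the post above, as a minimal sketch. It assumes FP8 stores one byte per parameter and counts weights only (no KV cache, activations, or runtime overhead); the variable names are illustrative and not taken from the post.

```python
# Rough weights-only footprint for a 397B-parameter model in FP8.
# Assumption: 1 byte per parameter; KV cache and activations are ignored.
total_params = 397e9        # "397B" from the post
active_params = 17e9        # "17B active/token" from the post
bytes_per_param_fp8 = 1     # assumed storage cost of one FP8 weight

weights_gb = total_params * bytes_per_param_fp8 / 1e9   # decimal GB
active_fraction = active_params / total_params          # share of weights used per token

print(f"weights: ~{weights_gb:.0f} GB")
print(f"active per token: {active_fraction:.1%} of parameters")
```

The weights-only number lands just under the quoted ~400GB; whatever memory remains on the cluster has to cover the KV cache and everything else.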

running Claude Code w/ local models on my own GPUs at home
> vLLM serving GLM-4.5 Air
> on 4x RTX 3090s
> nvtop showing live GPU load
> Claude Code generating code + docs
> end-to-end on my AI cluster
this is what local AI actually looks like
Buy a GPU
Ahmad • 261,640 views • 5 months ago

MiniMax M2.7 at home
running on 4x DGX Sparks
vLLM serving full BF16 weights, 200k context
OpenCode having the model monitor its own hardware and report thermals, tokens/sec, TTFT, and other runtime stats in real time
What benchmarks / workflows / things do you wanna see next?
Ahmad • 91,625 views • 2 months ago

How Fast is Gemma 4 on a MacBook Pro M4?
Benchmarking Google's new MoE (26B-A4B)
> Model size: 26.1 GiB
> Load time: ~4.2s
Comparing single request VS concurrent requests performance
> 32k total context, 4 parallel slots
single request behavior
> TTFT: 5.68s
> prompt: 3,701 tokens @ 652 tok/s
> decode: 40.08 tok/s
sequential (1 request at a time):
> avg duration: 20.5s
> p99: 22.1s
> throughput: 40.11 tok/s
> clean finishes: 100%
concurrent (4 parallel requests):
> aggregate throughput: 47.25 tok/s
> total system throughput: 262.27 tok/s
> avg duration: 65.1s
> p95 latency: 68.8s
> req/sec: 0.058
Head-to-Head: Sequential vs Concurrent
throughput:
> 40.11 tok/s → 47.25 tok/s (+17.8%)
> small gain despite 4x parallelism
latency per request:
> 20.5s → 65.1s (~3.2x slower)
> you pay heavily for concurrency
system throughput (true utilization):
> ~40 tok/s → 262 tok/s (~6.5x total output)
> this is where concurrency wins
tokens per second (decode ceiling):
> ~40 tok/s steady in both modes
> hardware-bound, not scheduler-bound
TTFT impact:
> ~5.7s baseline → buried under queueing in concurrent
> “headers waittime” becomes the bottleneck
What does this actually mean?
- You don’t get linear scaling from parallel slots
- You trade latency for total output
- Mac Unified Memory setup is clearly saturating
- Bandwidth + Scheduling overhead show up immediately
This is exactly why GPUs dominate here
Concurrency without killing latency
Ahmad • 87,545 views • 2 months ago
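The Head-to-Head ratios in the Gemma 4 post above can be reproduced from the raw figures it reports. This is a minimal sketch: the numbers are copied from the post, while the variable names and the `ratio` helper are illustrative assumptions.

```python
# Figures as reported in the post; names here are illustrative.
seq_tok_s = 40.11         # sequential throughput, tok/s
conc_tok_s = 47.25        # concurrent "aggregate throughput", tok/s
system_tok_s = 262.27     # concurrent "total system throughput", tok/s
seq_latency_s = 20.5      # sequential avg duration, seconds
conc_latency_s = 65.1     # concurrent avg duration, seconds


def ratio(new: float, old: float) -> float:
    """How many times larger `new` is than `old`."""
    return new / old


throughput_gain_pct = (ratio(conc_tok_s, seq_tok_s) - 1) * 100
latency_slowdown = ratio(conc_latency_s, seq_latency_s)
system_gain = ratio(system_tok_s, seq_tok_s)

print(f"throughput gain:   +{throughput_gain_pct:.1f}%")
print(f"latency slowdown:  ~{latency_slowdown:.1f}x")
print(f"system throughput: ~{system_gain:.1f}x")
```

Rounded to one decimal place, these match the post's +17.8%, ~3.2x, and ~6.5x.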

Running Qwen 3.5 4B on my iPhone 17 Pro Max
Very smart & capable model for how small it is
Very fast as well
Ahmad • 107,405 views • 3 months ago

Impressive demo running on a single box
The future of AI is local btw
Ahmad • 40,078 views • 1 month ago

MASSIVE
llama.cpp now ships with a built-in web UI
stop using ollama, there are no more excuses
Ahmad • 94,975 views • 7 months ago

GPU GIVEAWAY
Buy a GPU × GTC 2026 = Give a GPU
> RTX PRO 6000 Blackwell
> 96GB VRAM • NVFP4
> ~$15K value
> Brand new
If this does well I’ll ask NVIDIA for more GPUs next time… maybe even DGX Sparks
How to enter?
Short clip here
Full clip in the replies
GO GO GO
Ahmad • 36,697 views • 3 months ago

TEN HOURS LEFT TO WIN DGX Sparks & Mac Minis
> ur submission can be as simple as
> sending hello world msg while showing ur laptop
> or as complicated as a room full of hardware
> u have to use ur office chair as a laptop stand
reply with ur submissions so i don’t miss them
Ahmad • 26,838 views • 6 months ago