Ahmad

@TheAhmadOsman • 68,928 subscribers

Founder & CEO @OsmanticAI — Accelerating Opensource & Self-hosted / Local AI Adoption • I moderate GPUs on r/LocalLLaMA

Shorts

Feels like I am just gonna end up paying the California taxes

12,634 views

Videos

MASSIVE NEWS Teamed up with NVIDIA to make Local AI The Default

458,831 views • 1 month ago

The Desktop Frontier We're gonna get Kimi K3 equivalent intelligence running on a single RTX PRO 6000 in less than 18 months How? Watch this video if you wanna learn how we get there Bookmark for the future

54,945 views • 11 days ago

DROP EVERYTHING The first panel from the Local AI Summit at AIEWF is now live “State of the Union: Why Local, Why Now” Featuring leaders from NVIDIA, Roboflow, Osmantic, Forward Future, and EXO Labs

92,254 views • 20 days ago

INCREDIBLE SPEED
running Claude Code w/ local models on my own GPUs at home
> SGLang serving MiniMax-M2.1
> on 8x RTX 3090s
> nvtop showing live GPU load
> Claude Code generating code + docs
> end-2-end on my AI cluster
MiniMax-M2.1 is my favorite model to run locally nowadays

593,078 views • 6 months ago

running Claude Code w/ local models on my own GPUs at home
> vLLM serving GLM-4.5 Air
> on 4x RTX 3090s
> nvtop showing live GPU load
> Claude Code generating code + docs
> end-to-end on my AI cluster
this is what local AI actually looks like
Buy a GPU

262,226 views • 6 months ago

running Qwen3.5 397B MoE (17B active/token) on 4x DGX Sparks in FP8 (~400GB)
> OpenCode driving
> agent exploring its own config
> probing all 4 Sparks (via ssh) + reporting thermals
> inspecting how vLLM is serving it
> collecting + analyzing its own stats
local AI is awesome

122,543 views • 3 months ago
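
The "~400GB" figure in the Qwen3.5 post can be sanity-checked with a rough estimate. The sketch below assumes FP8 stores one byte per parameter and that each DGX Spark offers 128 GB of unified memory; the 128 GB figure is not stated in the post, and the function names are made up for illustration. It counts weights only and ignores KV cache, activations, and runtime overhead, so real headroom would be smaller.

```python
# Rough weight-memory estimate for the 4x DGX Spark setup in the post.
# Assumptions (not from the post): FP8 = 1 byte per parameter, and
# 128 GB of unified memory per DGX Spark. Weights only; KV cache,
# activations and runtime overhead are ignored.


def weight_footprint_gb(params_billion: float, bytes_per_param: float) -> float:
    """Decimal gigabytes needed to hold the weights alone."""
    return params_billion * bytes_per_param


def pooled_memory_gb(nodes: int, gb_per_node: float) -> float:
    """Total memory across all nodes, assuming it can be pooled."""
    return nodes * gb_per_node


weights_gb = weight_footprint_gb(397, 1.0)   # 397B params in FP8
pool_gb = pooled_memory_gb(4, 128)           # 4 Sparks, assumed 128 GB each
headroom_gb = pool_gb - weights_gb

print(f"weights:  {weights_gb:.0f} GB")
print(f"pool:     {pool_gb:.0f} GB")
print(f"headroom: {headroom_gb:.0f} GB")
```

Under those assumptions the weights come to about 397 GB, which matches the post's "~400GB", leaving roughly 115 GB across the four machines for everything else.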

MiniMax M2.7 at home running on 4x DGX Sparks
vLLM serving full BF16 weights, 200k context
OpenCode having the model monitor its own hardware and report thermals, tokens/sec, TTFT, and other runtime stats in real time
What benchmarks / workflows / things do you wanna see next?

92,491 views • 3 months ago

Running Qwen 3.5 4B on my iPhone 17 Pro Max Very smart & capable model for how small it is Very fast as well

107,511 views • 4 months ago

How Fast is Gemma 4 on a MacBook Pro M4?
Benchmarking Google's new MoE (26B-A4B)
> Model size: 26.1 GiB
> Load time: ~4.2s
Comparing single request VS
> concurrent requests performance
> 32k total context, 4 parallel slots
single request behavior
> TTFT: 5.68s
> prompt: 3,701 tokens @ 652 tok/s
> decode: 40.08 tok/s
sequential (1 request at a time):
> avg duration: 20.5s
> p99: 22.1s
> throughput: 40.11 tok/s
> clean finishes: 100%
concurrent (4 parallel requests):
> aggregate throughput: 47.25 tok/s
> total system throughput: 262.27 tok/s
> avg duration: 65.1s
> p95 latency: 68.8s
> req/sec: 0.058
Head-to-Head: Sequential vs Concurrent
throughput:
> 40.11 tok/s → 47.25 tok/s (+17.8%)
> small gain despite 4x parallelism
latency per request:
> 20.5s → 65.1s (~3.2x slower)
> you pay heavily for concurrency
system throughput (true utilization):
> ~40 tok/s → 262 tok/s (~6.5x total output)
> this is where concurrency wins
tokens per second (decode ceiling):
> ~40 tok/s steady in both modes
> hardware-bound, not scheduler-bound
TTFT impact:
> ~5.7s baseline → buried under queueing in concurrent
> “headers waittime” becomes the bottleneck
What this actually means?
- You don’t get linear scaling from parallel slots
- You trade latency for total output
- Mac Unified Memory setup is clearly saturating
- Bandwidth + Scheduling overhead show up immediately
This is exactly why GPUs dominate here
Concurrency without killing latency

88,866 views • 3 months ago
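
The head-to-head percentages in the Gemma 4 benchmark post can be re-derived from the raw figures it reports. The sketch below only redoes that arithmetic; the numbers are copied from the post, nothing is measured independently, and the variable and function names are made up for illustration.

```python
# Recompute the head-to-head ratios quoted in the Gemma 4 benchmark post.
# All inputs are the figures reported in the post itself.

SEQ_THROUGHPUT = 40.11    # tok/s, sequential (1 request at a time)
CONC_THROUGHPUT = 47.25   # tok/s, "aggregate throughput", 4 parallel requests
CONC_SYSTEM = 262.27      # tok/s, "total system throughput"
SEQ_DURATION = 20.5       # s, average request duration, sequential
CONC_DURATION = 65.1      # s, average request duration, concurrent


def pct_change(new: float, old: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


def ratio(new: float, old: float) -> float:
    """How many times larger new is than old."""
    return new / old


throughput_gain_pct = pct_change(CONC_THROUGHPUT, SEQ_THROUGHPUT)
latency_slowdown = ratio(CONC_DURATION, SEQ_DURATION)
system_gain = ratio(CONC_SYSTEM, SEQ_THROUGHPUT)

print(f"throughput gain:   +{throughput_gain_pct:.1f}%")  # +17.8%
print(f"latency slowdown:  {latency_slowdown:.1f}x")      # 3.2x
print(f"system throughput: {system_gain:.1f}x")           # 6.5x
```

The recomputed values agree with the post's +17.8%, ~3.2x, and ~6.5x.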

Had a great time today getting hosted by Hugging Face to present and demo on Local AI, Inference Engines and ODS Huge shoutout to merve for all her great work on educating people about Local and Opensource AI, and to my partner in crime Mike Bradley

10,225 views • 10 days ago

While everyone is talking about GPT-5.4 Thinking and GPT-5.4 Pro I wanna remind you that I am GIVING AWAY this $15,000 GPU So you can run your AI at home instead of sending your data to OpenAI, Anthropic, etc COMPLETELY FREE Take a min to sign up below & this could be yours

91,558 views • 4 months ago

Egypt is beating Argentina let’s go!!!

16,846 views • 24 days ago

I asked Jensen whether we will see more Nemotron models or if the recent releases were just to prove NVFP4 training works

59,656 views • 4 months ago

Impressive demo running on a single box The future of AI is local btw

40,324 views • 3 months ago

MASSIVE llama.cpp now ships with a built-in web UI stop using ollama, there are no more excuses

95,055 views • 8 months ago

GPU GIVEAWAY
Buy a GPU × GTC 2026 = Give a GPU
> RTX PRO 6000 Blackwell
> 96GB VRAM • NVFP4
> ~$15K value
> Brand new
If this does well I’ll ask NVIDIA for more GPUs next time… maybe even DGX Sparks
How to enter?
Short clip here
Full clip in the replies
GO GO GO

36,697 views • 4 months ago

NVIDIA AI pulled me in for an interview at GTC this week

27,664 views • 4 months ago

TEN HOURS LEFT TO WIN DGX Sparks & Mac Minis
> ur submission can be as simple as
> sending hello world msg while showing ur laptop
> or as complicated as a room full of hardware
> u have to use ur office chair as a laptop stand
reply with ur submissions so i don’t miss them

26,838 views • 7 months ago
