Running GLM-4.7-Flash with OpenCode locally on M4 Max MacBook Pro. 4-bit model runs at 82 tok/sec. Prefill will get ~4x faster with M5 Max MacBook Pro (~28 Jan). EXO will also support disaggregating prefill and decode across devices, e.g. DGX Spark.

Alex Cheema

50,434 subscribers

128,404 views • 6 months ago • via X (Twitter)

Education, Science & Technology

Related Videos

Running GLM 4.7 Flash (8-bit) with Tensor Parallel / RDMA on 2 M4 Pro Mac Minis at 60 tok/sec. mlx-lm 0.30.5 features huge speedups for GLM 4.7 Flash for long context (h/t N8 Programs & Awni Hannun). M5 Pro (~28 Jan) will have ~4x faster prefill and ~1.3x faster decode.

Alex Cheema

56,555 views • 6 months ago

Running GLM-4.7-Flash on 4 x M4 Pro Mac Minis using EXO Labs. Uses tensor parallelism with RDMA over Thunderbolt & MLX backend (h/t Awni Hannun). Runs at 100 tok/sec. We're working on optimizing this at EXO Labs. Aiming to hit ~200 tok/sec on this setup soon.

Alex Cheema

62,221 views • 6 months ago

GLM-5.1-478B-NVFP4 running on:
- 4x RTX Pro 6000
- SGLang
- 370,000 max tokens (1.75x full context)
- p10 27.7 | p90 45.6 tok/s decode (gen)
- 1340 tok/s prefill

I could get 2x decode if I limit to 64k context (100 tok/s).

In this video it operates Figma (:

0xSero

74,772 views • 3 months ago
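
The prefill and decode rates quoted in the post above translate into wall-clock time by simple division. A minimal sketch, assuming prefill and decode run back to back at constant rates (real runs vary with context length and batching). The function name `request_seconds` and the 1,000-token reply length are illustrative choices, not from the post:

```python
def request_seconds(prompt_tokens: int, output_tokens: int,
                    prefill_tps: float, decode_tps: float) -> float:
    """Estimate total latency: prompt processing time plus generation time."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps

PREFILL_TPS = 1340.0                 # prefill rate from the post
DECODE_P10, DECODE_P90 = 27.7, 45.6  # decode percentiles from the post

# Time to ingest a full 370,000-token prompt before any output appears.
full_prefill = request_seconds(370_000, 0, PREFILL_TPS, DECODE_P10)

# Hypothetical 1,000-token reply at the slow and fast decode percentiles.
slow = request_seconds(0, 1_000, PREFILL_TPS, DECODE_P10)
fast = request_seconds(0, 1_000, PREFILL_TPS, DECODE_P90)

print(f"full-context prefill: {full_prefill:.0f} s")
print(f"1,000 output tokens: {fast:.1f} s to {slow:.1f} s")
```

Under these assumptions, filling the whole 370,000-token window takes about 4.6 minutes, which is why prefill speed matters so much for long-context agent sessions.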

M4 Mac AI Coding Cluster

Uses EXO Labs to run LLMs (here Qwen 2.5 Coder 32B at 18 tok/sec) distributed across 4 M4 Mac Minis (Thunderbolt 5 80Gbps) and a MacBook Pro M4 Max. Local alternative to Cursor (benchmark comparison soon).

Alex Cheema

517,612 views • 1 year ago

WIP: First attempt to speed up prefill for Flash-MoE. The original repo ran token-by-token without streamed experts.

Added: batched linear attention + batched full attention (Flash Attention style) with custom Metal kernels.

Without experts: 6.2x faster prefill (11 -> 68 tok/s)
With experts at full-attn layers only: 1.9x faster (11 -> 20.5 tok/s), same output quality

Qwen3.5-397B, 4-bit, 209GB, M5 Max 128GB

1/3

Anemll

19,572 views • 4 months ago

Running Llama-3-70B at home with exo intern

Combines the compute of all these devices to make one big GPU:
- iPhone 15 Pro Max
- iPad Pro M4
- Galaxy S24 Ultra
- MacBook Pro M2 and M3 Pro
- 2 x MSI NVIDIA GeForce RTX 4090 SUPRIM

Code is open source 👇

Alex Cheema

197,476 views • 2 years ago

Qwen3-Coder-Flash runs quite fast on an M4 Max with mlx-lm. Running the 4-bit here, generated 4,467 tokens at >107 tokens/sec:

Awni Hannun

196,508 views • 1 year ago

I did it! It works! Using GLM-4.7-4bit with mlx_lm.server and opencode to fix real code locally! 🔥 This is on a single M3 Ultra 512GB; the next phase will be 2 machines using Tensor Parallelism, then applying the same changes to exo. Prefill is slow on a single machine, but generation is good.

Ivan Fioravanti ᯅ

44,000 views • 6 months ago

DS4 running on DGX Spark (GB10 / CUDA), private branch for now. 12 tokens/sec; memory bandwidth is the limit on this system, at 270GB/sec. But prefill is much more in line with the M3 Max, at ~200 t/s. I'll release it when it's more mature, but it is almost certain that it will get merged.

antirez

84,069 views • 2 months ago

What does it take to run 3, 5, or even 10 concurrent instances of Gemma 4 locally? We've open-sourced a demo letting you run multiple models side-by-side on your hardware. Gemma 4 26B A4B easily runs 10+ concurrent requests on a MacBook Pro M4 Max at 18 tokens/sec per request.

Google Gemma

913,170 views • 3 months ago

I tested Gemma 4 12b on my Macbook Pro M5 Max with 128GB of unified memory. It runs VERY slow for a 12b parameter model. 44 tokens/second. The lava lamp that it produced was also terrible. I would stick to Qwen 3.6 27b or Qwen 3.6 35b. Google is still way behind, even with their local models.

BridgeMind

70,325 views • 1 month ago

Qwen 3.6 35B running at over 100 tokens per second on my $5,399 MacBook Pro M5 Max. This is the best local AI model I have ever run. 128GB of unified memory. No cloud. No API costs. No rate limits. Just raw local inference at speeds I didn't think were possible on a laptop. This model is more intelligent than GPT 5 on benchmarks. Running locally. On a MacBook. For free after the hardware cost. I said local AI would never compete with frontier. I'm starting to rethink that. The gap is closing faster than anyone expected.

BridgeMind

62,800 views • 2 months ago

Linear scaling achieved with multiple DeepSeek v3.1 instances. 4x Macs = 4x throughput.

2x M3 Ultra Mac Studios = 1x DeepSeek @ 14 tok/sec
4x M3 Ultra Mac Studios = 2x DeepSeek @ 28 tok/sec

DeepSeek V3.1 is a 671B parameter model, so at its native 8-bit quantization it requires ~700GB of memory to run. EXO puts half of the layers on each device, combining their memory. EXO uses MLX distributed with TB5 interconnect, optimized for Apple Silicon.

If we need higher throughput, adding two more devices lets us serve more users at once. EXO Labs handles all of this seamlessly, adding more devices to the cluster for linear scaling as we need it.

The new EXO 1.0 will be open-source soon™

Matt Beton

158,485 views • 11 months ago
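
The memory and scaling figures in the post above can be checked with back-of-the-envelope arithmetic. A rough sketch, assuming one byte per parameter at 8-bit and counting weights only; the gap between the result and the post's ~700GB is presumably KV cache and runtime overhead, which this does not model. The helper names are hypothetical:

```python
def weight_memory_gb(params_billion: float, bits_per_param: float) -> float:
    """Weights-only footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_param / 8

def cluster_throughput(devices: int, devices_per_instance: int,
                       tps_per_instance: float) -> tuple[int, float]:
    """Number of full model replicas that fit, and their combined tok/sec."""
    instances = devices // devices_per_instance
    return instances, instances * tps_per_instance

weights = weight_memory_gb(671, 8)   # 671B parameters at 8 bits each
per_device = weights / 2             # half of the layers on each device

print(f"weights: {weights:.0f} GB, per device: {per_device:.1f} GB")
for n in (2, 4):
    # 2 devices per replica and 14 tok/sec per replica, as quoted in the post
    print(n, "devices ->", cluster_throughput(n, 2, 14.0))
```

The throughput here scales by adding independent replicas, each spanning two machines, so it raises the number of users served at once rather than the speed of any single request.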

ParoQuant just got a big upgrade 🚀

✅ Supports the new Qwen3.5 models
⚡ Now runs on MLX (fast local inference on Apple Silicon)
🧠 Preserves reasoning quality with 4-bit quantization

We also built an agent demo running locally on my 4-year-old M2 Max. Can't wait to upgrade to an M5 Max and see what kind of magic we can do. ✨

Zhijian Liu

49,358 views • 4 months ago

MiniMax M2.7 is 230B params. Can you actually run it at home?

I tested Unsloth's UD-IQ3_XXS (80GB) on 4 different rigs:

🟠 4x RTX 4090 (96GB): 71.52 tok/s, TTFT 1045ms
🟢 4x RTX 5090 (128GB): 120.54 tok/s, TTFT 725ms
🟡 1x RTX PRO 6000 (96GB): 118.74 tok/s, TTFT 765ms
🟣 DGX Spark (128GB): 24.41 tok/s, TTFT 741ms

Backend: llama.cpp. Context: 32k. Max tokens: 4096.

I went with IQ3_XXS because it's the biggest quant that fits in 96GB VRAM while still leaving safe headroom for 32k context. Same quant across all four rigs, fairest comparison I could run.

Now look at rough peak GPU power draw:

🟠 4x4090 → 1,800W peak (450W × 4)
🟢 4x5090 → 2,300W peak (575W × 4)
🟡 RTX PRO 6000 → 600W peak
🟣 DGX Spark → 240W peak (whole system)

The RTX PRO 6000 is the quiet winner. One card, 96GB, matching a 4x5090 rig at roughly a quarter of the power and zero multi-GPU headaches. Best tokens-per-watt by a wide margin.

DGX Spark is slow on generation but pulls the least power of any rig here, around 240W for the whole system. Prefill-friendly, memory-rich, wall-socket-friendly.

And yes, plenty of people cap their cards. Even then, 4x 4090 or 4x 5090 still pulls well over 1,200W from the GPUs alone.

stevibe

191,782 views • 3 months ago
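
The tokens-per-watt claim in the post above can be reproduced from its own numbers. A rough sketch: the power values are the post's peak figures rather than measured draw during the benchmark, and the GPU rigs count GPUs only while the DGX Spark figure covers the whole system, so treat the result as an approximate ranking. The function name is illustrative:

```python
# (decode tok/s, rough peak watts) as quoted in the post
rigs = {
    "4x RTX 4090": (71.52, 1800),
    "4x RTX 5090": (120.54, 2300),
    "1x RTX PRO 6000": (118.74, 600),
    "DGX Spark": (24.41, 240),
}

def tokens_per_joule(tok_per_sec: float, watts: float) -> float:
    """tok/s divided by W gives tokens per joule (W = J/s)."""
    return tok_per_sec / watts

efficiency = {name: tokens_per_joule(*v) for name, v in rigs.items()}
ranking = sorted(efficiency, key=efficiency.get, reverse=True)

for name in ranking:
    print(f"{name:16s} {efficiency[name]:.3f} tok/J")
```

On these figures the single RTX PRO 6000 comes out first and the DGX Spark second, ahead of both four-card rigs, which matches the post's conclusion.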

Google's Gemma 4 26B A4B QAT hits 25+ tokens/sec and 320+ tokens/sec prefill on 8 GB VRAM (RTX 4060) + 16 GB RAM using TurboQuant

Prefill just went from 200 → 320+ tok/s on the same 8GB card. 1.6x, no new hardware, no new quant, just a KV cache trick stacked on top of the Gemma 4 26B MoE setup from a few days ago.

A few days ago I posted Gemma 4 26B A4B hitting 28 tok/s decode on 8GB VRAM using native MTP. Prefill was stuck around 200 tok/s. Fair callout by the community.

So today I tested something I'd already been meaning to try: TheTom/llama-cpp-turboquant, the TurboQuant KV cache fork by Tom Turney (GitHub link in the comments). Thanks to him, the fork just got resynced to mainline, so MTP + TurboQuant now run together cleanly (I didn't see any meaningful gains from using MTP with this setup, but you can try).

The flags (no MTP):

    -m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf -cnv -c 64000 --cache-type-k q8_0 --cache-type-v turbo3

Results on the same RTX 4060 8GB, tested with a 27k token prompt at 64k context loaded:

Prefill: 200 tok/s → 320+ tok/s
Decode: stayed above 25 tok/s (without MTP)

Why it works: TurboQuant uses Walsh-Hadamard rotation + polar quantization on the KV cache. Keys are sensitive to compression, values much less so, so it splits the difference: K stays at q8_0, V drops to turbo3 (~3 bits).

Bonus from the memory savings: the same 8GB card can now stretch to 100-120k context with minimal decode penalty. It should now be snappier with any agent harness, such as hermes agent, without compromising on intelligence.

If you're already running Gemma 4 on a small card, this stacks on top for free. Try --cache-type-k q8_0 --cache-type-v turbo3 on your setup and report back what your prefill/decode split looks like.

Unsloth model GGUF and llama.cpp TurboQuant fork links in the comments. What's your prefill number before vs after?

Alok

119,821 views • 1 month ago
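
To put the prefill change in the post above into waiting time, and to see roughly how much the mixed K/V cache types shrink the cache, here is a small sketch. It assumes an f16 cache as the baseline (the post does not state its baseline), uses the post's approximate ~3 bits for turbo3, and treats "320+" as exactly 320, so the "after" time is an upper bound:

```python
def prefill_seconds(prompt_tokens: int, tok_per_sec: float) -> float:
    """Time to process the prompt before the first output token."""
    return prompt_tokens / tok_per_sec

PROMPT = 27_000  # the 27k-token test prompt from the post

before = prefill_seconds(PROMPT, 200)  # prefill rate before TurboQuant
after = prefill_seconds(PROMPT, 320)   # prefill rate after

# Average bits per cached element across K and V.
# q8_0 stores 32 int8 values plus one fp16 scale per block: 8.5 bits/value.
# The ~3 bits for turbo3 is the post's approximate figure.
K_BITS, V_BITS = 8.5, 3.0
F16_BITS = 16.0  # assumed baseline: both K and V kept in f16
kv_ratio = (K_BITS + V_BITS) / (2 * F16_BITS)

print(f"prefill: {before:.0f} s -> {after:.1f} s ({before / after:.1f}x)")
print(f"KV cache size vs f16: {kv_ratio:.0%}")
```

The sketch only covers the arithmetic; it says nothing about output quality at ~3-bit values, which would need to be measured separately.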

llama3 8B (not quantized) running on a heterogeneous home cluster made of:
- iPhone 15 Pro Max
- iPad Pro (not sure which version XD)
- MacBook Pro (M1 Max)
- NVIDIA GeForce 3080 (not visible in video)
- 2x NVIDIA Titan X Pascal

Very soon also supporting Android (I *have* to also add my NVIDIA Shield GPU!!!!!).

Single code base, single model format (reduced and optimally distributed to every node to save space). Everything (including iOS code) is open here ... it would be really nice, with the help of the community, to take this project to the next level in terms of optimization and support.

My vision is a distributed inference server that can run any model on any backend in any cluster topology. Let's fight programmed obsolescence and democratize inference!

Simone Margaritelli

304,072 views • 2 years ago

dflash-mlx v0.1.7 is out. Big adaptive-runtime update, still focused mostly on Qwen3.6 27B 4-bit.

@ 2048 tokens, M5 Max, stock mlx_lm baseline:

► 1024: 33.26 → 98.05 tok/s (x2.95)
► 2048: 32.34 → 90.67 tok/s (x2.81)
► 4096: 30.58 → 93.55 tok/s (x3.06)
► 8192: 26.03 → 79.12 tok/s (x3.04)
► 16384: 21.50 → 60.77 tok/s (x2.78)

Main change: adaptive verify got a lot smarter. Instead of blindly trying to verify large 16-token blocks all the time, DFlash now watches acceptance + tokens/cycle + real cycle cost. When the draft gets weaker, it drops to smaller 4-token blocks, then probes back up only when the recent cycles make sense.

In practice: less wasted verify work, better long-context behavior, and much more useful metrics to understand what is happening.

► retuned adaptive verify for long-context / agentic decode
► richer metrics: tokens/cycle, adaptive block state, CopySpec counters
► /metrics now has real decode avg + logical/real/restored prefill rates
► AIME25 benchmark suite with exact integer scoring
► Qwen thinking default now follows tokenizer/request behavior
► GDN recurrent exactness fixes

I also started running AIME25-style long generations. Even around 45k generated tokens, I was still seeing ~40 tok/s on 27B 4-bit.

Over the next few days I’ll share more demos: AIME runs, real OpenCode game/project sessions, and full metrics along the way.

Still optimizing hard for 27B 4-bit first, while working on custom kernels per Apple GPU generation so more machines can benefit.

bstn 👁️

16,334 views • 2 months ago
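
The adaptive verify behaviour described in the post above (drop from 16-token to 4-token blocks when the draft weakens, probe back up when recent cycles look good) can be illustrated with a toy controller. This is not DFlash's actual logic: the class name, window size, and thresholds are invented for illustration, and it looks only at acceptance rate, whereas the post says DFlash also weighs tokens/cycle and real cycle cost.

```python
from collections import deque

class AdaptiveBlock:
    """Toy controller that switches the verify block size between two values."""

    def __init__(self, large=16, small=4, window=4, low=0.4, high=0.7):
        self.large, self.small = large, small
        self.low, self.high = low, high   # hypothetical acceptance thresholds
        self.block = large
        self.recent = deque(maxlen=window)

    def update(self, accepted: int) -> int:
        """Record accepted draft tokens for this cycle; return the next block size."""
        self.recent.append(accepted / self.block)
        if len(self.recent) < self.recent.maxlen:
            return self.block  # not enough history yet
        avg = sum(self.recent) / len(self.recent)
        if self.block == self.large and avg < self.low:
            self.block = self.small   # draft is weak: verify smaller blocks
            self.recent.clear()
        elif self.block == self.small and avg > self.high:
            self.block = self.large   # draft recovered: probe back up
            self.recent.clear()
        return self.block

ctrl = AdaptiveBlock()
weak = [ctrl.update(3) for _ in range(4)]    # only 3 of 16 accepted per cycle
strong = [ctrl.update(4) for _ in range(4)]  # all 4 of 4 accepted per cycle
print(weak, strong)
```

Clearing the history on each switch keeps acceptance rates measured at one block size from influencing decisions at the other.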

Gemma 4 31B is way too slow on my MacBook Pro M5 Max with 128GB of unified memory. 42 seconds to generate the first line of code. 21 tokens per second after that. Uses 26GB of memory when loaded. I sat there waiting nearly a minute before it even started writing code. The output? Actually decent. The lava lamp it generated looks good. But 42 seconds to first token is a dealbreaker. Claude Opus 4.7 starts generating in under 2 seconds. GPT 5.5 is instant. Good model. Unusable speed. Local inference on a $5,399 machine still can't match the cloud.

BridgeMind

40,178 views • 2 months ago