Running GLM-4.7-Flash on 4 x M4 Pro Mac Minis using EXO Labs. Uses tensor parallelism with RDMA over Thunderbolt & MLX backend (h/t Awni Hannun). Runs at 100 tok/sec. We're working on optimizing this at EXO Labs. Aiming to hit ~200 tok/sec on this setup soon.

Alex Cheema

49,271 subscribers

62,144 views • 4 months ago • via X (Twitter)

News & Politics, Science & Technology

Related Videos

Running GLM 4.7 Flash (8-bit) with Tensor Parallel / RDMA on 2 M4 Pro Mac Minis at 60 tok/sec. mlx-lm 0.30.5 features huge speedups for GLM 4.7 Flash for long context (h/t N8 Programs & Awni Hannun). M5 Pro (~28 Jan) will have ~4x faster prefill and ~1.3x faster decode.

Alex Cheema

55,943 views • 4 months ago

M4 Mac AI Coding Cluster. Uses EXO Labs to run LLMs (here Qwen 2.5 Coder 32B at 18 tok/sec) distributed across 4 M4 Mac Minis (Thunderbolt 5, 80Gbps) and a MacBook Pro M4 Max. Local alternative to Cursor (benchmark comparison soon).

Alex Cheema

517,028 views • 1 year ago

Running GLM-4.7-Flash with OpenCode locally on M4 Max MacBook Pro. 4-bit model runs at 82 tok/sec. Prefill will get ~4x faster with M5 Max MacBook Pro (~28 Jan). EXO will also support disaggregating prefill and decode across devices, e.g. DGX Spark.

Alex Cheema

127,646 views • 4 months ago

Llama 3 running locally on iPhone with MLX. Built by exo intern team Mohamed Baioumy. h/t Awni Hannun (MLX) & Prince Canuma for the port.

Alex Cheema

295,820 views • 2 years ago

GLM-4.7-8bit (350GB) running at 19 tok/s on two M3 Ultra 512GB machines using Tensor Parallelism with EXO - MLX, versus 14 tok/s on a single node. 🚀 Now context benchmarking & then OpenCode tests 🔥 Note: this was built from source, and I had to change things to run it.

Ivan Fioravanti ᯅ

327,687 views • 5 months ago

For even higher throughput and lower latency: batch generation + tensor parallel with mlx-lm and mlx.distributed. Here it's generating at 63 tok/sec (throughput) with GLM 4.7 in 6-bit and batch size 4 on 4 M3 Ultras:

Awni Hannun

22,721 views • 5 months ago

GLM 4.7 Flash is supported in mlx-lm 0.30.3 (h/t Ivan Fioravanti ᯅ). The 4-bit runs fast (43 tok/s generation, ~800 tok/s prefill) on a base M5 32GB laptop.

Awni Hannun

141,194 views • 4 months ago

Qwen3-Coder-Flash runs quite fast on an M4 Max with mlx-lm. Running the 4-bit here, generated 4,467 tokens at >107 tokens/sec:

Awni Hannun

196,482 views • 10 months ago

GLM5 running on a cluster of 4 Mac Studios using tensor parallelism thanks to MLX distributed. Yes, those are the Alex drawers.

Alex Ziskind

20,594 views • 4 months ago

Linear scaling achieved with multiple DeepSeek V3.1 instances: 4x Macs = 4x throughput.

2x M3 Ultra Mac Studios = 1x DeepSeek @ 14 tok/sec
4x M3 Ultra Mac Studios = 2x DeepSeek @ 28 tok/sec

DeepSeek V3.1 is a 671B-parameter model, so at its native 8-bit quantization it requires ~700GB of memory to run. EXO puts half of the layers on each device, combining their memory. EXO uses MLX distributed with a TB5 interconnect, optimized for Apple Silicon. If we need higher throughput, adding two more devices lets us serve more users at once. EXO Labs handles all of this seamlessly, adding more devices to the cluster for linear scaling as we need it. The new EXO 1.0 will be open-source soon™.

Matt Beton

158,485 views • 9 months ago
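
The memory and throughput figures in the DeepSeek V3.1 post above can be checked with a little arithmetic. The following is a minimal Python sketch, not EXO's code: the helper names are hypothetical, and it counts raw weights only, so the gap between the 671 GB it computes and the ~700GB quoted in the post is presumably runtime overhead such as the KV cache (an assumption, not something the post states).

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    Counts raw weights only; KV cache and runtime overhead are ignored.
    """
    return params_billions * bits_per_weight / 8


def per_device_gb(total_gb: float, num_devices: int) -> float:
    """Even split across devices, e.g. half of the layers on each of two machines."""
    return total_gb / num_devices


def aggregate_tok_per_sec(instances: int, tok_per_sec_each: float) -> float:
    """Total throughput when independent model instances serve requests in parallel."""
    return instances * tok_per_sec_each


weights = weight_memory_gb(671, 8)      # 671B parameters at 8 bits each
share = per_device_gb(weights, 2)       # one instance spread over two Mac Studios
total = aggregate_tok_per_sec(2, 14)    # two instances on four Mac Studios

print(f"weights: {weights:.1f} GB, per device: {share:.1f} GB, total: {total} tok/sec")
```

Under these assumptions each 512GB machine holds about 335.5 GB of weights, and two instances at 14 tok/sec each give the 28 tok/sec quoted in the post.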

MiniMax M2.1 in 4-bit cruises on an M3 Ultra with mlx-lm. Generated a space invaders game using 5098 tokens at 47.2 tok/sec:

Awni Hannun

93,826 views • 5 months ago

Thanks for the mention TBPN! We're working on making this run at >40 tok/sec on 2 Mac Studios (the full, unquantized Kimi K2.5). I'm already running Kimi K2.5 myself locally with Claude Code and it's impressive - very close to Opus level.

Alex Cheema

40,471 views • 4 months ago

Update: Qwen3.5:9b head-to-head (MLX, Apple Silicon optimized). Mac Studio M2 Ultra: 89.74 tok/s. Mac Mini M4: 20.82 tok/s. MLX basically doubles the speed on both machines.

stevibe

134,256 views • 3 months ago

GLM 4.6 runs quite fast on an M3 Ultra with mlx-lm even at higher precision. Pretty remarkable that it benchmarks competitively with the just-released Sonnet 4.5. Hope those benchmarks hold up in day-to-day use. Here's a run using a 5.5 bpw quantized model, generating 5.3k tokens at 17+ tok/sec using 244 GB. What prompts should I test?

Awni Hannun

68,539 views • 8 months ago

I did it! It works! Using GLM-4.7-4bit with mlx_lm.server and opencode to fix real code locally! 🔥 Here on a single M3 Ultra 512GB; the next step will be two machines using Tensor Parallelism, then applying the same changes to exo. Prefill is slow on a single machine, but generation is good.

Ivan Fioravanti ᯅ

44,000 views • 5 months ago

Qwen3 235B MoE (22B active) runs so fast on an M2 Ultra with mlx-lm.
- 4-bit model uses ~132GB
- Generated 580 tokens at ~28 tok/sec

Awni Hannun

117,763 views • 1 year ago

First results are in. Llama 4 Maverick 17B active / 400B total is blazing fast with MLX on an M3 Ultra. Here is the 4-bit model generating 1100 tokens at 50 tok/sec:

Awni Hannun

149,847 views • 1 year ago

Kimi K2.5 1T runs on 2 M3 Ultras with mlx-lm in its native precision. It's actually quite usable. Here it's making a space invaders game. Generated 3856 tokens at 21.9 tok/sec using 350GB per machine. Thanks to Tarjei Mandt for the port.

Awni Hannun

241,472 views • 4 months ago

GLM-4.7 runs quite well on an M3 Ultra with mlx-lm, even at a near lossless precision (6-bit here). It generated the best space invaders game I've seen yet for a local model (even included sound effects!). Generated 6600 tokens and ran at 16 tok/s.

Awni Hannun

233,002 views • 5 months ago

DAAUUUMMMM! DeepSeek R1 4-bit on a single Mac Studio 512GB. 18.26 tokens per second with MLX. Took over a minute to load the model, but I sped that up. Generation was great! Thanks Awni Hannun, MLX is the future.

Austin Vance

172,817 views • 1 year ago