Loading video...

Video Failed to Load

There was a problem loading this video. This could be due to a temporary network issue or the video might be unavailable.

For even higher throughput and lower latency: batch generation + tensor parallel with mlx-lm + and mlx.distributed. Here it's generating at 63 tok/sec (throughput) with GLM 4.7 in 6-bit and batch size 4 on 4 M3 Ultras:

Awni Hannun

43,893 subscribers

22,721 views • 7 months ago •via X (Twitter)

Anya Rossi• Live Now

Private livecam show

0 Comments

No comments available

Comments from the original post will appear here

Related Videos

Running GLM 4.7 Flash (8-bit) with Tensor Parallel / RDMA on 2 M4 Pro Mac Minis at 60 tok/sec. mlx-lm 0.30.5 features huge speedups for GLM 4.7 Flash for long context (h/t N8 Programs & Awni Hannun). M5 Pro (~28 Jan) will have ~4x faster prefill and ~1.3x faster decode.

Running GLM 4.7 Flash (8-bit) with Tensor Parallel / RDMA on 2 M4 Pro Mac Minis at 60 tok/sec. mlx-lm 0.30.5 features huge speedups for GLM 4.7 Flash for long context (h/t N8 Programs & Awni Hannun). M5 Pro (~28 Jan) will have ~4x faster prefill and ~1.3x faster decode.

Alex Cheema

56,555 views • 6 months ago

GLM-4.7 runs quite well on an M3 Ultra with mlx-lm, even at a near lossless precision (6-bit here). It generated the best space invaders game I've seen yet for a local model (even included sound effects!). Generated 6600 tokens and ran at 16 tok/s.

GLM-4.7 runs quite well on an M3 Ultra with mlx-lm, even at a near lossless precision (6-bit here). It generated the best space invaders game I've seen yet for a local model (even included sound effects!). Generated 6600 tokens and ran at 16 tok/s.

Awni Hannun

237,169 views • 7 months ago

MiniMax M2.1 in 4-bit cruises on an M3 Ultra with mlx-lm. Generated a space invaders game using 5098 tokens at 47.2 tok/sec:

MiniMax M2.1 in 4-bit cruises on an M3 Ultra with mlx-lm. Generated a space invaders game using 5098 tokens at 47.2 tok/sec:

Awni Hannun

93,826 views • 7 months ago

First results are in. Llama 4 Maverick 17B active / 400B total is blazing fast with MLX on an M3 Ultra. Here is the 4-bit model generating 1100 tokens at 50 tok/sec:

First results are in. Llama 4 Maverick 17B active / 400B total is blazing fast with MLX on an M3 Ultra. Here is the 4-bit model generating 1100 tokens at 50 tok/sec:

Awni Hannun

149,855 views • 1 year ago

GLM 4.7 Flash is supported in mlx-lm 0.30.3 (h/t Ivan Fioravanti ᯅ) The 4-bit runs fast (43 tok/s generation, ~800 tok/s prefill) on a base M5 32GB laptop.

GLM 4.7 Flash is supported in mlx-lm 0.30.3 (h/t Ivan Fioravanti ᯅ) The 4-bit runs fast (43 tok/s generation, ~800 tok/s prefill) on a base M5 32GB laptop.

Awni Hannun

141,194 views • 6 months ago

DeepSeek R1 671B running on 2 M2 Ultras faster than reading speed. Getting close to open-source O1, at home, on consumer hardware. With mlx.distributed and mlx-lm, 3-bit quantization (~4 bpw)

DeepSeek R1 671B running on 2 M2 Ultras faster than reading speed. Getting close to open-source O1, at home, on consumer hardware. With mlx.distributed and mlx-lm, 3-bit quantization (~4 bpw)

Awni Hannun

866,117 views • 1 year ago

The new Deep Seek V3 0324 in 4-bit runs at > 20 toks/sec on a 512GB M3 Ultra with mlx-lm!

The new Deep Seek V3 0324 in 4-bit runs at > 20 toks/sec on a 512GB M3 Ultra with mlx-lm!

Awni Hannun

168,896 views • 1 year ago

The new Kimi K2 1T model (4-bit quant) runs on 2 512GB M3 Ultras with mlx-lm and mx.distributed. 1 trillion params, at a speed that's actually quite usable:

The new Kimi K2 1T model (4-bit quant) runs on 2 512GB M3 Ultras with mlx-lm and mx.distributed. 1 trillion params, at a speed that's actually quite usable:

Awni Hannun

238,057 views • 1 year ago

QLoRA fine-tuning 4-bit Gemma 2B on iPhone 15 Pro with MLX Swift. A nice size for fine-tuning on device, getting 70-100 toks/sec depending on the batch. Guide here:

QLoRA fine-tuning 4-bit Gemma 2B on iPhone 15 Pro with MLX Swift. A nice size for fine-tuning on device, getting 70-100 toks/sec depending on the batch. Guide here:

Awni Hannun

19,530 views • 2 years ago

Running GLM-4.7-Flash on 4 x M4 Pro Mac Minis using EXO Labs. Uses tensor parallelism with RDMA over Thunderbolt & MLX backend (h/t Awni Hannun). Runs at 100 tok/sec. We're working on optimizing this at EXO Labs. Aiming to hit ~200 tok/sec on this setup soon.

Running GLM-4.7-Flash on 4 x M4 Pro Mac Minis using EXO Labs. Uses tensor parallelism with RDMA over Thunderbolt & MLX backend (h/t Awni Hannun). Runs at 100 tok/sec. We're working on optimizing this at EXO Labs. Aiming to hit ~200 tok/sec on this setup soon.

Alex Cheema

62,221 views • 6 months ago

Kimi K2.5 1T runs on 2 M3 Ultras with mlx-lm in it's native precision. It's actually quite usable. Here it's making a space invaders game. Generated 3856 tokens at 21.9 tok/sec using 350GB per machine. Thanks to Tarjei Mandt for the port.

Kimi K2.5 1T runs on 2 M3 Ultras with mlx-lm in it's native precision. It's actually quite usable. Here it's making a space invaders game. Generated 3856 tokens at 21.9 tok/sec using 350GB per machine. Thanks to Tarjei Mandt for the port.

Awni Hannun

241,472 views • 6 months ago

GLM 4.6 runs quite fast on an M3 Ultra with mlx-lm even at higher precision. Pretty remarkable that it benchmarks competitive to the just-released Sonnet 4.5. Hope those benchmarks hold-up in day-to-day use. Here's a run using 5.5 bpw quantized model, generating 5.3k tokens at 17+ tok/sec using 244 GB. What prompts should I test?

GLM 4.6 runs quite fast on an M3 Ultra with mlx-lm even at higher precision. Pretty remarkable that it benchmarks competitive to the just-released Sonnet 4.5. Hope those benchmarks hold-up in day-to-day use. Here's a run using 5.5 bpw quantized model, generating 5.3k tokens at 17+ tok/sec using 244 GB. What prompts should I test?

Awni Hannun

68,539 views • 10 months ago

Llama 3.3 70B 4-bit runs nicely on a 64GB M3 Max with in MLX LM (~10 toks/sec). Would be even faster on an M4 Max. Yesterday's server-only 405B is today's laptop 70B:

Llama 3.3 70B 4-bit runs nicely on a 64GB M3 Max with in MLX LM (~10 toks/sec). Would be even faster on an M4 Max. Yesterday's server-only 405B is today's laptop 70B:

Awni Hannun

110,534 views • 1 year ago

GLM-5 runs with mlx-lm on a single 512GB M3 Ultra in Q4. It's quite good in my initial testing and pretty fast as well. It generated a highly functional space invaders game using 7.1k tokens at 15.4 tok/s and 419GB memory. Thanks to Gökdeniz Gülmez and Tarjei Mandt for the port.

GLM-5 runs with mlx-lm on a single 512GB M3 Ultra in Q4. It's quite good in my initial testing and pretty fast as well. It generated a highly functional space invaders game using 7.1k tokens at 15.4 tok/s and 419GB memory. Thanks to Gökdeniz Gülmez and Tarjei Mandt for the port.

Awni Hannun

60,599 views • 5 months ago

Qwen3-Coder-Flash runs quite fast on an M4 Max with mlx-lm. Running the 4-bit here, generated 4,467 tokens at >107 tokens/sec:

Qwen3-Coder-Flash runs quite fast on an M4 Max with mlx-lm. Running the 4-bit here, generated 4,467 tokens at >107 tokens/sec:

Awni Hannun

196,508 views • 1 year ago

A long time coming but new mlx-lm is here with better batching support in the server and Gemma 4. pip install -U mlx-lm Here is a video where a single M3 Ultra serves 5 opencode sessions with Gemma 4 26B that process ~130k tokens in ~1.5 minutes.

A long time coming but new mlx-lm is here with better batching support in the server and Gemma 4. pip install -U mlx-lm Here is a video where a single M3 Ultra serves 5 opencode sessions with Gemma 4 26B that process ~130k tokens in ~1.5 minutes.

Angelos Katharopoulos

66,202 views • 3 months ago

Kimi K2.6 was released 1h ago, and it looks amazing! Here it's running with MLX (mlx-vlm) on two M3 Ultras (full 1T param VLM) 🔥

Kimi K2.6 was released 1h ago, and it looks amazing! Here it's running with MLX (mlx-vlm) on two M3 Ultras (full 1T param VLM) 🔥

Pedro Cuenca

65,768 views • 3 months ago

DeepSeek R1 (the full 680B model) runs nicely in higher quality 4-bit on 3 M2 Ultras with MLX. Asked it a coding question and it thought for ~2k tokens and generated 3500 tokens overall:

DeepSeek R1 (the full 680B model) runs nicely in higher quality 4-bit on 3 M2 Ultras with MLX. Asked it a coding question and it thought for ~2k tokens and generated 3500 tokens overall:

Awni Hannun

997,247 views • 1 year ago

DeepSeek R1 (the 680B MOE) is ~20% faster in the latest mlx / mlx-lm. 4-bit model on 3 M2 Ultras generates 4k tokens at a respectable 15 toks/sec. Plus some QoL improvements: - Only downloads the local shard (much faster startup) - Distributed launcher ships with MLX

DeepSeek R1 (the 680B MOE) is ~20% faster in the latest mlx / mlx-lm. 4-bit model on 3 M2 Ultras generates 4k tokens at a respectable 15 toks/sec. Plus some QoL improvements: - Only downloads the local shard (much faster startup) - Distributed launcher ships with MLX

Awni Hannun

86,921 views • 1 year ago

Few seconds and <4GB to full fine tune Gemma 3 270M for classification tasks! Here on M3 Ultra! I'll keep experimenting and checking final result! base model, batch size 4, adaw (testing with adafactor too!), rank 128. Video at normal speed here! 🔥

Few seconds and <4GB to full fine tune Gemma 3 270M for classification tasks! Here on M3 Ultra! I'll keep experimenting and checking final result! base model, batch size 4, adaw (testing with adafactor too!), rank 128. Video at normal speed here! 🔥

Ivan Fioravanti ᯅ

35,282 views • 11 months ago