Running Ring 1T reasoning model on a single M3 Ultra with mlx-lm. It's quantized to 3.5 bits-per-weight. Uses 440GB and generated ~6k tokens at 18.2 toks/sec. Getting closer to GPT-5 at home.

Awni Hannun

44,946 subscribers

55,131 views • 8 months ago • via X (Twitter)

Education, Science & Technology
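
The headline figures in the post can be sanity-checked with simple arithmetic. The sketch below assumes that "1T" means exactly 10^12 parameters and that memory use is dominated by the quantized weights; it ignores the KV cache and activations, so treat the result as a rough estimate rather than a measurement. The function names are illustrative and are not part of mlx-lm.

```python
def quantized_weight_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def generation_seconds(num_tokens: int, tokens_per_sec: float) -> float:
    """Wall-clock time to generate num_tokens at a steady decoding rate."""
    return num_tokens / tokens_per_sec


# Assumption: "1T" is taken as exactly 1e12 parameters.
weights_gb = quantized_weight_gb(1e12, 3.5)
# Figures from the post: ~6k tokens at 18.2 toks/sec.
seconds = generation_seconds(6000, 18.2)

print(f"weights: ~{weights_gb:.1f} GB")
print(f"generation time: ~{seconds / 60:.1f} minutes")
```

Under these assumptions the weights alone come to about 437.5 GB, which is in line with the reported 440GB, and the run takes roughly five and a half minutes.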

Related videos

Qwen3 235B MoE (22B active) runs so fast on an M2 Ultra with mlx-lm.
- 4-bit model uses ~132GB
- Generated 580 tokens at ~28 toks/sec

Awni Hannun

117,763 views • 1 year ago

Kimi K2.5 1T runs on 2 M3 Ultras with mlx-lm in its native precision. It's actually quite usable. Here it's making a space invaders game. Generated 3856 tokens at 21.9 tok/sec using 350GB per machine. Thanks to Tarjei Mandt for the port.

Awni Hannun

241,472 views • 4 months ago

MiniMax M2.1 in 4-bit cruises on an M3 Ultra with mlx-lm. Generated a space invaders game using 5098 tokens at 47.2 tok/sec:

Awni Hannun

93,826 views • 5 months ago

A perfect coding model for MLX on Apple silicon. Qwen delivered again. Runs quite fast on an M3 Ultra. Running the 4-bit quantized model with mlx-lm:

Awni Hannun

186,641 views • 11 months ago

The new DeepSeek V3 0324 in 4-bit runs at > 20 toks/sec on a 512GB M3 Ultra with mlx-lm!

Awni Hannun

168,842 views • 1 year ago

Qwen3-Coder-Flash runs quite fast on an M4 Max with mlx-lm. Running the 4-bit here, generated 4,467 tokens at >107 tokens/sec:

Awni Hannun

196,482 views • 10 months ago

What is this Nanbeige4.1-3B model running at
- 77 toks/s in bf16 (in video)
- 115 toks/s in 8bit
on M3 Ultra with MLX with these benchmark scores! 🔥

Ivan Fioravanti ᯅ

85,228 views • 4 months ago

The new 1-trillion-parameter Kimi K2 Thinking model runs well on 2 M3 Ultras in its native format, with no loss in quality! The model was quantization-aware trained (QAT) at int4. Here it generated ~3500 tokens at 15 toks/sec using pipeline parallelism in mlx-lm:

Awni Hannun

500,714 views • 7 months ago

GLM-5 runs with mlx-lm on a single 512GB M3 Ultra in Q4. It's quite good in my initial testing and pretty fast as well. It generated a highly functional space invaders game using 7.1k tokens at 15.4 tok/s and 419GB memory. Thanks to Gökdeniz Gülmez and Tarjei Mandt for the port.

Awni Hannun

60,446 views • 4 months ago

GLM 4.6 runs quite fast on an M3 Ultra with mlx-lm even at higher precision. Pretty remarkable that it benchmarks competitively with the just-released Sonnet 4.5. Hope those benchmarks hold up in day-to-day use. Here's a run using a 5.5 bpw quantized model, generating 5.3k tokens at 17+ tok/sec using 244 GB. What prompts should I test?

Awni Hannun

68,539 views • 8 months ago

GLM-4.7 runs quite well on an M3 Ultra with mlx-lm, even at near-lossless precision (6-bit here). It generated the best space invaders game I've seen yet for a local model (even included sound effects!). Generated 6600 tokens and ran at 16 tok/s.

Awni Hannun

233,002 views • 5 months ago

The new Kimi K2 1T model (4-bit quant) runs on 2 512GB M3 Ultras with mlx-lm and mx.distributed. 1 trillion params, at a speed that's actually quite usable:

Awni Hannun

238,005 views • 11 months ago

DeepSeek R1 (the 680B MoE) is ~20% faster in the latest mlx / mlx-lm. 4-bit model on 3 M2 Ultras generates 4k tokens at a respectable 15 toks/sec. Plus some QoL improvements:
- Only downloads the local shard (much faster startup)
- Distributed launcher ships with MLX

Awni Hannun

86,921 views • 1 year ago

GLM-4.7-8bit (350GB) running at 19 toks/s on two M3 Ultra 512GB using Tensor Parallelism with EXO - MLX, versus 14 toks/s with a single node. 🚀 Now context benchmarking & then OpenCode tests 🔥 Note: this is built from source; I had to change things to run it.

Ivan Fioravanti ᯅ

327,687 views • 5 months ago

Qwen 32B (4-bit) generates at >40 toks/sec on an M4 Max with assisted decoding and Qwen 0.5B as the draft model. Coming soon to mlx-lm. Compare regular decoding (left) to assisted decoding (right):

Awni Hannun

50,353 views • 1 year ago
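
Assisted (speculative) decoding, as in the Qwen 32B post above, lets a small draft model propose several tokens that the large target model then verifies. The following is a toy sketch of the greedy variant only. The two "models" are hypothetical stand-in functions, and none of this is mlx-lm's implementation or API; it just illustrates why the output matches plain greedy decoding while needing fewer target verification rounds.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # maps a context to its greedy next token


def greedy_decode(model: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Plain decoding: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new):
        tokens.append(model(tokens))
    return tokens


def assisted_decode(
    target: NextToken,
    draft: NextToken,
    prompt: List[int],
    max_new: int,
    num_draft: int = 3,
) -> Tuple[List[int], int]:
    """Greedy assisted decoding; returns (tokens, number of verification rounds)."""
    tokens = list(prompt)
    rounds = 0
    while len(tokens) - len(prompt) < max_new:
        # 1. The cheap draft model proposes num_draft tokens autoregressively.
        proposal: List[int] = []
        for _ in range(num_draft):
            proposal.append(draft(tokens + proposal))
        # 2. The target verifies the proposal. A real system would score all
        #    positions in one batched forward pass; here we call it per position.
        rounds += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(tokens + accepted)
            accepted.append(expected)  # always keep the target's own choice
            if tok != expected:
                break  # first mismatch: discard the rest of the draft
        else:
            # Every draft token matched, so the target adds one bonus token.
            accepted.append(target(tokens + accepted))
        tokens.extend(accepted)
    return tokens[: len(prompt) + max_new], rounds


# Hypothetical toy models: the target counts upward modulo 10, and the draft
# agrees with it except after the token 4.
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


reference = greedy_decode(toy_target, [0], 12)
output, rounds = assisted_decode(toy_target, toy_draft, [0], 12)
print(output == reference, rounds)
```

Because every kept token is the target's own greedy choice, the draft model affects only speed, not the result. The speedup depends on how often the draft agrees with the target.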

A long time coming, but the new mlx-lm is here with better batching support in the server and Gemma 4.

pip install -U mlx-lm

Here is a video where a single M3 Ultra serves 5 opencode sessions with Gemma 4 26B that process ~130k tokens in ~1.5 minutes.

Angelos Katharopoulos

66,095 views • 2 months ago

Qwen3.5 runs quite well in mlx-lm. Awesome that we have a frontier-level hybrid model. The context gets longer but the inference speed and memory use barely change. Here's the Q4 generating a space invaders game on an M3 Ultra. Generated 4,120 tokens at 37.6 tok/s.

Awni Hannun

48,657 views • 4 months ago

First results are in. Llama 4 Maverick 17B active / 400B total is blazing fast with MLX on an M3 Ultra. Here is the 4-bit model generating 1100 tokens at 50 tok/sec:

Awni Hannun

149,847 views • 1 year ago

Kimi K2.6 was released 1h ago, and it looks amazing! Here it's running with MLX (mlx-vlm) on two M3 Ultras (full 1T param VLM) 🔥

Pedro Cuenca

65,682 views • 1 month ago

Reasoning on Apple Vision Pro with Apple MLX and DeepSeek R1 Qwen 7B 4bit! 🔥 14 tokens per sec! 🔥 Note: sorry for the shaking footage, I was excited to see it running 😂

Ivan Fioravanti ᯅ

24,324 views • 1 year ago