Awni Hannun

@awnihannun

44,820 subscribers

ow knee

Shorts

DeepSeek R1 distilled to Qwen 1.5B easily runs on my iPhone 16 with MLX Swift. Here's the 4-bit model reasoning entirely on device at almost 60 toks/sec:

1,138,564 views

Llama 3.2 1B in 4-bit runs at ~60 toks/sec with MLX Swift on my iPhone 15 Pro. It's quite good and easily runs on-device:

492,305 views

mlx_lm.server + Qwen3 Coder Next 6bit + OpenCode + M3 Ultra = a pretty capable and very fast local coding setup. Not sped up:

66,619 views

Managed to get Ling Mini 16B (1.4B active) running on my iPhone Air. It runs very fast with MLX. It's a DWQ of Ling Mini quantized to 3 bits-per-weight. A 16B model running on an Air at this speed is pretty awesome:

92,422 views

We (Claude and I) made a new package: voxmlx. It's an MLX implementation of Mistral's Voxtral mini realtime speech recognition model. It supports streaming audio and runs pretty fast on a laptop. To use it, simply run `uvx voxmlx`. Also, I did not write a single line of code for this package; every line was written by Claude Code. More on that in thread.

32,410 views

Got continuous batching working with SSMs in mlx-lm. Here are four OpenCode agents simultaneously running Nvidia's Nemotron Nano on a 64GB M4 Max. This is a nice model for smaller machines since it's MoE + hybrid attention (small cache).

35,078 views

No-one: But can you do 16 generations on your M4 laptop simultaneously? MLX LM:

46,707 views

Cool video (sped-up) running Mixtral 8x22B on an M3 Max (h/t). PSA since I get this question a lot: asitop is the monitoring tool on the right (`pip install asitop`), and glances is on the bottom (`pip install glances`).

70,804 views

Sparsely activated models like MoEs and Apple silicon + MLX are a great match.
- Lots of RAM to hold the entire model in memory (not just the active parameters). For an MoE, each token accesses basically a random subset of the model. Swapping large parts of the model to "disk" from token to token is too slow.
- Comparatively, you don't need as much memory bandwidth. Only a small fraction of the weights are used per token. In the case of DeepSeek V3, 37B of 671B parameters are active, so only ~5.5% of the weights are moved to GPU cache / registers for each token.
(SVG animation made with the help of DeepSeek V2 1210 + MLX on an M2 Ultra)

27,452 views
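
The arithmetic behind the MoE caption above can be sketched in a few lines of Python. The parameter counts (671B total, 37B active) come from the caption. The 4 bits per weight is an assumption for illustration only, and the helper `weight_gigabytes` is a hypothetical name, not part of MLX. The estimate ignores quantization scales, the KV cache, and activations, so treat the numbers as rough lower bounds.

```python
def weight_gigabytes(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes).

    Hypothetical helper: ignores quantization scales/biases and any
    runtime overhead, so it is a lower bound rather than a measurement.
    """
    return num_params * bits_per_weight / 8 / 1e9


total_params = 671e9   # DeepSeek V3 total parameters (from the caption)
active_params = 37e9   # parameters active per token (from the caption)
bits = 4               # ASSUMPTION: 4-bit quantization, for illustration

# Fraction of the weights touched for each generated token.
active_fraction = active_params / total_params

# The whole model must stay resident in RAM, since each token may
# route to a different subset of experts.
resident_gb = weight_gigabytes(total_params, bits)

# Only the active weights have to be read per token, which is what
# bounds the memory bandwidth needed per token.
per_token_gb = weight_gigabytes(active_params, bits)

print(f"active fraction: {active_fraction:.1%}")
print(f"weights resident in RAM: ~{resident_gb:.1f} GB")
print(f"weights read per token: ~{per_token_gb:.1f} GB")
```

Under these assumptions the capacity requirement scales with the total parameter count, while the per-token traffic scales only with the active count. That gap is why a machine with lots of unified memory but moderate bandwidth suits MoEs.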

Gotta love MoEs on Apple silicon with MLX. Kimi's new 16B (3B active) Moonshot model runs very nicely on an M4 Max. As good or better than some of the best dense 7Bs and 1.5x faster inference (154 toks/sec!):

23,209 views

Latest mlx-lm has faster and lower-memory prompt processing! Thanks to causal fused attention from Jagrit Digani, a 4-bit 7B Mistral v3 can process ~30,000 tokens in under a minute on my M4 Max laptop and needs only 8.5GB:

22,156 views

MLX Swift LLM example works with:
- Mistral / Llama
- Phi-2
- Qwen 1.5
- Starcoder 2
Quick-start:
Qwen 1.5 0.5B runs pretty fast in 16-bit on my iPhone 14, no quantization needed:

30,441 views

DeepSeek-Prover (4-bit 7B) running at 114 toks/sec in MLX LM on an M2 Ultra. Thanks to for the port!

16,077 views

Videos