Awni Hannun

@awnihannun • 44,820 subscribers

ow knee

Shorts

DeepSeek R1 distilled to Qwen 1.5B easily runs on my iPhone 16 with MLX swift. Here's the 4-bit model reasoning entirely on device at almost 60 toks/sec:

1,138,564 views

Llama 3.2 1B in 4-bit runs at ~60 toks/sec with MLX Swift on my iPhone 15 pro. It's quite good and easily runs on-device:

492,305 views

mlx_lm.server + Qwen3 Coder Next 6bit + OpenCode + M3 Ultra = a pretty capable and very fast local coding setup. Not sped up:

66,619 views

Managed to get Ling Mini 16B (1.4B active) running on my iPhone Air. It runs very fast with MLX. It's a DWQ of Ling Mini quantized to 3 bits-per-weight. A 16B model running on an Air at this speed is pretty awesome:

92,422 views

We (Claude and I) made a new package: voxmlx It's an MLX implementation of Mistral's Voxtral mini realtime speech recognition model. It supports streaming audio and runs pretty fast on a laptop. To use it, simply: `uvx voxmlx` Also I did not write a single line of code for this package. Every line of code was written by Claude Code. More on that in thread.

32,410 views

Got continuous batching working with SSMs in mlx-lm. Here's four OpenCode agents simultaneously running Nvidia's Nemotron Nano on 64GB M4 Max. This is a nice model for smaller machines since it's MoE + hybrid attention (small cache).

35,078 views

No-one: But can you do 16 generations on your M4 laptop simultaneously? MLX LM:

46,707 views

Cool video (sped-up) running Mixtral 8x22B on a M3 Max (h/t PSA since I get this question a lot: asitop is the monitoring tool on the right (pip install asitop) glances is on the bottom (pip install glances)

70,804 views

$Sparsely activated models like MOEs and Apple silicon + MLX are a great match. - Lots of RAM to hold the entire model in memory (not just the active parameters). For an MOE at each token you access basically a random subset of the model. Swapping large parts of the model to "disk" from token-to-token is too slow. - Comparatively you don't need as much memory bandwidth. Only a small fraction of the weights are used per token. In the case of DeepSeek v3 37B / 671B are active. So only ~5% of the weights are moved to GPU cache / register for each token. (SVG animation made with the help of DeepSeek V2 1210 + MLX on an M2 Ultra)$

Sparsely activated models like MOEs and Apple silicon + MLX are a great match. - Lots of RAM to hold the entire model in memory (not just the active parameters). For an MOE at each token you access basically a random subset of the model. Swapping large parts of the model to "disk" from token-to-token is too slow. - Comparatively you don't need as much memory bandwidth. Only a small fraction of the weights are used per token. In the case of DeepSeek v3 37B / 671B are active. So only ~5% of the weights are moved to GPU cache / register for each token. (SVG animation made with the help of DeepSeek V2 1210 + MLX on an M2 Ultra)

27,452 views

Gotta love MoEs on Apple silicon with MLX. Kimi's new 16B (3B active) Moonshot model runs very nicely on an M4 Max. As good or better than some of the best dense 7Bs and 1.5x faster inference (154 toks/sec!):

23,209 views

Latest mlx-lm has faster and lower memory prompt processing! Thanks to causal fused attention from Jagrit Digani 7B 4-bit Mistral v3 can do ~30,000 tokens under a minute on my M4 Max laptop and only needs 8.5GB:

22,156 views

MLX Swift LLM example works with: - Mistral / Llama - Phi-2 - Qwen 1.5 - Starcoder 2 Quick-start: Qwen 1.5 0.5B runs pretty fast in 16-bit on my iPhone 14, no quantization needed:

30,441 views

DeepSeek-Prover (4-bit 7B) running at 114 toks/sec in MLX LM on an M2 Ultra Thanks to for the port!

16,077 views

Videos

LIVE

1.2k

Anya Rossi

sweetdream.ai

SweetDream.ai•Sponsored•Livecam

Streaming Now

Watch Anya Live

Anya is streaming live right now! Join her private show and enjoy exclusive content.

HD live stream

Exclusive private shows

1.2k viewers online

Current Status

Live

Private Show

Join now for exclusive access

Free preview available • Premium content

Live Cam

Awni Hannun

Shorts

DeepSeek R1 distilled to Qwen 1.5B easily runs on my iPhone 16 with MLX swift. Here's the 4-bit model reasoning entirely on device at almost 60 toks/sec:

Llama 3.2 1B in 4-bit runs at ~60 toks/sec with MLX Swift on my iPhone 15 pro. It's quite good and easily runs on-device:

mlx_lm.server + Qwen3 Coder Next 6bit + OpenCode + M3 Ultra = a pretty capable and very fast local coding setup. Not sped up:

Managed to get Ling Mini 16B (1.4B active) running on my iPhone Air. It runs very fast with MLX. It's a DWQ of Ling Mini quantized to 3 bits-per-weight. A 16B model running on an Air at this speed is pretty awesome:

Got continuous batching working with SSMs in mlx-lm. Here's four OpenCode agents simultaneously running Nvidia's Nemotron Nano on 64GB M4 Max. This is a nice model for smaller machines since it's MoE + hybrid attention (small cache).

No-one: But can you do 16 generations on your M4 laptop simultaneously? MLX LM:

Cool video (sped-up) running Mixtral 8x22B on a M3 Max (h/t PSA since I get this question a lot: asitop is the monitoring tool on the right (pip install asitop) glances is on the bottom (pip install glances)

Gotta love MoEs on Apple silicon with MLX. Kimi's new 16B (3B active) Moonshot model runs very nicely on an M4 Max. As good or better than some of the best dense 7Bs and 1.5x faster inference (154 toks/sec!):

Latest mlx-lm has faster and lower memory prompt processing! Thanks to causal fused attention from Jagrit Digani 7B 4-bit Mistral v3 can do ~30,000 tokens under a minute on my M4 Max laptop and only needs 8.5GB:

MLX Swift LLM example works with: - Mistral / Llama - Phi-2 - Qwen 1.5 - Starcoder 2 Quick-start: Qwen 1.5 0.5B runs pretty fast in 16-bit on my iPhone 14, no quantization needed:

DeepSeek-Prover (4-bit 7B) running at 114 toks/sec in MLX LM on an M2 Ultra Thanks to for the port!

Videos

Watch Anya Live

The new 1 Trillion parameter Kimi K2 Thinking model runs well on 2 M3 Ultras in its native format - no loss in quality! The model was quantization aware trained (qat) at int4. Here it generated ~3500 tokens at 15 toks/sec using pipeline-parallelism in mlx-lm:

DeepSeek R1 (the full 680B model) runs nicely in higher quality 4-bit on 3 M2 Ultras with MLX. Asked it a coding question and it thought for ~2k tokens and generated 3500 tokens overall:

Kimi K2.5 1T runs on 2 M3 Ultras with mlx-lm in it's native precision. It's actually quite usable. Here it's making a space invaders game. Generated 3856 tokens at 21.9 tok/sec using 350GB per machine. Thanks to Tarjei Mandt for the port.

DeepSeek R1 671B running on 2 M2 Ultras faster than reading speed. Getting close to open-source O1, at home, on consumer hardware. With mlx.distributed and mlx-lm, 3-bit quantization (~4 bpw)

GLM-4.7 runs quite well on an M3 Ultra with mlx-lm, even at a near lossless precision (6-bit here). It generated the best space invaders game I've seen yet for a local model (even included sound effects!). Generated 6600 tokens and ran at 16 tok/s.

GLM 4.7 Flash is supported in mlx-lm 0.30.3 (h/t Ivan Fioravanti ᯅ) The 4-bit runs fast (43 tok/s generation, ~800 tok/s prefill) on a base M5 32GB laptop.

Next level: QLoRA fine-tuning 4-bit Llama 3 8B on iPhone 15 pro. Incoming (Q)LoRA MLX Swift example by David Koski: works with lot's of models (Mistral, Gemma, Phi-2, etc)

Running Qwen3 8B thinking on an iPhone Air with MLX. The model is quantized to 4-bit and runs pretty well.

The new Kimi K2 1T model (4-bit quant) runs on 2 512GB M3 Ultras with mlx-lm and mx.distributed. 1 trillion params, at a speed that's actually quite usable:

Qwen3-Coder-Flash runs quite fast on an M4 Max with mlx-lm. Running the 4-bit here, generated 4,467 tokens at &gt;107 tokens/sec:

A perfect coding model for MLX on Apple silicon.. Qwen delivered again. Runs quite fast on an M3 Ultra. Running the 4-bit quantized with mlx-lm:

Running four simultaneous OpenCode agents works well with mlx_lm.server continuous batching and MiniMax M2.1 on an M3 Ultra:

MiniMax M2.1 in 4-bit cruises on an M3 Ultra with mlx-lm. Generated a space invaders game using 5098 tokens at 47.2 tok/sec:

GLM-5 runs with mlx-lm on a single 512GB M3 Ultra in Q4. It's quite good in my initial testing and pretty fast as well. It generated a highly functional space invaders game using 7.1k tokens at 15.4 tok/s and 419GB memory. Thanks to Gökdeniz Gülmez and Tarjei Mandt for the port.

The new Deep Seek V3 0324 in 4-bit runs at &gt; 20 toks/sec on a 512GB M3 Ultra with mlx-lm!

First results are in. Llama 4 Maverick 17B active / 400B total is blazing fast with MLX on an M3 Ultra. Here is the 4-bit model generating 1100 tokens at 50 tok/sec:

Qwen3.5 runs quite well in mlx-lm. Awesome that we have a frontier-level hybrid model. The context gets longer but the inference speed and memory use barely change. Here's the Q4 generating a space invaders game on an M3 Ultra. Generated 4,120 tokens at 37.6 tok/s.

QwQ-32B evals on par with Deep Seek R1 680B but runs fast on a laptop. Delivery accepted. Here it is running nicely on a M4 Max with MLX. A snippet of its 8k token long thought process:

Qwen3 235B MoE (22B active) runs so fast on an M2 Ultra with mlx-lm. - 4-bit model uses ~132GB - Generated 580 tokens at ~28 toks/sec

Nemotron 3 Nano runs nicely with mlx-lm on an M4 Max. Could be a great model for local use on Mac: MoE + hybrid attention make it fast even for very long context. Generating in realtime with 4-bit model:

Qwen3-Coder-Flash runs quite fast on an M4 Max with mlx-lm. Running the 4-bit here, generated 4,467 tokens at >107 tokens/sec:

The new Deep Seek V3 0324 in 4-bit runs at > 20 toks/sec on a 512GB M3 Ultra with mlx-lm!