Awni Hannun

@awnihannun • 44,906 subscribers

Shorts

1,139,656 views

492,428 views

66,668 views

92,422 views

33,896 views

35,078 views

46,767 views

70,804 views

23,209 views

30,441 views

16,079 views

Videos

sweetdream.ai

Watch Anya Live

Anya is streaming live right now! Join her private show and enjoy exclusive content.

Private Show

Join now for exclusive access

Free preview available • Premium content

500,804 views • 8 months ago

997,198 views • 1 year ago

866,067 views • 1 year ago

241,472 views • 5 months ago

237,149 views • 6 months ago

581,723 views • 2 years ago

141,194 views • 6 months ago

215,529 views • 10 months ago

238,057 views • 1 year ago

196,508 views • 11 months ago

186,641 views • 1 year ago

95,050 views • 6 months ago

93,826 views • 6 months ago

168,891 views • 1 year ago

60,599 views • 5 months ago

149,855 views • 1 year ago

132,388 views • 1 year ago

117,763 views • 1 year ago

48,673 views • 5 months ago

68,539 views • 9 months ago

Live Cam

Awni Hannun

Shorts

DeepSeek R1 distilled to Qwen 1.5B easily runs on my iPhone 16 with MLX swift. Here's the 4-bit model reasoning entirely on device at almost 60 toks/sec:

Llama 3.2 1B in 4-bit runs at ~60 toks/sec with MLX Swift on my iPhone 15 pro. It's quite good and easily runs on-device:

mlx_lm.server + Qwen3 Coder Next 6bit + OpenCode + M3 Ultra = a pretty capable and very fast local coding setup. Not sped up:

Managed to get Ling Mini 16B (1.4B active) running on my iPhone Air. It runs very fast with MLX. It's a DWQ of Ling Mini quantized to 3 bits-per-weight. A 16B model running on an Air at this speed is pretty awesome:

Got continuous batching working with SSMs in mlx-lm. Here's four OpenCode agents simultaneously running Nvidia's Nemotron Nano on 64GB M4 Max. This is a nice model for smaller machines since it's MoE + hybrid attention (small cache).

No-one: But can you do 16 generations on your M4 laptop simultaneously? MLX LM:

Cool video (sped-up) running Mixtral 8x22B on a M3 Max (h/t PSA since I get this question a lot: asitop is the monitoring tool on the right (pip install asitop) glances is on the bottom (pip install glances)

Gotta love MoEs on Apple silicon with MLX. Kimi's new 16B (3B active) Moonshot model runs very nicely on an M4 Max. As good or better than some of the best dense 7Bs and 1.5x faster inference (154 toks/sec!):

MLX Swift LLM example works with: - Mistral / Llama - Phi-2 - Qwen 1.5 - Starcoder 2 Quick-start: Qwen 1.5 0.5B runs pretty fast in 16-bit on my iPhone 14, no quantization needed:

DeepSeek-Prover (4-bit 7B) running at 114 toks/sec in MLX LM on an M2 Ultra Thanks to for the port!

Videos

Watch Anya Live

The new 1 Trillion parameter Kimi K2 Thinking model runs well on 2 M3 Ultras in its native format - no loss in quality! The model was quantization aware trained (qat) at int4. Here it generated ~3500 tokens at 15 toks/sec using pipeline-parallelism in mlx-lm:

DeepSeek R1 (the full 680B model) runs nicely in higher quality 4-bit on 3 M2 Ultras with MLX. Asked it a coding question and it thought for ~2k tokens and generated 3500 tokens overall:

DeepSeek R1 671B running on 2 M2 Ultras faster than reading speed. Getting close to open-source O1, at home, on consumer hardware. With mlx.distributed and mlx-lm, 3-bit quantization (~4 bpw)

Kimi K2.5 1T runs on 2 M3 Ultras with mlx-lm in it's native precision. It's actually quite usable. Here it's making a space invaders game. Generated 3856 tokens at 21.9 tok/sec using 350GB per machine. Thanks to Tarjei Mandt for the port.

GLM-4.7 runs quite well on an M3 Ultra with mlx-lm, even at a near lossless precision (6-bit here). It generated the best space invaders game I've seen yet for a local model (even included sound effects!). Generated 6600 tokens and ran at 16 tok/s.

Next level: QLoRA fine-tuning 4-bit Llama 3 8B on iPhone 15 pro. Incoming (Q)LoRA MLX Swift example by David Koski: works with lot's of models (Mistral, Gemma, Phi-2, etc)

GLM 4.7 Flash is supported in mlx-lm 0.30.3 (h/t Ivan Fioravanti ᯅ) The 4-bit runs fast (43 tok/s generation, ~800 tok/s prefill) on a base M5 32GB laptop.

Running Qwen3 8B thinking on an iPhone Air with MLX. The model is quantized to 4-bit and runs pretty well.

The new Kimi K2 1T model (4-bit quant) runs on 2 512GB M3 Ultras with mlx-lm and mx.distributed. 1 trillion params, at a speed that's actually quite usable:

Qwen3-Coder-Flash runs quite fast on an M4 Max with mlx-lm. Running the 4-bit here, generated 4,467 tokens at &gt;107 tokens/sec:

A perfect coding model for MLX on Apple silicon.. Qwen delivered again. Runs quite fast on an M3 Ultra. Running the 4-bit quantized with mlx-lm:

Running four simultaneous OpenCode agents works well with mlx_lm.server continuous batching and MiniMax M2.1 on an M3 Ultra:

MiniMax M2.1 in 4-bit cruises on an M3 Ultra with mlx-lm. Generated a space invaders game using 5098 tokens at 47.2 tok/sec:

The new Deep Seek V3 0324 in 4-bit runs at &gt; 20 toks/sec on a 512GB M3 Ultra with mlx-lm!

GLM-5 runs with mlx-lm on a single 512GB M3 Ultra in Q4. It's quite good in my initial testing and pretty fast as well. It generated a highly functional space invaders game using 7.1k tokens at 15.4 tok/s and 419GB memory. Thanks to Gökdeniz Gülmez and Tarjei Mandt for the port.

First results are in. Llama 4 Maverick 17B active / 400B total is blazing fast with MLX on an M3 Ultra. Here is the 4-bit model generating 1100 tokens at 50 tok/sec:

QwQ-32B evals on par with Deep Seek R1 680B but runs fast on a laptop. Delivery accepted. Here it is running nicely on a M4 Max with MLX. A snippet of its 8k token long thought process:

Qwen3 235B MoE (22B active) runs so fast on an M2 Ultra with mlx-lm. - 4-bit model uses ~132GB - Generated 580 tokens at ~28 toks/sec

Qwen3.5 runs quite well in mlx-lm. Awesome that we have a frontier-level hybrid model. The context gets longer but the inference speed and memory use barely change. Here's the Q4 generating a space invaders game on an M3 Ultra. Generated 4,120 tokens at 37.6 tok/s.

Qwen3-Coder-Flash runs quite fast on an M4 Max with mlx-lm. Running the 4-bit here, generated 4,467 tokens at >107 tokens/sec:

The new Deep Seek V3 0324 in 4-bit runs at > 20 toks/sec on a 512GB M3 Ultra with mlx-lm!