Qwen 32B (4-bit) generates at >40 toks/sec on an M4 Max with assisted decoding and Qwen 0.5B as the draft model. Coming soon to mlx-lm. Compare regular decoding (left) to assisted decoding (right):

Awni Hannun

44,793 subscribers

50,353 views • 1 year ago • via X (Twitter)

11 Comments

N8 Programs • 1 year ago

WOW! How does this differ from my speculative decoding impl - what makes it so much faster? Cause this is awesome.

Lab4crypto • 1 year ago

🚀 Don't gamble with your portfolio! Use our advanced hybrid quant risk tool using on/off-chain data daily and make informed decisions. 📈 Access to 1000+ charts for your crypto journey. 📚 Receive free weekly quant analysis. 📊 +21 projects supported. 🏗️ Beginners and experts.

Ivan Fioravanti ᯅ • 1 year ago

Super fast! 💪

Tay • 1 year ago

Assisted decoding?

Awni Hannun • 1 year ago

A small draft model is used to generate tokens, which are then accepted or rejected by the main model depending on certain criteria. In this case the criterion is exact match.
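A rough sketch of the accept/reject loop described in the reply above, assuming greedy decoding on both models. This is not the mlx-lm implementation: `toy_main`, `toy_draft`, `greedy_decode`, `assisted_decode`, and `num_draft` are hypothetical names made up for illustration, and the "models" are plain Python functions standing in for real networks.

```python
from typing import Callable, List

# Hypothetical stand-in for a model: maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def greedy_decode(model: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Regular decoding: one call to the model per generated token."""
    tokens = list(prompt)
    for _ in range(max_new):
        tokens.append(model(tokens))
    return tokens[len(prompt):]


def assisted_decode(main: NextToken, draft: NextToken, prompt: List[int],
                    max_new: int, num_draft: int = 4) -> List[int]:
    """Assisted decoding with an exact-match acceptance rule."""
    tokens = list(prompt)
    n_prompt = len(tokens)
    while len(tokens) - n_prompt < max_new:
        # 1. The small draft model proposes num_draft tokens autoregressively.
        proposal: List[int] = []
        for _ in range(num_draft):
            proposal.append(draft(tokens + proposal))
        # 2. The main model checks each proposed position. A real system would
        #    score all positions in one batched forward pass; this loop only
        #    mirrors the logic, not the speedup.
        for guess in proposal:
            target = main(tokens)
            tokens.append(target)  # the main model's token is always the one kept
            if guess != target:    # exact match failed: discard the rest of the draft
                break
        else:
            # Every draft token matched, so the main model adds one more token.
            tokens.append(main(tokens))
    return tokens[n_prompt:n_prompt + max_new]


# Toy models (assumptions for the demo): the draft agrees with the main model
# except at prefixes whose length is a multiple of 5.
def toy_main(tokens: List[int]) -> int:
    return (tokens[-1] * 3 + 1) % 7


def toy_draft(tokens: List[int]) -> int:
    target = toy_main(tokens)
    return (target + 1) % 7 if len(tokens) % 5 == 0 else target


if __name__ == "__main__":
    print(greedy_decode(toy_main, [2], 12))
    print(assisted_decode(toy_main, toy_draft, [2], 12))
```

With exact match under greedy decoding, every emitted token is the main model's own choice, so the two printed sequences are identical. Any speedup comes only from how many draft tokens get accepted per verification step.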

Caleb • 1 year ago

Super cool 🤩

DS • 1 year ago

Apple intelligence so far: "siri can set an alarm even faster now!"

Mark Lord • 1 year ago

Try with the 2b model, set draft tokens to 31, and modify the wording of the prompt to “Write me a quick sort in C++. Don’t give me a preamble, just immediately write the code.” If it’s anything like my tests, I reckon you’ll squeeze a few more tokens/second 😁

SM • 1 year ago

Impressive! But do you think one can run diffusion model inference on phones?

Sohaib • 1 year ago

Awesome!

Unclecode (Hossein) • 1 year ago

Interesting. It makes sense that it is faster, given the definition of assisted decoding. However, did you try any evals? I wonder what the unpredictable effects of such decoding are.

Related Videos

Having some fun with the new speculative generation feature in LM Studio: - MLX 4-bit Qwen 32B/0.5B draft runs a lot faster for coding tasks than the 32B model alone - Nice to visualize the draft tokens generated:

Awni Hannun

12,642 views • 1 year ago

Qwen 2.5 Coder Q4 M4 Max Inference test. Apple MLX vs Ollama: - MLX: 23.97 toks/sec 🥇🔥 - Ollama: 18.33 toks/sec 🥈 Here a video to show results

Ivan Fioravanti ᯅ

34,881 views • 1 year ago

Made a demo to do test-time-scaling with mlx-lm and a R1-based reasoning model. Same idea as S1: - To force a response, swap "Wait" for "</think>" - To think more, swap "</think>" for "Wait" Runs fast with 4-bit Qwen 32B on an M3 max:

Awni Hannun

59,790 views • 1 year ago

Casual reminder - you can run o1-mini level or better models on your laptop at home. Here's DeepSeek R1 distilled to Qwen 32B running pretty quick on an M4 Max with the MLX back-end in LM Studio

Awni Hannun

55,342 views • 1 year ago

Qwen3-Coder-Flash runs quite fast on an M4 Max with mlx-lm. Running the 4-bit here, generated 4,467 tokens at >107 tokens/sec:

Awni Hannun

196,482 views • 10 months ago

A perfect coding model for MLX on Apple silicon.. Qwen delivered again. Runs quite fast on an M3 Ultra. Running the 4-bit quantized with mlx-lm:

Awni Hannun

186,641 views • 11 months ago

Llama 3.3 70B 4-bit running on a 128GB M4 Max with MLX LM (~12 toks/sec) with mactop on the right. "Good enough", local, private and open models are now more accessible than ever!

Ivan Fioravanti ᯅ

43,929 views • 1 year ago

Llama 3.3 70B 4-bit runs nicely on a 64GB M3 Max with MLX LM (~10 toks/sec). Would be even faster on an M4 Max. Yesterday's server-only 405B is today's laptop 70B:

Awni Hannun

110,375 views • 1 year ago

M4 Mac AI Coding Cluster Uses EXO Labs to run LLMs (here Qwen 2.5 Coder 32B at 18 tok/sec) distributed across 4 M4 Mac Minis (Thunderbolt 5 80Gbps) and a MacBook Pro M4 Max. Local alternative to Cursor (benchmark comparison soon).

Alex Cheema

517,325 views • 1 year ago

The new Deep Seek V3 0324 in 4-bit runs at > 20 toks/sec on a 512GB M3 Ultra with mlx-lm!

Awni Hannun

168,842 views • 1 year ago

Qwen3 235B MoE (22B active) runs so fast on an M2 Ultra with mlx-lm. - 4-bit model uses ~132GB - Generated 580 tokens at ~28 toks/sec

Awni Hannun

117,763 views • 1 year ago

Pretty cool that with the new Qwen 2.5 models you can ask questions / generate using a reasonably sized code-base as context, all running on a laptop with mlx-lm. The 7B runs pretty fast on an M4 Max using the mlx-lm code base (~16k lines) as context:

Awni Hannun

27,442 views • 1 year ago

DeepSeek R1 (the 680B MOE) is ~20% faster in the latest mlx / mlx-lm. 4-bit model on 3 M2 Ultras generates 4k tokens at a respectable 15 toks/sec. Plus some QoL improvements: - Only downloads the local shard (much faster startup) - Distributed launcher ships with MLX

Awni Hannun

86,921 views • 1 year ago

The new Qwen 3.5 by Qwen running on-device on iPhone 17 Pro. Qwen 3.5 beats models 4 times its size, has strong visual understanding, and can toggle reasoning on or off. The 2B 6-bit model here is running with MLX optimized for Apple Silicon.

Adrien Grondin

3,553,263 views • 3 months ago

The latest oMLX version seems to have some decoding issues. 🧐 Here is a test with Qwen3.6-35B-A3B-MLX-6bit on an M5 Max. 🥇 LM Studio MLX 1.8.5 → 100.9 toks/s 🥈 mlx-vlm 0.6.2 → 100.1 toks/s 🥉 oMLX 0.4.2 dev3 → 58.7 toks/s 👀 Avg Gen TPS: oMLX 58.7 → LM Studio 100.9 (+71.8%) I have to thank pymike00, who raised this oMLX issue after seeing my video on using it with Codex. I bet there is a bug in the oMLX chat and server at the moment, because the internal benchmarks are OK (video attached). I bet Jun Kim will fix this soon 💪

Ivan Fioravanti ᯅ

19,756 views • 19 days ago

Nemotron 3 Nano runs nicely with mlx-lm on an M4 Max. Could be a great model for local use on Mac: MoE + hybrid attention make it fast even for very long context. Generating in realtime with 4-bit model:

Awni Hannun

51,029 views • 6 months ago

The latest Qwen 3 VL by Qwen running on iPhone 17 Pro with MLX. Qwen 3 VL brings upgraded visual understanding, recognition, and OCR capabilities without sacrificing text performance like previous models. The 4B model here is close to Qwen 2.5 VL 72B in many benchmarks.

Adrien Grondin

109,700 views • 8 months ago

Geoscan 4 live decoding

supertrack

24,091 views • 4 months ago