Just for fun, here's what 32 simultaneous long-context generations with Qwen3 Next 80B look like on an M3 Ultra, using the new batch generation in mlx-lm. Context size for each is about 5k tokens:

Awni Hannun

35,290 subscribers

50,272 views • 9 months ago • via X (Twitter)

Science & Technology

Related videos

MiniMax M2.1 in 4-bit cruises on an M3 Ultra with mlx-lm. Generated a space invaders game using 5098 tokens at 47.2 tok/sec:

Awni Hannun

93,826 views • 6 months ago

Pretty cool that with the new Qwen 2.5 models you can ask questions / generate using a reasonably sized code-base as context, all running on a laptop with mlx-lm. The 7B runs pretty fast on an M4 Max using the mlx-lm code base (~16k lines) as context:

Awni Hannun

27,442 views • 1 year ago

Qwen3.5 runs quite well in mlx-lm. Awesome that we have a frontier-level hybrid model. The context gets longer but the inference speed and memory use barely change. Here's the Q4 generating a space invaders game on an M3 Ultra. Generated 4,120 tokens at 37.6 tok/s.

Awni Hannun

48,673 views • 4 months ago

MLX GLM 5.2 Distributed on two M3 Ultra 512GB 🔥
One M3 Ultra: 18.8 tokens/sec
Two M3 Ultra: 23.4 tokens/sec
Context:
- PR by Pedro Cuenca is still open and probably there is room for improvement:
- basic generation test to measure decoding performance here, I will do a full context benchmarking once PR is more mature
- nvfp4 quantization used
- Video alternates standard speed and x20, with one Mac first and distributed later.
Enjoy! 🙌🏻

Ivan Fioravanti ᯅ

86,488 views • 6 days ago

A perfect coding model for MLX on Apple silicon. Qwen delivered again. Runs quite fast on an M3 Ultra. Running the 4-bit quantized with mlx-lm:

Awni Hannun

186,641 views • 11 months ago

A long time coming but new mlx-lm is here with better batching support in the server and Gemma 4. pip install -U mlx-lm Here is a video where a single M3 Ultra serves 5 opencode sessions with Gemma 4 26B that process ~130k tokens in ~1.5 minutes.

Angelos Katharopoulos

66,128 views • 2 months ago

Qwen3 235B MoE (22B active) runs so fast on an M2 Ultra with mlx-lm.
- 4-bit model uses ~132GB
- Generated 580 tokens at ~28 toks/sec

Awni Hannun

117,763 views • 1 year ago

Qwen3-Coder-Flash runs quite fast on an M4 Max with mlx-lm. Running the 4-bit here, generated 4,467 tokens at >107 tokens/sec:

Awni Hannun

196,482 views • 11 months ago

DeepSeek-R1-0528-5bit on MLX pushing M3 Ultra 512GB to its limits! 501GB used mem visible on mactop in the video! Context: 4K tokens Prompt: 190.29 t/s Gen: 11.37 t/s Peak Mem: 487.48 GB! THIS IS APPLE MLX!

Ivan Fioravanti ᯅ

102,949 views • 1 year ago

GLM 4.6 runs quite fast on an M3 Ultra with mlx-lm even at higher precision. Pretty remarkable that it benchmarks competitively with the just-released Sonnet 4.5. Hope those benchmarks hold up in day-to-day use. Here's a run using a 5.5 bpw quantized model, generating 5.3k tokens at 17+ tok/sec using 244 GB. What prompts should I test?

Awni Hannun

68,539 views • 9 months ago

The new DeepSeek V3 0324 in 4-bit runs at > 20 toks/sec on a 512GB M3 Ultra with mlx-lm!

Awni Hannun

168,842 views • 1 year ago

Nemotron 3 Nano runs nicely with mlx-lm on an M4 Max. Could be a great model for local use on Mac: MoE + hybrid attention make it fast even for very long context. Generating in realtime with 4-bit model:

Awni Hannun

51,029 views • 6 months ago

This is MiniMax-M2.5 MLX running in LM Studio on an Apple Mac Studio M3 Ultra 512GB. Fast enough out of the box for hosting OpenClaw, n8n workflows, and Open WebUI for the team.

Patrick J Kennedy

73,547 views • 4 months ago

LM Studio is the most popular way to run open-source LLMs on your own hardware. Your Hermes Agent now runs natively on LM Studio: auto-discovering your models, loading them on demand with the right context size, and using the right reasoning level for each model.

Nous Research

185,167 views • 1 month ago

For context, this is what the march usually looks like.

inqilāb

1,620,157 views • 2 years ago

Having some fun with the new speculative generation feature in LM Studio:
- MLX 4-bit Qwen 32B/0.5B draft runs a lot faster for coding tasks than the 32B model alone
- Nice to visualize the draft tokens generated:

Awni Hannun

12,642 views • 1 year ago

🔥1-min Interactive Video Generation with Multimodal Control🔥 Towards *long-context world model*, #LongVie is an end-to-end autoregressive framework for controllable ultra-long video generation - Page: - Paper: . Thanks AK

Ziwei Liu

14,185 views • 10 months ago

Qwen3-Next (thinking & non-thinking) are now live in BF16 at Hyperbolic! Qwen3-Next is a huge efficiency leap:
- 80B MoE with just 3B active params
- 10x cheaper to train vs Qwen3-32B
- 10x inference throughput for >32K tokens
Proud to be a launch partner with Qwen. Kudos to this amazing team for continuing to push open-source AI forward. We’re the first to serve Qwen3-Next on Hugging Face. Give it a try!

Yuchen Jin

84,342 views • 9 months ago

GLM-5 runs with mlx-lm on a single 512GB M3 Ultra in Q4. It's quite good in my initial testing and pretty fast as well. It generated a highly functional space invaders game using 7.1k tokens at 15.4 tok/s and 419GB memory. Thanks to Gökdeniz Gülmez and Tarjei Mandt for the port.

Awni Hannun

60,446 views • 4 months ago

First results are in. Llama 4 Maverick 17B active / 400B total is blazing fast with MLX on an M3 Ultra. Here is the 4-bit model generating 1100 tokens at 50 tok/sec:

Awni Hannun

149,855 views • 1 year ago
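
Several of the posts above quote token counts, decode rates, and memory footprints. The sketch below is a rough way to sanity-check such figures, not anything taken from mlx-lm itself. The helper names are hypothetical, and the memory estimate rests on assumptions that do not appear in the posts: 64 weights per quantization group, one 16-bit scale and one 16-bit bias stored per group, exactly 235e9 parameters, and 1 GB = 1e9 bytes.

```python
"""Back-of-the-envelope checks for figures quoted in the posts above."""


def generation_seconds(tokens: int, tokens_per_second: float) -> float:
    """Decode time implied by a token count and a decode rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return tokens / tokens_per_second


def quantized_weight_gb(
    n_params: float,
    bits: int = 4,
    group_size: int = 64,               # ASSUMPTION: weights per quantization group
    overhead_bits_per_group: int = 32,  # ASSUMPTION: 16-bit scale + 16-bit bias
) -> float:
    """Approximate size of the quantized weights in GB (1e9 bytes).

    Ignores the KV cache, activations, and any layers kept at
    higher precision, so real memory use will be higher.
    """
    bits_per_weight = bits + overhead_bits_per_group / group_size
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    # MiniMax M2.1 post: 5098 tokens at 47.2 tok/sec.
    print(f"decode time: {generation_seconds(5098, 47.2):.0f} s")
    # Qwen3 235B post: 4-bit model, parameter count taken as exactly 235e9.
    print(f"weights: {quantized_weight_gb(235e9):.0f} GB")
```

Under these assumptions the 235B estimate comes to about 132 GB, in line with the ~132GB quoted in the Qwen3 235B post, and the 5098-token generation works out to a little under two minutes of decoding.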