Running GLM-4.7-Flash with OpenCode locally on M4 Max MacBook Pro. 4-bit model runs at 82 tok/sec. Prefill will get ~4x faster with M5 Max MacBook Pro (~28 Jan). EXO will also support disaggregating prefill and decode across devices, e.g. DGX Spark.

Alex Cheema

49,036 subscribers

127,646 views • 4 months ago • via X (Twitter)

Education, Science & Technology

Related videos

Running GLM 4.7 Flash (8-bit) with Tensor Parallel / RDMA on 2 M4 Pro Mac Minis at 60 tok/sec. mlx-lm 0.30.5 features huge speedups for GLM 4.7 Flash for long context (h/t N8 Programs & Awni Hannun). M5 Pro (~28 Jan) will have ~4x faster prefill and ~1.3x faster decode.

Alex Cheema

56,555 views • 4 months ago

GLM 4.7 Flash is supported in mlx-lm 0.30.3 (h/t Ivan Fioravanti ᯅ) The 4-bit runs fast (43 tok/s generation, ~800 tok/s prefill) on a base M5 32GB laptop.

Awni Hannun

141,194 views • 4 months ago

Running GLM-4.7-Flash on 4 x M4 Pro Mac Minis using EXO Labs. Uses tensor parallelism with RDMA over Thunderbolt & MLX backend (h/t Awni Hannun). Runs at 100 tok/sec. We're working on optimizing this at EXO Labs. Aiming to hit ~200 tok/sec on this setup soon.

Alex Cheema

62,144 views • 4 months ago

GLM-5.1-478B-NVFP4 Running on:
- 4x RTX Pro 6000
- Sglang
- 370,000 max tokens (1.75x full context)
- p10 27.7 | p90 45.6 tok/s decode (gen)
- 1340 tok/s prefill

I could get 2x decode if I limit to 64k context (100 tok/s). In this video it operates Figma (:

0xSero

74,772 views • 1 month ago

M5 Max MacBook Pro performance is no joke. Up to 4.8× faster 3D rendering vs M1 Max and ~1.4× over M4 Max, plus up to 5.4× faster effects rendering in DaVinci Resolve. Can’t wait to pair this with M5 Vision Pro.

Justin Ryan ᯅ

13,634 views • 3 months ago

M4 Mac AI Coding Cluster Uses EXO Labs to run LLMs (here Qwen 2.5 Coder 32B at 18 tok/sec) distributed across 4 M4 Mac Minis (Thunderbolt 5 80Gbps) and a MacBook Pro M4 Max. Local alternative to Cursor (benchmark comparison soon).

Alex Cheema

517,028 views • 1 year ago

DeepSeek v4 PRO running via SSD streaming on my 128GB MacBook m5 max. 1.6 trillion parameters.

antirez

265,321 views • 13 days ago

WIP: First attempt to speed up prefill for Flash-MoE. Original repo did token-by-token without streamed experts.
Added: Batched linear attention + batched full attention (Flash Attention style) with custom Metal kernels.
- Without experts: 6.2x faster prefill (11 -> 68 tok/s)
- With experts at full-attn layers only: 1.9x faster (11 -> 20.5 tok/s), same output quality

Qwen3.5-397B, 4-bit, 209GB, M5 Max 128GB (1/3)

Anemll

19,560 views • 2 months ago

The new MacBook Pro M5 Pro/Max has been announced! → Should I get it or not? 🤔 (I'm on a MacBook Pro M1 Max)

Basti Ui ✌️

38,118 views • 3 months ago

Running Llama-3-70B at home with exo intern Combines the compute of all these devices to make one big GPU:
- iPhone 15 Pro Max
- iPad Pro M4
- Galaxy S24 Ultra
- MacBook Pro M2 and M3 Pro
- 2 x MSI NVIDIA GeForce RTX 4090 SUPRIM

Code is open source 👇

Alex Cheema

197,476 views • 1 year ago

Qwen3-Coder-Flash runs quite fast on an M4 Max with mlx-lm. Running the 4-bit here, generated 4,467 tokens at >107 tokens/sec:

Awni Hannun

196,482 views • 10 months ago

I did it! It works! Using GLM-4.7-4bit with mlx_lm.server and opencode to fix real code locally! 🔥 Here on a single M3 Ultra 512GB; the next phase will be 2 machines using Tensor Parallelism, and then applying the same changes to exo. Prefill is slow on a single machine, but generation is good.

Ivan Fioravanti ᯅ

44,000 views • 5 months ago

mlx_lm server worked flawlessly with Qwen3.6-35B-A3B-8bit, and on M5 Max, the much faster prefill, gives a very pleasant coding experience. Here two OpenCode instances working on mlx_lm and mlx_vlm source code. Video in normal speed.

Ivan Fioravanti ᯅ

19,631 views • 1 month ago

DS4 running on DGX Spark (GB10 / CUDA), private branch for now. 12 tokens/sec; memory bandwidth is limited on this system, at 270GB/sec. But prefill is way more aligned to M3 Max at ~200 t/s. I'll release it when it is more mature, but it is almost certain that it will get merged.

antirez

83,448 views • 1 month ago

"YE3" - pretty heavy scene running real-time on my base M4 Pro Macbook. These things are beasts for 3D work already. Imagine an M5 Max + 128 GB… new era for 3D graphics? Apple

Marcel Deneuve

17,068 views • 3 months ago

This actually makes Gemma 4 26B-4A usable for a coding agent @ 72tk/s on my MacBook Pro M1 Max. This video is realtime, running completely locally.

Kyle Howells

117,577 views • 6 days ago

What does it take to run 3, 5, or even 10 concurrent instances of Gemma 4 locally? We've open-sourced a demo letting you run multiple models side-by-side on your hardware. Gemma 4 26B A4B easily runs 10+ concurrent requests on a MacBook Pro M4 Max at 18 tokens/sec per request.

Google Gemma

911,885 views • 1 month ago

Nemotron-3-Ultra running on 4x 6000s edits my latest demo video..
- 75 tok/s decode
- 8x concurrency
- 256k context
- 899 tok/s prefill
- 20k tok/s prefill cache
- NVFP4

Setting it up to be my Hermes driver. It's good enough at most things and doesn't talk like a moron.

0xSero

15,674 views • 12 days ago

Everyone's comparing the DGX Spark to a 5090 and calling it slow. I think that's the wrong comparison. I ran Qwen3.6 35B-A3B FP8 with the full 262K context window enabled (~96GB RAM), something gaming GPUs can't really do. Results:
🟢 No context: 51.3 tok/s, TTFT 110ms
🟣 200K prefill: 34.6 tok/s, TTFT 85s (~2,341 tok/s prefill)

Prefill is way faster than a Mac. And 35 tok/s deep into 200K context, on a model this strong, is genuinely usable. The Spark plays a different game.

stevibe

33,301 views • 1 month ago
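
The TTFT and prefill figures in stevibe's post above are roughly consistent with each other if time to first token is dominated by prefill. A minimal sketch of that arithmetic follows. It assumes "200K" means about 200,000 prompt tokens (the post does not give the exact count), and the function name `ttft_seconds` is hypothetical, used only for illustration:

```python
def ttft_seconds(prompt_tokens: int, prefill_tok_per_s: float) -> float:
    """Estimate time to first token, assuming prefill dominates the wait."""
    if prefill_tok_per_s <= 0:
        raise ValueError("prefill throughput must be positive")
    return prompt_tokens / prefill_tok_per_s


# Assumption: "200K" is treated as 200,000 prompt tokens.
estimate = ttft_seconds(200_000, 2341)
print(f"{estimate:.1f} s")  # prints "85.4 s"
```

The estimate lands close to the reported 85 s; the small gap is expected, since the exact prompt length is not stated.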