I did it! It works! Using GLM-4.7-4bit with mlx_lm.server and OpenCode to fix real code locally! 🔥 This is on a single M3 Ultra 512GB; the next phase will be two machines using Tensor Parallelism, and then applying the same changes to exo. Prefill is slow on a single machine, but generation is good.

Ivan Fioravanti ᯅ

18,834 subscribers

44,000 views • 5 months ago • via X (Twitter)

Related videos

GLM-4.7-8bit (350GB) running at 19 toks/s on two M3 Ultra 512GB using Tensor Parallelism with EXO - MLX, versus 14 toks/s with single node. 🚀 Now context benchmarking & then OpenCode tests 🔥 Note: this is from sources, I had to change things to run it.

Ivan Fioravanti ᯅ

327,687 views • 5 months ago

Running four high-level OpenCode agents + subagents with mlx_lm.server continuous batching and MiniMax M2.5 (6-bit). Fits easily on a 512GB M3 Ultra. Generation is quite fast. But prefill is still slow compared to cloud servers.

Awni Hannun

25,535 views • 4 months ago

OpenCode + GLM-4.7-Flash 8bit. llama-server vs mlx_lm.server, using one M3 Ultra for each host. Same prompts; you can check the duration of each step in the video. Amazing experience on both sides! Local coding AI models are becoming reality! 🔥

Ivan Fioravanti ᯅ

47,063 views • 5 months ago

Running MiniMax M2.1 with OpenCode and mlx_lm.server. Works quite well on an M3 Ultra. Once the KV cache is warm the prompt processing is pretty quick. And token generation is very fast.

Awni Hannun

32,329 views • 5 months ago

First kinda working implementation of GLM 5.2 in DwarfStar. Will take some time to be good enough, but it is a promising start. 433 GB GGUF file. M3 Ultra 512GB.

antirez

68,870 views • 7 days ago

Running four simultaneous OpenCode agents works well with mlx_lm.server continuous batching and MiniMax M2.1 on an M3 Ultra:

Awni Hannun

95,050 views • 5 months ago

Running GLM-4.7-Flash with OpenCode locally on M4 Max MacBook Pro. 4-bit model runs at 82 tok/sec. Prefill will get ~4x faster with M5 Max MacBook Pro (~28 Jan). EXO will also support disaggregating prefill and decode across devices, e.g. DGX Spark.

Alex Cheema

127,902 views • 5 months ago

GLM-5 runs with mlx-lm on a single 512GB M3 Ultra in Q4. It's quite good in my initial testing and pretty fast as well. It generated a highly functional space invaders game using 7.1k tokens at 15.4 tok/s and 419GB memory. Thanks to Gökdeniz Gülmez and Tarjei Mandt for the port.

Awni Hannun

60,446 views • 4 months ago

MLX GLM 5.2 Distributed on two M3 Ultra 512GB 🔥
One M3 Ultra: 18.8 tokens/sec
Two M3 Ultra: 23.4 tokens/sec
Context:
- PR by Pedro Cuenca is still open and probably there is room for improvement:
- basic generation test to measure decoding performance here, I will do a full context benchmarking once PR is more mature
- nvfp4 quantization used
- Video alternates standard speed and x20, with one Mac first and distributed later.
Enjoy! 🙌🏻

Ivan Fioravanti ᯅ

86,488 views • 7 days ago

I didn't expect DeepSeek v4 PRO (not Flash) to run well on the Mac Studio M3 Ultra with 512GB of RAM. This is 2-bit quantized with the same DwarfStar recipe used for Flash. 433GB GGUF file. 130 t/s prefill, 13 t/s generation. Prefill in the video is low because the prompt is small.

antirez

169,025 views • 1 month ago

Running GLM-4.7-Flash on 4 x M4 Pro Mac Minis using EXO Labs. Uses tensor parallelism with RDMA over Thunderbolt & MLX backend (h/t Awni Hannun). Runs at 100 tok/sec. We're working on optimizing this at EXO Labs. Aiming to hit ~200 tok/sec on this setup soon.

Alex Cheema

62,144 views • 5 months ago

MLX + OpenCode + Qwen3.5-122B-A10B-4bit on M3 Ultra created a great snake game! Works zero-shot. The video is clearly sped up during generation. I generated the prompt using Grok 4.20; it's in the article.

Ivan Fioravanti ᯅ

74,659 views • 4 months ago

For even higher throughput and lower latency: batch generation + tensor parallel with mlx-lm and mlx.distributed. Here it's generating at 63 tok/sec (throughput) with GLM 4.7 in 6-bit and batch size 4 on 4 M3 Ultras:

Awni Hannun

22,721 views • 6 months ago

And after 1 week of work, here is zml/llmd running transparently on TPU with full prefill/decode paged attention. No code change, single flag, as it should be.

Steeve Morin

26,073 views • 10 months ago

DAAUUUMMMM! DeepSeek R1 4-bit on a single Mac Studio 512GB. 18.26 tokens per second with MLX. Took over a minute to load the model, but I sped that up in the video. Generation was great! Thanks Awni Hannun, MLX is the future.

Austin Vance

172,817 views • 1 year ago

OpenCode + MLX + Qwen3.5-397B-A17B-4bit. Video is 8x, but the goal is showing that it works! This was unimaginable just a few months ago. The MLX team is pushing like crazy and M5 Ultra will do the rest 🚀

Ivan Fioravanti ᯅ

48,692 views • 4 months ago

Running GLM 4.7 Flash (8-bit) with Tensor Parallel / RDMA on 2 M4 Pro Mac Minis at 60 tok/sec. mlx-lm 0.30.5 features huge speedups for GLM 4.7 Flash for long context (h/t N8 Programs & Awni Hannun). M5 Pro (~28 Jan) will have ~4x faster prefill and ~1.3x faster decode.

Alex Cheema

56,555 views • 5 months ago

Marketing will never be the same. I made 10 ads in 10 minutes using ChatGPT's new native image generation feature to create:
• Ads
• Memes
• Infographics
• Cheatsheets
• And so much more
All it takes is one prompt. Here's how I made each and every single ad:

Zain Kahn

35,324 views • 1 year ago

I asked my clawdbot to send a letter to me in the mail… and it actually did it. I gave it a crypto wallet and some USDC to make purchases with, and clawdbot went and read the agent documentation on postalform. It figured out how to draft an order with a PDF of the letter it wrote to me – then it used Stripe’s Purl cli and paid for the order using Stripe's new Machine Payments protocol. This is a huge step forward for real-world agentic task completion and commerce

Gabriel Garrett

36,119 views • 4 months ago