I did it! It works! Using GLM-4.7-4bit with mlx_lm.server and opencode to fix real code locally! 🔥 This is on a single M3 Ultra 512GB; the next step is phase 2, using Tensor Parallelism, and then applying the same changes to exo. Prefill is slow on a single machine, but generation is good.

Ivan Fioravanti ᯅ

18,834 subscribers

44,000 views • 5 months ago • via X (Twitter)

Related videos

GLM-4.7-8bit (350GB) running at 19 toks/s on two M3 Ultra 512GB using Tensor Parallelism with EXO - MLX, versus 14 toks/s on a single node. 🚀 Next: context benchmarking, then OpenCode tests 🔥 Note: this is built from source; I had to change things to run it.

Ivan Fioravanti ᯅ

327,687 views • 5 months ago

Running four high-level OpenCode agents + subagents with mlx_lm.server continuous batching and MiniMax M2.5 (6-bit). Fits easily on a 512GB M3 Ultra. Generation is quite fast. But prefill is still slow compared to cloud servers.

Awni Hannun

25,535 views • 4 months ago

OpenCode + GLM-4.7-Flash 8bit. llama-server vs mlx_lm.server, using one M3 Ultra for each host. Same prompts; you can check the duration of each step in the video. Amazing experience on both sides! Local coding AI models are becoming a reality! 🔥

Ivan Fioravanti ᯅ

47,063 views • 4 months ago

Running MiniMax M2.1 with OpenCode and mlx_lm.server. Works quite well on an M3 Ultra. Once the KV cache is warm, prompt processing is pretty quick, and token generation is very fast.

Awni Hannun

32,329 views • 5 months ago

Running four simultaneous OpenCode agents works well with mlx_lm.server continuous batching and MiniMax M2.1 on an M3 Ultra:

Awni Hannun

95,050 views • 5 months ago

Running GLM-4.7-Flash with OpenCode locally on M4 Max MacBook Pro. 4-bit model runs at 82 tok/sec. Prefill will get ~4x faster with M5 Max MacBook Pro (~28 Jan). EXO will also support disaggregating prefill and decode across devices, e.g. DGX Spark.

Alex Cheema

127,646 views • 4 months ago

GLM-5 runs with mlx-lm on a single 512GB M3 Ultra in Q4. It's quite good in my initial testing and pretty fast as well. It generated a highly functional space invaders game using 7.1k tokens at 15.4 tok/s and 419GB memory. Thanks to Gökdeniz Gülmez and Tarjei Mandt for the port.

Awni Hannun

60,446 views • 4 months ago

I didn't expect DeepSeek v4 PRO (not Flash) to run well on the Mac Studio M3 Ultra with 512GB of RAM. This is 2-bit quantized with the same DwarfStar recipe used for Flash. 433GB GGUF file. 130 t/s prefill, 13 t/s generation. Prefill in the video is low because the prompt is small.

antirez

169,025 views • 1 month ago

Running GLM-4.7-Flash on 4 x M4 Pro Mac Minis using EXO Labs. Uses tensor parallelism with RDMA over Thunderbolt & MLX backend (h/t Awni Hannun). Runs at 100 tok/sec. We're working on optimizing this at EXO Labs. Aiming to hit ~200 tok/sec on this setup soon.

Alex Cheema

62,144 views • 5 months ago

MLX + OpenCode + Qwen3.5-122B-A10B-4bit on M3 Ultra created a great snake game! Works zero-shot. The video is clearly sped up during generation. I generated the prompt using Grok 4.20; it's in the article.

Ivan Fioravanti ᯅ

74,659 views • 3 months ago

For even higher throughput and lower latency: batch generation + tensor parallel with mlx-lm and mlx.distributed. Here it's generating at 63 tok/sec (throughput) with GLM 4.7 in 6-bit and batch size 4 on 4 M3 Ultras:

Awni Hannun

22,721 views • 5 months ago

And after 1 week of work, here is zml/llmd running transparently on TPU with full prefill/decode paged attention. No code change, single flag, as it should be.

Steeve Morin

26,073 views • 9 months ago

DAAUUUMMMM! DeepSeek R1 4-bit on a single Mac Studio 512GB. 18.26 tokens per second with MLX. It took over a minute to load the model, but I sped that up. Generation was great! Thanks, Awni Hannun; MLX is the future.

Austin Vance

172,817 views • 1 year ago

OpenCode + MLX + Qwen3.5-397B-A17B-4bit. The video is at 8x speed, but the goal is to show that it works! This was unimaginable just a few months ago. The MLX team is pushing like crazy and M5 Ultra will do the rest 🚀

Ivan Fioravanti ᯅ

48,692 views • 4 months ago

Running GLM 4.7 Flash (8-bit) with Tensor Parallel / RDMA on 2 M4 Pro Mac Minis at 60 tok/sec. mlx-lm 0.30.5 features huge speedups for GLM 4.7 Flash for long context (h/t N8 Programs & Awni Hannun). M5 Pro (~28 Jan) will have ~4x faster prefill and ~1.3x faster decode.

Alex Cheema

56,555 views • 4 months ago

Marketing will never be the same. I made 10 ads in 10 minutes using ChatGPT's new native image generation feature to create ads, memes, infographics, cheatsheets, and so much more. All it takes is one prompt. Here's how I made each and every ad:

Zain Kahn

35,324 views • 1 year ago

I asked my clawdbot to send a letter to me in the mail… and it actually did it. I gave it a crypto wallet and some USDC to make purchases with, and clawdbot went and read the agent documentation on postalform. It figured out how to draft an order with a PDF of the letter it wrote to me – then it used Stripe’s Purl cli and paid for the order using Stripe's new Machine Payments protocol. This is a huge step forward for real-world agentic task completion and commerce

Gabriel Garrett

36,119 views • 4 months ago

MiniMax-M2.1 running fully local in AWQ 4-bit with the full context window (170 GB VRAM with full context): ~1,000 to ~16,000 tps prefill, ~100 tps generation, with OpenCode. It's doing real work, updating my blog with little steering or specificity. The problem with local LLMs is that they require too much steering, which means babysitting that I don't have time for. MiniMax cracked the cost, intelligence, and speed challenge; I would say this is a top-tier model. I run frontier models like Gemini and it just fails to call tools, in this year lol… I think glm-4.?-air is still needed. We need a viable model at each hardware entry point. A Mac M1 Ultra 192GB? is relatively cheap at 5k, and being able to run this model at 40 tps is a huge societal unlock. Smaller models can be good, but size matters :p

0xSero

23,804 views • 5 months ago

MiniMax M3 support added to mlx-vlm with MSA implementation! 🚀 Tested on M3 Ultra 512GB running at 24 tps with peak memory ~240GB. Now working on optimizing performance and adding a ton of tests 💪 Model is here: PR is here:

Ivan Fioravanti ᯅ

24,155 views • 9 days ago