正在加载视频...

视频加载失败

加载此视频时出现问题。这可能是由于临时网络问题，或视频可能不可用。

Running four high-level OpenCode agents + subagents with mlx_lm.server continuous batching and MiniMax M2.5 (6-bit). Fits easily on a 512GB M3 Ultra. Generation is quite fast. But prefill is still slow compared to cloud servers.

Awni Hannun

38,393 subscribers

25,535 次观看 • 5 个月前 •via X (Twitter)

Anya Rossi• Live Now

Private livecam show

0 条评论

暂无评论

原始帖子的评论将显示在这里

相关视频

Running four simultaneous OpenCode agents works well with mlx_lm.server continuous batching and MiniMax M2.1 on an M3 Ultra:

Running four simultaneous OpenCode agents works well with mlx_lm.server continuous batching and MiniMax M2.1 on an M3 Ultra:

Awni Hannun

95,092 次观看 • 6 个月前

Running Minimax M2.1 (MiniMax (official)) with OpenCode (OpenCode) and mlx_lm.server. Works quite well on an M3 Ultra. Once the KV cache is warm the prompt processing is pretty quick. And token generation is very fast.

Running Minimax M2.1 (MiniMax (official)) with OpenCode (OpenCode) and mlx_lm.server. Works quite well on an M3 Ultra. Once the KV cache is warm the prompt processing is pretty quick. And token generation is very fast.

Awni Hannun

32,329 次观看 • 6 个月前

I did it! It works! Using GLM-4.7-4bit with mlx_lm.server and opencode to fix real code locally! 🔥 Here single M3 Ultra 512GB, nex step phase will be 2 using Tensor Parallelism and then apply same changes to exo. Prefill is slow on a single machine, but generation is good.

I did it! It works! Using GLM-4.7-4bit with mlx_lm.server and opencode to fix real code locally! 🔥 Here single M3 Ultra 512GB, nex step phase will be 2 using Tensor Parallelism and then apply same changes to exo. Prefill is slow on a single machine, but generation is good.

Ivan Fioravanti ᯅ

44,000 次观看 • 6 个月前

This is MiniMax-M2.5 MLX running in LM Studio on an Apple Mac Studio M3 Ultra 512GB. Fast enough out of the box for hosting OpenClaw, n8n workflows, and Open WebUI for the team.

This is MiniMax-M2.5 MLX running in LM Studio on an Apple Mac Studio M3 Ultra 512GB. Fast enough out of the box for hosting OpenClaw, n8n workflows, and Open WebUI for the team.

Patrick J Kennedy

73,547 次观看 • 5 个月前

MiniMax M3 support added to mlx-vlm with MSA implementation! 🚀 Tested on M3 Ultra 512GB running at 24 tps with peak memory ~240GB. Now working on optimizing performance and adding ton of tests 💪 Model is here: PR is here:

MiniMax M3 support added to mlx-vlm with MSA implementation! 🚀 Tested on M3 Ultra 512GB running at 24 tps with peak memory ~240GB. Now working on optimizing performance and adding ton of tests 💪 Model is here: PR is here:

Ivan Fioravanti ᯅ

24,376 次观看 • 1 个月前

I didn't expect DeepSeek v4 PRO (not Flash) to run well on the Mac Studio M3 Ultra with 512GB of RAM. This is 2 bit quantized with the same DwarfStar recipe used for Flash. 433GB GGUF file. 130 t/s prefill, 13 t/s generation. Prefill in the video is low because small prompt.

I didn't expect DeepSeek v4 PRO (not Flash) to run well on the Mac Studio M3 Ultra with 512GB of RAM. This is 2 bit quantized with the same DwarfStar recipe used for Flash. 433GB GGUF file. 130 t/s prefill, 13 t/s generation. Prefill in the video is low because small prompt.

antirez

169,897 次观看 • 2 个月前

MLX GLM 5.2 Distributed on two M3 Ultra 512GB 🔥 One M3 Ultra: 18.8 tokens/sec Two M3 Ultra: 23.4 tokens/sec Context: - PR by Pedro Cuenca is still open and probably there is room for improvement: - basic generation test to measure decoding performance here, I will do a full context benchmarking once PR is more mature - nvfp4 quantization used - Video alternates standard speed and x20, with one Mac first and distributed later. Enjoy! 🙌🏻

MLX GLM 5.2 Distributed on two M3 Ultra 512GB 🔥 One M3 Ultra: 18.8 tokens/sec Two M3 Ultra: 23.4 tokens/sec Context: - PR by Pedro Cuenca is still open and probably there is room for improvement: - basic generation test to measure decoding performance here, I will do a full context benchmarking once PR is more mature - nvfp4 quantization used - Video alternates standard speed and x20, with one Mac first and distributed later. Enjoy! 🙌🏻

Ivan Fioravanti ᯅ

87,805 次观看 • 1 个月前

A long time coming but new mlx-lm is here with better batching support in the server and Gemma 4. pip install -U mlx-lm Here is a video where a single M3 Ultra serves 5 opencode sessions with Gemma 4 26B that process ~130k tokens in ~1.5 minutes.

A long time coming but new mlx-lm is here with better batching support in the server and Gemma 4. pip install -U mlx-lm Here is a video where a single M3 Ultra serves 5 opencode sessions with Gemma 4 26B that process ~130k tokens in ~1.5 minutes.

Angelos Katharopoulos

66,202 次观看 • 3 个月前

GLM-4.7-8bit (350GB) running at 19 toks/s on two M3 Ultra 512GB using Tensor Parallelism with EXO - MLX, versus 14 toks/s with single node. 🚀 Now context benchmarking & then OpenCode tests 🔥 Note: this is from sources, I had to change things to run it.

GLM-4.7-8bit (350GB) running at 19 toks/s on two M3 Ultra 512GB using Tensor Parallelism with EXO - MLX, versus 14 toks/s with single node. 🚀 Now context benchmarking & then OpenCode tests 🔥 Note: this is from sources, I had to change things to run it.

Ivan Fioravanti ᯅ

327,687 次观看 • 6 个月前

GLM-5.2 8bit running on two M3 Ultra 512GB with MLX distributed? Here it is! 🚀 Decode speed: 17.9 tokens/sec 🔥 Memory used: ~ 760GB 👀 Again keep in mind it's a preliminary PR by super Pedro Cuenca still a WIP!

GLM-5.2 8bit running on two M3 Ultra 512GB with MLX distributed? Here it is! 🚀 Decode speed: 17.9 tokens/sec 🔥 Memory used: ~ 760GB 👀 Again keep in mind it's a preliminary PR by super Pedro Cuenca still a WIP!

Ivan Fioravanti ᯅ

34,140 次观看 • 1 个月前

MiniMax M2.5 > Claude Opus 4.6 I have been using it for a couple of days and have never hit the limit. I am currently on the Coding Plus plan which gives me 300 prompts per 5 hours. My new setup is OpenCode + MiniMax M2.5 What I have noticed: M2.5 – handles multi-step coding workflows and repo-level reasoning more reliably – better at chaining actions. think: generate test, fix bug, refactor, repeat – high TPS throughput M2.5 hits ~80%+ on SWE-Bench coding benchmarks on par with Opus and runs these tasks 37% faster than its predecessor while costing ~1/10th - 1/20th the price. Check out the video.

MiniMax M2.5 > Claude Opus 4.6 I have been using it for a couple of days and have never hit the limit. I am currently on the Coding Plus plan which gives me 300 prompts per 5 hours. My new setup is OpenCode + MiniMax M2.5 What I have noticed: M2.5 – handles multi-step coding workflows and repo-level reasoning more reliably – better at chaining actions. think: generate test, fix bug, refactor, repeat – high TPS throughput M2.5 hits ~80%+ on SWE-Bench coding benchmarks on par with Opus and runs these tasks 37% faster than its predecessor while costing ~1/10th - 1/20th the price. Check out the video.

Pratham

184,640 次观看 • 4 个月前

Minimax M3 is excellent at SVG generation, reaching close to Gemini 3.5 Flash levels and beating Opus 4.7 on SVG-Bench. With 1M context, native multimodality, strong agentic/coding ability and open weights coming soon, the closed-source moat is thinning fast. Full Video:

Minimax M3 is excellent at SVG generation, reaching close to Gemini 3.5 Flash levels and beating Opus 4.7 on SVG-Bench. With 1M context, native multimodality, strong agentic/coding ability and open weights coming soon, the closed-source moat is thinning fast. Full Video:

WorldofAI

16,499 次观看 • 1 个月前

GLM-5 runs with mlx-lm on a single 512GB M3 Ultra in Q4. It's quite good in my initial testing and pretty fast as well. It generated a highly functional space invaders game using 7.1k tokens at 15.4 tok/s and 419GB memory. Thanks to Gökdeniz Gülmez and Tarjei Mandt for the port.

GLM-5 runs with mlx-lm on a single 512GB M3 Ultra in Q4. It's quite good in my initial testing and pretty fast as well. It generated a highly functional space invaders game using 7.1k tokens at 15.4 tok/s and 419GB memory. Thanks to Gökdeniz Gülmez and Tarjei Mandt for the port.

Awni Hannun

60,599 次观看 • 5 个月前

MiniMax-M2.1 running fully local in AWQ-4Bit with full context window (170 GB VRAM w full context) - 1000~ to 16,000~ tps prefill - 100~ tps generation speeds - Opencode It’s doing real work, updating my blog with little steering or specificity. The problem with local LLMs is that they require too much steering, this means baby sitting which I don’t have the time to do MiniMax cracked the cost, intelligence, and speed challenge, I would say this is a top tier model. I run frontier models like Gemini and it just fails to call tools, in this year lol… ——————— I think glm-4.?-air is needed still. We need a viable model at each hardware entry point, a Mac M1 Ultra 192GB? is relatively cheap 5k to be able to run this model at 40 tps is a huge societal unlock. Smaller models can be good but size matters :p

MiniMax-M2.1 running fully local in AWQ-4Bit with full context window (170 GB VRAM w full context) - 1000~ to 16,000~ tps prefill - 100~ tps generation speeds - Opencode It’s doing real work, updating my blog with little steering or specificity. The problem with local LLMs is that they require too much steering, this means baby sitting which I don’t have the time to do MiniMax cracked the cost, intelligence, and speed challenge, I would say this is a top tier model. I run frontier models like Gemini and it just fails to call tools, in this year lol… ——————— I think glm-4.?-air is needed still. We need a viable model at each hardware entry point, a Mac M1 Ultra 192GB? is relatively cheap 5k to be able to run this model at 40 tps is a huge societal unlock. Smaller models can be good but size matters :p

0xSero

23,836 次观看 • 6 个月前

Agents are joining us at work -- coding, writing, design. But how do they actually work, especially compared to humans? Their workflows tell a different story: They code everything, slow down human flows, and deliver low-quality work fast. Yet when teamed with humans, they shine on easily programmable steps.

Agents are joining us at work -- coding, writing, design. But how do they actually work, especially compared to humans? Their workflows tell a different story: They code everything, slow down human flows, and deliver low-quality work fast. Yet when teamed with humans, they shine on easily programmable steps.

Zora Wang

94,255 次观看 • 8 个月前

NVIDIA sent us 2 DGX Sparks. For a while we wondered what we would do with them. The memory bandwidth is 273GB/s making it 3x slower than an M3 Ultra (819GB/s) for batch_size=1 inference. But it has 4x more FLOPS (100 TFLOPS compared to 26 TFLOPS). So we thought, what if we could combine the DGX Spark & M3 Ultra, and make use of both the massive compute on the DGX Spark and the massive memory-bandwidth on the M3 Ultra. We came up with a way to split inference across both devices and achieve a speedup of up to 4x for long prompts compared to the M3 Ultra on its own. Full details in the blog post linked below.

NVIDIA sent us 2 DGX Sparks. For a while we wondered what we would do with them. The memory bandwidth is 273GB/s making it 3x slower than an M3 Ultra (819GB/s) for batch_size=1 inference. But it has 4x more FLOPS (100 TFLOPS compared to 26 TFLOPS). So we thought, what if we could combine the DGX Spark & M3 Ultra, and make use of both the massive compute on the DGX Spark and the massive memory-bandwidth on the M3 Ultra. We came up with a way to split inference across both devices and achieve a speedup of up to 4x for long prompts compared to the M3 Ultra on its own. Full details in the blog post linked below.

Alex Cheema

281,225 次观看 • 9 个月前

working with cloud agents in cursor feels the same as local, but the fact that i can shut down my computer and still have it running is amazing. one of my favorite use cases is planning in cursor, then send off to cloud agent for impl. here's a short demo

working with cloud agents in cursor feels the same as local, but the fact that i can shut down my computer and still have it running is amazing. one of my favorite use cases is planning in cursor, then send off to cloud agent for impl. here's a short demo

eric zakariasson

61,940 次观看 • 8 个月前

Jensen Huang says the industry is shifting from generative AI to agentic AI The next major step is fusing public cloud frontier models with customized open-source systems running on enterprise servers But the end goal is physical: "moving intelligence from the cloud into industrial AI, factories and robotics"

Jensen Huang says the industry is shifting from generative AI to agentic AI The next major step is fusing public cloud frontier models with customized open-source systems running on enterprise servers But the end goal is physical: "moving intelligence from the cloud into industrial AI, factories and robotics"

Haider.

49,411 次观看 • 6 个月前

NVIDIA is giving away free access to 140+ AI models for a full year > most people building AI agents are paying $50-200/month for API access NVIDIA just made that argument irrelevant models you get: GLM 5.2, MiniMax M3, Nemotron-3-Ultra-550B-A55B, Kimi K2.7, and 130+ more setup: > step 1 - get your free key > go to > register -> bind phone -> copy API key > step 2 - add to Hermes agent > open Settings -> Model Provider -> Custom base_url = " api_key = "nvapi-xxxxxxxxxxxxxxxxxxxx" > step 3 - pick a model model = "z-ai/glm-5.2" model = "minimaxai/minimax-m3" model = "nvidia/nemotron-3-ultra-550b-a55b" model = "moonshot-ai/kimi-k2.7" > Hermes already has NVIDIA set as default base_url > paste the key and you're running instantly > works the same in Cursor and OpenCode > cost: $0 > limit: 40 req/min > expires: 1 year while everyone is paying for API access, this is sitting there for free with this many models in one place, it is also the easiest way to test which one actually fits your agent before you commit to a paid provider

NVIDIA is giving away free access to 140+ AI models for a full year > most people building AI agents are paying $50-200/month for API access NVIDIA just made that argument irrelevant models you get: GLM 5.2, MiniMax M3, Nemotron-3-Ultra-550B-A55B, Kimi K2.7, and 130+ more setup: > step 1 - get your free key > go to > register -> bind phone -> copy API key > step 2 - add to Hermes agent > open Settings -> Model Provider -> Custom base_url = " api_key = "nvapi-xxxxxxxxxxxxxxxxxxxx" > step 3 - pick a model model = "z-ai/glm-5.2" model = "minimaxai/minimax-m3" model = "nvidia/nemotron-3-ultra-550b-a55b" model = "moonshot-ai/kimi-k2.7" > Hermes already has NVIDIA set as default base_url > paste the key and you're running instantly > works the same in Cursor and OpenCode > cost: $0 > limit: 40 req/min > expires: 1 year while everyone is paying for API access, this is sitting there for free with this many models in one place, it is also the easiest way to test which one actually fits your agent before you commit to a paid provider

Mr. Buzzoni

83,469 次观看 • 9 天前

Health Secretary Wes Streeting MP celebrates a 6% rise in NHS satisfaction, with more than one in four adults now 'very' or 'quite' satisfied with the NHS, but says there is still work to be done.

Health Secretary Wes Streeting MP celebrates a 6% rise in NHS satisfaction, with more than one in four adults now 'very' or 'quite' satisfied with the NHS, but says there is still work to be done.

GB News

10,741 次观看 • 4 个月前