Got continuous batching working with SSMs in mlx-lm. Here are four OpenCode agents simultaneously running Nvidia's Nemotron Nano on a 64GB M4 Max. This is a nice model for smaller machines since it's MoE + hybrid attention (small cache).

Awni Hannun

43,926 subscribers

35,078 views • 5 months ago • via X (Twitter)

Related videos

Running DeepSeek-V3 on M4 Mac Mini AI Cluster 671B MoE model distributed across 8 M4 Pro 64GB Mac Minis. Apple Silicon with unified memory is a great fit for MoE.

EXO Labs

719,005 views • 1 year ago

No-one: But can you do 16 generations on your M4 laptop simultaneously? MLX LM:

Awni Hannun

46,713 views • 8 months ago

Batching for vision models is now available in Beta with our latest MLX engine update 👾 The updated engine also brings major improvements to caching for faster inference overall. Turn on Developer Mode, choose the beta runtime channel, and select LM Studio MLX v1.8.1.

LM Studio

46,015 views • 27 days ago

First steps for a specialized DeepSeek v4 Flash inference engine focused on inference quality / stability at different quantizations, with networked API that is batching capable. This is the 2 bit quants model running on my M3 Max 128GB.

antirez

14,159 views • 1 month ago

Qwen QwQ 32B fp16 on M4 Max and M2 Ultra powered by MLX! M2 Ultra - 10.2 toks/s M4 Max - 7.6 toks/s! "Create an amazing animation using p5js" o1-mini level local model! Note: use temp 0.7-0.75 for optimal results in coding. I did some tests and this

Ivan Fioravanti ᯅ

62,377 views • 1 year ago

How much faster is the new MacBook Pro for AI inference? M4 Max is 27% faster with 72 tok/sec compared to 56 tok/sec of the M3 Max with MLX running Gemma 2 9B (4bit). The 27% speedup is the same with Llama-3.2-1b, Llama-3.2-3b and others. Next up: EXO Labs M4 cluster.

Alex Cheema - e/acc

527,894 views • 1 year ago

Another demo of the iPhone 17 Pro’s on-device LLM performance This time with Ling mini 2.0 by InclusionAI, a 16B MoE model with 1.4B active parameters running at ~120tk/s Thanks to Awni Hannun for the MLX DWQ 2-bit quants

Adrien Grondin

46,205 views • 8 months ago

NVIDIA Nemotron 3 Nano Omni, a new multimodal reasoning model, is now live on Jetson AI Lab and unifies vision, audio, and language into a single reasoning loop. 🙌 Power your NemoClaws by running this model with Ollama, vLLM and other inference frameworks on NVIDIA Jetson hardware. Try it ➡️

NVIDIA Robotics

15,828 views • 1 month ago

DeepSeek-Prover (4-bit 7B) running at 114 toks/sec in MLX LM on an M2 Ultra Thanks to for the port!

Awni Hannun

16,077 views • 1 year ago

M4 Mac Mini AI Cluster Uses EXO Labs with Thunderbolt 5 interconnect (80Gbps) to run LLMs distributed across 4 M4 Pro Mac Minis. The cluster is small (iPhone for reference). It’s running Nemotron 70B at 8 tok/sec and scales to Llama 405B (benchmarks soon).

Alex Cheema

3,515,929 views • 1 year ago

Tested the new MacBook Pro M4 Pro vs. the Mac mini M1 in LM Studio, running Llama 3.2 3B 4-bit MLX. Results: M1 (8-Core) ------> 25 tok/sec M4 Pro (14-Core) -> 104 tok/sec 🤯 – 4x faster!

01000010

111,446 views • 1 year ago

Managed to get Ling Mini 16B (1.4B active) running on my iPhone Air. It runs very fast with MLX. It's a DWQ of Ling Mini quantized to 3 bits-per-weight. A 16B model running on an Air at this speed is pretty awesome:

Awni Hannun

92,422 views • 8 months ago

Sam 3 by Facebook now on MLX 🚀 Here is a realtime object tracking running on M3 Max 96GB.

Prince Canuma

180,245 views • 2 months ago

DeepSeek R1 Qwen 7B 4bit M2 Ultra vs M4 Max on Apple MLX 🤫 Let them think... (video 4x in center part) M2 Ultra: 114.9 tokens per sec M4 Max (14"): 88.3 tokens per sec

Ivan Fioravanti ᯅ

59,734 views • 1 year ago

Currently working on a retro inspired action horror game with a small team DDDistortion Here's the test lowpoly player model I made

Kathy (Prii)

83,894 views • 2 years ago

Introducing MON Protocol Partner - Hybrid Hybrid is an Ethereum-based Layer 2 blockchain that integrates a Mixture of Experts (MoE) framework, enabling easy creation and monetization of AI agents in a plug-and-play approach. More about Hybrid here: Hybrid

MON Protocol 🐉 $MON

182,631 views • 2 years ago

Qwen 3.5 0.8B, Gated DeltaNet attention is running on Apple Neural Engine ~56 t/s in LUT6 quantization with some room for optimization left. It is CoreML, Swift and IOSurface on M4Pro. It will slow down as we increase context, but not by much. I think Private API opens the way to integrate ANE with GPU/MLX and possibly some MoE.

Anemll

13,589 views • 3 months ago

Gotta love MoEs on Apple silicon with MLX. Kimi's new 16B (3B active) Moonshot model runs very nicely on an M4 Max. As good or better than some of the best dense 7Bs and 1.5x faster inference (154 toks/sec!):

Awni Hannun

23,209 views • 1 year ago

Sparsely activated models like MOEs and Apple silicon + MLX are a great match. - Lots of RAM to hold the entire model in memory (not just the active parameters). For an MOE at each token you access basically a random subset of the model. Swapping large parts of the model to "disk" from token-to-token is too slow. - Comparatively you don't need as much memory bandwidth. Only a small fraction of the weights are used per token. In the case of DeepSeek v3 37B / 671B are active. So only ~5% of the weights are moved to GPU cache / register for each token. (SVG animation made with the help of DeepSeek V2 1210 + MLX on an M2 Ultra)

Awni Hannun

27,452 views • 1 year ago
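
As a rough check on the DeepSeek v3 figures quoted in the post above (37B active out of 671B total parameters), the sketch below computes the active fraction and an approximate weight footprint. The function names and the 4-bit quantization are illustrative assumptions, not something stated in the post, and the estimate ignores the KV cache, activations, and quantization overhead such as scales.

```python
def active_fraction(active_params_b: float, total_params_b: float) -> float:
    """Share of the weights touched per token (parameter counts in billions)."""
    return active_params_b / total_params_b


def weight_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB: billions of parameters times bytes per weight."""
    return params_b * bits_per_weight / 8


# Figures from the post: 37B active, 671B total.
frac = active_fraction(37, 671)
print(f"active fraction: {frac:.1%}")

# Assumption: 4 bits per weight, chosen only for illustration.
print(f"resident weights: {weight_gb(671, 4)} GB")
print(f"weights read per token: {weight_gb(37, 4)} GB")
```

Under these assumptions the whole model has to sit in memory, while only the active share is read for each token, which is the RAM-versus-bandwidth trade-off the post describes.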

Mac owners don't miss this: MLX LM is now integrated directly within Hugging Face 🤯 ⬇️ Run 4,400+ LLMs locally on Apple Silicon at max speed, no cloud, no wait.

Victor M

204,554 views • 1 year ago