Name: Got continuous batching working with SSMs in mlx-lm. Here's four OpenCode agents simultaneously running Nvidia's Nemotron Nano on 64GB M4 Max. This is a nice model for smaller machines since it's MoE + hybrid attention (small cache).
Uploaded: 2026-01-08T23:00:53.000Z
Duration: PT18.008S
Channel: Awni Hannun
Description: Short video from Awni Hannun. Got continuous batching working with SSMs in mlx-lm. Here are four OpenCode agents simultaneously running Nvidia's Nemotron Nano on a 64GB M4 Max. This is a nice model for smaller machines since it's MoE + hybrid attention (small cache).

Awni Hannun

35,078 views • 5 months ago
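The description credits two properties: continuous batching that works with SSM layers, and a small cache thanks to hybrid attention. The Python sketch below illustrates both ideas under loudly hypothetical assumptions. The layer counts, head sizes, and the 1 MiB per-layer SSM state are made-up round numbers, not Nemotron Nano's configuration. `cache_bytes`, `continuous_batching`, and `Request` are hypothetical helpers, not mlx-lm APIs, and nothing here claims to be how mlx-lm implements batching. The first part shows that attention-layer cache grows with sequence length while SSM state stays fixed. The second part shows a scheduler that admits waiting requests into free slots and evicts finished ones after every decode step.

```python
from collections import deque
from dataclasses import dataclass

# --- Cache size: full attention vs. hybrid (illustrative numbers only) ---
# All sizes below are hypothetical assumptions, not Nemotron Nano's real config.
N_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_ELEM = 2              # fp16 / bf16
SSM_STATE_BYTES = 1 << 20       # assumed fixed per-layer recurrent state (1 MiB)


def kv_bytes(seq_len: int) -> int:
    """Keys + values for one attention layer and one sequence; grows with seq_len."""
    return 2 * N_KV_HEADS * HEAD_DIM * seq_len * BYTES_PER_ELEM


def cache_bytes(layers: list[str], seq_len: int) -> int:
    """Per-sequence cache: attention layers scale with length, SSM layers do not."""
    return sum(kv_bytes(seq_len) if kind == "attn" else SSM_STATE_BYTES
               for kind in layers)


# --- Continuous batching: admit and evict sequences every decode step ---
@dataclass
class Request:
    rid: int
    max_new_tokens: int
    generated: int = 0


def continuous_batching(requests: list[Request], max_batch: int) -> tuple[int, list[int]]:
    """Return (decode steps, finish order). Toy model: one token per step per sequence."""
    waiting = deque(requests)
    active: list[Request] = []
    finished: list[int] = []
    steps = 0
    while waiting or active:
        # Fill free slots. With a fixed-size SSM state, a joining sequence adds one
        # state entry; its recurrent state does not depend on the others' lengths.
        while waiting and len(active) < max_batch:
            active.append(waiting.popleft())
        for r in active:            # one decode step for every active sequence
            r.generated += 1
        steps += 1
        # Evict finished sequences right away so their slot is reused next step.
        finished.extend(r.rid for r in active if r.generated >= r.max_new_tokens)
        active = [r for r in active if r.generated < r.max_new_tokens]
    return steps, finished


if __name__ == "__main__":
    mib = 2 ** 20
    full = ["attn"] * 48
    hybrid = ["attn"] * 8 + ["ssm"] * 40
    print(cache_bytes(full, 8192) // mib, cache_bytes(hybrid, 8192) // mib)  # 1536 296
    reqs = [Request(i, n) for i, n in enumerate([4, 1, 1, 4])]
    print(continuous_batching(reqs, max_batch=2))  # (6, [1, 2, 0, 3])
```

With these made-up numbers, an 8192-token sequence needs 1536 MiB of cache in the all-attention layout but 296 MiB in the hybrid one. The four requests finish in 6 decode steps, whereas running them as two fixed pairs would take 4 + 4 = 8 steps, because each pair waits for its longest member. The sketch ignores that the remaining attention layers still hold per-sequence KV caches of different lengths, which a real batched implementation must handle.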


