EXO Labs

@exolabs · 50,419 subscribers

Frontier AI on local hardware. EXO 1.0 is now open-source (Apache 2.0): https://t.co/SGGGK784Qp

Shorts

Running DeepSeek-V3 on M4 Mac Mini AI Cluster. 671B MoE model distributed across 8 M4 Pro 64GB Mac Minis. Apple Silicon with unified memory is a great fit for MoE.

719,005 views
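
As a rough check on why the model has to be spread over all eight machines, the weight footprint can be estimated from the parameter count. This is a back-of-the-envelope sketch only: the description does not state which quantization was used, so the bit widths below are assumptions, and the estimate ignores KV cache, activations, and OS overhead.

```python
def weights_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return params_billion * bits_per_weight / 8


# 8 Mac Minis with 64 GB of unified memory each (figures from the description).
cluster_gb = 8 * 64

# Bit widths are illustrative assumptions, not taken from the source.
for bits in (16, 8, 4):
    need = weights_gb(671, bits)
    print(f"{bits}-bit: {need:.1f} GB needed, {cluster_gb} GB available, fits: {need < cluster_gb}")
```

Under these assumptions only the 4-bit case (about 335.5 GB) fits in the cluster's 512 GB of combined memory, and no single 64 GB machine could hold it alone.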

EXO 1.0 at the Apple NeurIPS booth. 4 x 512GB M3 Ultra Mac Studios clustered all-to-all with RDMA over Thunderbolt 5. Runs DeepSeek v3.2 (8-bit) at 25 tok/sec with tensor parallelism.

185,728 views

Distributed training on M4 Mac Mini cluster. We implemented Google DeepMind's DiLoCo on Apple Silicon to train large models with 100-1000x less bandwidth than a DDP baseline. AI is entering a new era where a distributed network of consumer devices can train large models.

347,553 views
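
The bandwidth saving in DiLoCo comes from synchronizing only once every many local steps rather than after every step, as DDP does. The toy below is a minimal sketch of that structure, not EXO's implementation: it uses a 1-D least-squares problem, plain gradient descent for the inner optimizer (the DiLoCo paper uses AdamW), and a Nesterov-momentum outer step. All names and hyperparameters are hypothetical.

```python
def local_steps(theta, shard, steps, lr):
    """Inner loop: gradient descent on 0.5 * (theta - mean(shard))**2, no communication."""
    target = sum(shard) / len(shard)
    for _ in range(steps):
        theta -= lr * (theta - target)
    return theta


def diloco_toy(shards, rounds=20, inner_steps=100, inner_lr=0.05,
               outer_lr=0.7, momentum=0.9):
    theta, velocity, syncs = 0.0, 0.0, 0
    for _ in range(rounds):
        # Every worker starts from the shared parameters and trains on its own shard.
        local = [local_steps(theta, shard, inner_steps, inner_lr) for shard in shards]
        # One synchronization per round: the outer "pseudo-gradient" is the shared
        # parameters minus the average of the workers' parameters.
        delta = theta - sum(local) / len(local)
        syncs += 1
        # Outer step with Nesterov momentum.
        velocity = momentum * velocity + delta
        theta -= outer_lr * (delta + momentum * velocity)
    return theta, syncs


theta, syncs = diloco_toy([[1.0, 2.0, 3.0], [5.0, 6.0, 7.0]])
print(f"theta={theta:.4f} after {syncs} syncs (per-step sync would need {20 * 100})")
```

In this sketch the workers communicate 20 times instead of 2,000, a 100x reduction that scales with the number of inner steps per round.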

Day 1: Benchmarks. We ran 1000+ LLM benchmarks on real consumer devices. Data includes single-device and multi-device clusters with Tokens-Per-Second (TPS) and Time-To-First-Token (TTFT). Setups tested: 3x M4 Mac Mini cluster, iPhone 15 + S24, RTX 4090 & more.

254,968 views
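
For readers unfamiliar with the two metrics, here is a minimal sketch of how TTFT and TPS can be measured from any token stream. It is not EXO's benchmark harness: `measure` and `fake_model` are hypothetical names, and TPS is defined here as decode throughput after the first token, which is one common convention among several.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure(stream: Iterable[str]) -> Tuple[float, float, int]:
    """Return (ttft_seconds, tokens_per_second, token_count) for a token stream."""
    start = time.perf_counter()
    first = None
    count = 0
    for _ in stream:
        if first is None:
            first = time.perf_counter()  # arrival time of the first token
        count += 1
    end = time.perf_counter()
    if first is None:
        raise ValueError("stream produced no tokens")
    decode_time = end - first
    # Assumption: TPS counts only tokens generated after the first one.
    tps = (count - 1) / decode_time if count > 1 and decode_time > 0 else 0.0
    return first - start, tps, count


def fake_model(n_tokens: int, prefill_s: float, per_token_s: float) -> Iterator[str]:
    """Stand-in for a real model: sleeps to imitate prefill and per-token decode."""
    time.sleep(prefill_s)
    for i in range(n_tokens):
        if i:
            time.sleep(per_token_s)
        yield f"tok{i}"


ttft, tps, n = measure(fake_model(5, prefill_s=0.05, per_token_s=0.01))
print(f"TTFT={ttft * 1000:.0f} ms, TPS={tps:.0f}, tokens={n}")
```

Separating the two numbers matters because TTFT is dominated by prefill, while TPS reflects steady-state decode speed.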

First look at SPARTA, a distributed AI training algorithm that avoids synchronization by randomly exchanging sparse sets of parameters, with a 1,000x reduction in inter-GPU communication. This enables training of large models over slow links without specialized infrastructure. SPARTA works on its own but can also be combined with sync-based low-communication training algorithms like DiLoCo for even better performance.

99,289 views
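
The core idea of exchanging a random sparse subset of parameters can be sketched in a few lines. This is an illustration under stated assumptions rather than the SPARTA algorithm itself: the description does not say how exchanged values are combined or what fraction is sent, so the pairwise averaging and the 0.1% fraction below are assumptions, and `sparse_exchange` is a hypothetical name.

```python
import random


def sparse_exchange(a, b, fraction, rng):
    """Average a random subset of entries between two workers' parameter lists.

    Only the selected entries would cross the network. Returns the chosen indices.
    """
    k = max(1, round(len(a) * fraction))
    idx = rng.sample(range(len(a)), k)
    for i in idx:
        a[i] = b[i] = 0.5 * (a[i] + b[i])  # assumption: combine by averaging
    return idx


rng = random.Random(0)
worker_a = [0.0] * 10_000
worker_b = [1.0] * 10_000
sent = sparse_exchange(worker_a, worker_b, fraction=0.001, rng=rng)
print(f"exchanged {len(sent)} of {len(worker_a)} parameters "
      f"({len(worker_a) // len(sent)}x fewer values than a full sync)")
```

With a fraction of 0.001, each exchange moves 1,000x fewer values than sending the full parameter vector.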

“EXO, get me in the mood for the holidays” A home assistant you can trust. It’s 100% Open-Source and runs locally. AI that’s aligned with you, not a for-profit company.

65,515 views

But the KV cache is created for each transformer layer. By sending each layer’s KV cache after it’s computed, we overlap communication with computation. We stream the KV cache and hide the network delay. We achieve a 4x speedup in prefill & 3x in decode, with 0 network delay.

22,604 views
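
The overlap described above can be illustrated with a small timing simulation. This is a sketch only, not EXO's code: `time.sleep` stands in for both layer computation and network transfer, the function names are hypothetical, and the transfer is assumed to take no longer than one layer's compute so that it can be hidden.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def compute_layer(i, compute_s):
    time.sleep(compute_s)  # stand-in for one transformer layer's prefill
    return f"kv{i}"


def send(kv, send_s, received):
    time.sleep(send_s)  # stand-in for shipping one layer's KV cache
    received.append(kv)


def prefill_then_send(layers, compute_s, send_s):
    """Baseline: finish all layers, then transfer every KV cache."""
    received = []
    start = time.perf_counter()
    kvs = [compute_layer(i, compute_s) for i in range(layers)]
    for kv in kvs:
        send(kv, send_s, received)
    return time.perf_counter() - start, received


def prefill_streamed(layers, compute_s, send_s):
    """Streamed: send layer i's KV cache in the background while layer i+1 computes."""
    received = []
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=1) as sender:  # one worker keeps layer order
        for i in range(layers):
            kv = compute_layer(i, compute_s)
            sender.submit(send, kv, send_s, received)
    # Leaving the with-block waits for the last in-flight send.
    return time.perf_counter() - start, received


baseline, _ = prefill_then_send(8, 0.02, 0.02)
streamed, _ = prefill_streamed(8, 0.02, 0.02)
print(f"baseline {baseline:.2f}s, streamed {streamed:.2f}s")
```

In the streamed version only the final layer's transfer remains exposed; every other transfer runs while the next layer is being computed.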
