
Matt Beton
@MattBeton • 2,570 subscribers
ml systems & algos | prev @exolabs, maths @Cambridge_Uni
Videos

Linear scaling achieved with multiple DeepSeek v3.1 instances. 4x macs = 4x throughput. 2x M3 Ultra Mac Studios = 1x DeepSeek @ 14 tok/sec 4x M3 Ultra Mac Studios = 2x DeepSeek @ 28 tok/sec DeepSeek V3.1 is a 671B parameter model - so at its native 8-bit quantization, it requires ~700GB of memory to run the model. EXO puts half of the layers on each device, combining their memory. EXO uses MLX distributed with TB5 interconnect, optimized for Apple Silicon. If we need higher throughput, adding two more devices lets us serve more users at once. EXO Labs handles all of this seamlessly - adding more devices to the cluster for linear scaling as we need it. The new EXO 1.0 will be open-source soonTM
Matt Beton158,485 просмотров • 9 месяцев назад
Больше нет контента для загрузки