
Alex Cheema
@alexocheema • 47,956 subscribers
building @exolabs | prev @UniOfOxford We're hiring: https://t.co/UlkApFndnH

2 MacBooks is all you need. Llama 3.1 405B running distributed across 2 MacBooks using exo intern home AI cluster
Alex Cheema • 1,287,557 views • 1 year ago

"Somebody got one of the small versions of Llama to run on Windows 98...We could've been talking to our computers in English for the last 30 years" - Marc Andreessen 🇺🇸 It was me! I got Llama running on a Pentium II machine with 128MB RAM running Windows 98. Details below.
Alex Cheema • 765,804 views • 1 year ago

NVIDIA sent us 2 DGX Sparks. For a while we wondered what we would do with them. The memory bandwidth is 273GB/s making it 3x slower than an M3 Ultra (819GB/s) for batch_size=1 inference. But it has 4x more FLOPS (100 TFLOPS compared to 26 TFLOPS). So we thought, what if we could combine the DGX Spark & M3 Ultra, and make use of both the massive compute on the DGX Spark and the massive memory-bandwidth on the M3 Ultra. We came up with a way to split inference across both devices and achieve a speedup of up to 4x for long prompts compared to the M3 Ultra on its own. Full details in the blog post linked below.
Alex Cheema • 281,204 views • 7 months ago
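The hardware comparison in the post above can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the four figures quoted in the post; the constant names and the `better_device` helper are illustrative assumptions, and nothing here reproduces the actual split-inference method, which the post defers to the linked blog post.

```python
# Figures quoted in the post; none of them are measured here.
SPARK_BANDWIDTH_GBPS = 273.0     # DGX Spark memory bandwidth
M3_ULTRA_BANDWIDTH_GBPS = 819.0  # M3 Ultra memory bandwidth
SPARK_TFLOPS = 100.0             # DGX Spark compute
M3_ULTRA_TFLOPS = 26.0           # M3 Ultra compute

# How much faster the M3 Ultra can stream weights from memory.
bandwidth_ratio = M3_ULTRA_BANDWIDTH_GBPS / SPARK_BANDWIDTH_GBPS

# How much more raw compute the DGX Spark has.
compute_ratio = SPARK_TFLOPS / M3_ULTRA_TFLOPS


def better_device(bottleneck: str) -> str:
    """Hypothetical helper: pick the device that is stronger for a given bottleneck.

    ASSUMPTION: `bottleneck` is either "compute" or "bandwidth". This only
    compares the quoted specs; it is not a scheduler.
    """
    if bottleneck == "compute":
        return "DGX Spark" if SPARK_TFLOPS > M3_ULTRA_TFLOPS else "M3 Ultra"
    if bottleneck == "bandwidth":
        return "M3 Ultra" if M3_ULTRA_BANDWIDTH_GBPS > SPARK_BANDWIDTH_GBPS else "DGX Spark"
    raise ValueError(f"unknown bottleneck: {bottleneck!r}")


print(f"M3 Ultra bandwidth advantage: {bandwidth_ratio:.2f}x")
print(f"DGX Spark compute advantage: {compute_ratio:.2f}x")
```

The bandwidth ratio comes out at exactly 3x, and the compute ratio at about 3.85x, which the post rounds to 4x.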

NVIDIA DGX Spark. World’s smallest AI supercomputer with 128GB memory
Alex Cheema • 312,051 views • 1 year ago

Running GLM-4.7-Flash on 4 x M4 Pro Mac Minis using EXO Labs. Uses tensor parallelism with RDMA over Thunderbolt & MLX backend (h/t Awni Hannun). Runs at 100 tok/sec. We're working on optimizing this at EXO Labs. Aiming to hit ~200 tok/sec on this setup soon.
Alex Cheema • 62,069 views • 4 months ago

Running GLM 4.7 Flash (8-bit) with Tensor Parallel / RDMA on 2 M4 Pro Mac Minis at 60 tok/sec. mlx-lm 0.30.5 features huge speedups for GLM 4.7 Flash for long context (h/t N8 Programs & Awni Hannun). M5 Pro (~28 Jan) will have ~4x faster prefill and ~1.3x faster decode.
Alex Cheema • 55,943 views • 4 months ago

"Mac Minis for example are a very good fit" - Andrej Karpathy
Andrej Karpathy shouted out my work on EXO Labs in his keynote at Y Combinator AI SUS! Here's the breakdown:
Right now most AI workloads run in the cloud, where requests from different users are continuously batched together. These workloads are FLOPS-bound and favor hardware with the best unit economics of $ per FLOP, i.e. enterprise GPUs.
The personal computing revolution will shift these workloads to personal devices with lower batch sizes (mostly batch_size=1). batch_size=1 inference is memory-bound, because all of the model parameters need to be loaded into the GPU every time a token is generated.
Apple Silicon with its Unified Memory architecture has a lot of memory and memory bandwidth per $ compared to other hardware:
- M4 Pro Mac Mini, 24GB @ 273GB/s, $58.33/GB, $5.13/GB/s
- H100, 80GB @ 3350GB/s, $625/GB, $14.93/GB/s
The unit economics of Apple Silicon are becoming more compelling with every release of the Mac. The future of AI inference looks more like open weights models (OpenAI open weights model soon™) run at low batch_size on personal devices.
Alex Cheema • 103,487 views • 11 months ago
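The unit-economics figures in the post above can be reproduced with a short calculation. This is a sketch under stated assumptions: the prices ($1,400 and $50,000) are not given in the post and are back-computed from its $/GB figures, and the `Device` class, its method names, and the 16 GB model size are hypothetical. The decode bound simply applies the post's statement that every parameter is read once per generated token.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    name: str
    memory_gb: float
    bandwidth_gbps: float
    price_usd: float  # ASSUMPTION: back-computed from the post's $/GB figures

    @property
    def usd_per_gb(self) -> float:
        """Dollars per GB of memory."""
        return self.price_usd / self.memory_gb

    @property
    def usd_per_gbps(self) -> float:
        """Dollars per GB/s of memory bandwidth."""
        return self.price_usd / self.bandwidth_gbps

    def decode_tokens_per_sec_bound(self, model_gb: float) -> float:
        """Rough upper bound on batch_size=1 decode speed.

        Assumes all `model_gb` of weights are read once per token and that
        memory bandwidth is the only limit (no compute or overhead).
        """
        return self.bandwidth_gbps / model_gb


mac_mini = Device("M4 Pro Mac Mini", memory_gb=24, bandwidth_gbps=273, price_usd=1400)
h100 = Device("H100", memory_gb=80, bandwidth_gbps=3350, price_usd=50000)

for device in (mac_mini, h100):
    print(
        f"{device.name}: ${device.usd_per_gb:.2f}/GB, "
        f"${device.usd_per_gbps:.2f}/GB/s"
    )

# Hypothetical 16 GB model, chosen only to illustrate the bound.
print(f"Mac Mini decode bound: {mac_mini.decode_tokens_per_sec_bound(16):.1f} tok/sec")
```

With the assumed prices, the per-GB and per-GB/s costs round to the same values the post lists for both devices.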

Latest Fireship video featured my run of DeepSeek R1 on M4 Mac Minis. Apple Silicon dominates in memory/bw unit economics, ideal for huge MoE models like R1 at batch_size=1 (the real-world use-case). SOTA AI may come from China, but it will run on American hardware.
Alex Cheema - e/acc • 79,775 views • 1 year ago