Alex Cheema

@alexocheema • 52,248 subscribers

building @exolabs | prev @UniOfOxford. We're hiring: https://t.co/UlkApFndnH

Shorts

Running Kimi K2.5 on my desk. Runs at 24 tok/sec with 2 x 512GB M3 Ultra Mac Studios connected with Thunderbolt 5 (RDMA) using EXO Labs / MLX backend. Yes, it can run clawdbot.

3,283,246 views

M4 Mac Mini AI Cluster. Uses EXO Labs with Thunderbolt 5 interconnect (80Gbps) to run LLMs distributed across 4 M4 Pro Mac Minis. The cluster is small (iPhone for reference). It’s running Nemotron 70B at 8 tok/sec and scales to Llama 405B (benchmarks soon).

3,516,680 views

AGI at home. Running DeepSeek R1 across my 7 M4 Pro Mac Minis and 1 M4 Max MacBook Pro. Total unified memory = 496GB. Uses EXO Labs distributed inference with 4-bit quantization. Next goal is fp8 (requires >700GB).

1,934,910 views
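
As a rough check on those memory figures, the weights-only footprint can be estimated from the parameter count and the bits per weight. This is a minimal sketch, not the author's calculation: the 671B parameter count is the figure quoted for the full DeepSeek R1 elsewhere on this page, `weight_footprint_gb` is a hypothetical helper, and KV cache, quantization scales and runtime overhead are ignored, which is presumably why the fp8 target is quoted as more than 700GB rather than 671GB.

```python
def weight_footprint_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate size of the model weights alone, in decimal GB."""
    return n_params * bits_per_weight / 8 / 1e9


R1_PARAMS = 671e9          # full DeepSeek R1, as quoted on this page
CLUSTER_MEMORY_GB = 496    # 7 M4 Pro Mac Minis + 1 M4 Max MacBook Pro

for label, bits in [("4-bit", 4), ("fp8", 8)]:
    size_gb = weight_footprint_gb(R1_PARAMS, bits)
    verdict = "fits in" if size_gb < CLUSTER_MEMORY_GB else "exceeds"
    print(f"{label}: ~{size_gb:.1f} GB of weights, {verdict} {CLUSTER_MEMORY_GB} GB")
```

Under these assumptions the 4-bit weights leave headroom on the 496GB cluster, while the fp8 weights alone are larger than the whole cluster's memory.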

Running DeepSeek R1 on my desk. Uses EXO Labs with Thunderbolt 5 interconnect (80Gbps) to run the full (671B, 8-bit) DeepSeek R1 distributed across 2 M3 Ultra 512GB Mac Studios (1TB total Unified Memory). Runs at 11 tok/sec. Theoretical max is ~20 tok/sec.

992,292 views
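
One way to land on a ceiling in that range is a bandwidth-bound estimate: at batch_size=1, every generated token has to read the active weights from memory once. The sketch below is a reconstruction under stated assumptions, and may not be how the ~20 tok/sec figure was actually derived. It assumes 37B activated parameters per token (from DeepSeek's model card, not from this post), 819GB/s of memory bandwidth per M3 Ultra (quoted in the DGX Spark post on this page), and that the two Mac Studios process their layers one after the other, so the effective bandwidth is that of a single machine. `decode_ceiling_tok_per_sec` is a hypothetical helper.

```python
def decode_ceiling_tok_per_sec(
    active_params: float, bits_per_weight: int, bandwidth_gb_s: float
) -> float:
    """Upper bound on decode speed when each token reads the active weights once."""
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


ACTIVE_PARAMS = 37e9        # assumption: activated parameters per token for R1
M3_ULTRA_BW_GB_S = 819.0    # memory bandwidth quoted on this page
MEASURED_TOK_PER_SEC = 11.0

ceiling = decode_ceiling_tok_per_sec(ACTIVE_PARAMS, 8, M3_ULTRA_BW_GB_S)
print(f"bandwidth-bound ceiling: ~{ceiling:.1f} tok/sec")
print(f"measured / ceiling: {MEASURED_TOK_PER_SEC / ceiling:.0%}")
```

The estimate ignores KV cache reads, interconnect latency and kernel overhead, so real throughput sits below it.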

35.9 tok/sec on Windows 98 🤯 This is a 260K LLM with Llama-architecture. We also tried out larger models. Results in the blog post.

569,400 views

How much faster is the new MacBook Pro for AI inference? M4 Max is 27% faster with 72 tok/sec compared to 56 tok/sec of the M3 Max with MLX running Gemma 2 9B (4bit). The 27% speedup is the same with Llama-3.2-1b, Llama-3.2-3b and others. Next up: EXO Labs M4 cluster.

529,690 views

Running the full GLM 4.7 (8-bit) on 2 x 512GB M3 Ultra Mac Studios. Runs at 19.8 tok/sec with EXO Labs MLX RDMA backend (h/t Awni Hannun) & tensor parallel.

156,668 views

"It's on my todo to build my own stack of apple minis and run Llama on my own little cluster on my desk." - Garry Tan

Quick AI homelab tutorial:
- Install EXO Labs on each mini (open source, steps in README).
- Make sure all the minis are on the same WiFi network or, for faster/more reliable results, connected over Thunderbolt/Ethernet. Devices will auto-discover each other and auto-shard the LLMs.
- Now you can chat to your cluster with your preferred LLM GUI (the tiny corp tinychat ships with exo).

243,248 views
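
To give a feel for what "auto-shard" in the tutorial above could mean, here is a minimal sketch that splits a model's layers across devices in proportion to their memory. It is an illustration only and is not taken from the exo codebase: `partition_layers`, the device names, the memory sizes and the 80-layer count are all hypothetical.

```python
def partition_layers(n_layers: int, memory_gb: dict[str, float]) -> dict[str, range]:
    """Assign contiguous layer ranges to devices, weighted by memory."""
    total = sum(memory_gb.values())
    shards: dict[str, range] = {}
    start = 0
    cumulative = 0.0
    for name, mem in memory_gb.items():
        cumulative += mem
        # Round the cumulative boundary so the ranges tile [0, n_layers) exactly.
        end = round(n_layers * cumulative / total)
        shards[name] = range(start, end)
        start = end
    return shards


# Hypothetical cluster: three minis and one larger laptop.
cluster = {"mini-1": 64, "mini-2": 64, "mini-3": 64, "macbook": 128}
for device, layers in partition_layers(80, cluster).items():
    print(f"{device}: layers {layers.start}..{layers.stop - 1} ({len(layers)} layers)")
```

Each device then only needs to hold its own slice of the weights, and activations are passed from one device to the next during inference.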

Preparing to run the new Llama 3.3 70B on my mac cluster. Downloading model shards onto 3 x M4 Pro Mac Mini and 1 x M3 Max MacBook Pro. AI cluster is connected by Gigabit ethernet switch with EXO Labs

181,917 views

M4 Mac Mini AI cluster running Llama 3.3 70B. Preliminary results show Gigabit ethernet switch (4 tok/sec) is slower than Thunderbolt-5 with EXO Labs. Benchmarks including Thunderbolt-5 coming soon™.

73,090 views

Burning an LLM onto a CD. Guess why? Best answer gets a mac mini cake.

17,612 views

Real-time distributed inference monitoring is live on exo intern home AI cluster. It comes with out-of-the-box support for Prometheus metrics and Grafana dashboards. Helps diagnose performance bottlenecks in your exo cluster.

17,798 views

Videos

GLM-5.2 running on Mac Studio Clusters or RTX GPU rigs isn't how Local AI will get mass adopted. For mass adoption we need to focus on two axes:
1. Intelligence/Speed Frontier. This is being tracked on local dot ai. Aggressive co-design is needed across the entire stack to push up the frontier specifically for local hardware. Better kernels, inference tricks like spec decoding, quantizing models, model routing.
2. Usability / UX. To use Local AI today, you need to install CLIs, select a model, a quantization, a harness that works best with that model, and you need to configure your inference engine. It's not very accessible. More work needs to be done on making this seamless, so the user doesn't even know the AI is running locally.

30,443 views • 12 days ago

2 MacBooks is all you need. Llama 3.1 405B running distributed across 2 MacBooks using exo intern home AI cluster

1,288,858 views • 2 years ago

"Somebody got one of the small versions of Llama to run on Windows 98...We could've been talking to our computers in English for the last 30 years" - Marc Andreessen 🇺🇸 It was me! I got Llama running on a Pentium II machine with 128MB RAM running Windows 98. Details below.

765,909 views • 1 year ago

NVIDIA sent us 2 DGX Sparks. For a while we wondered what we would do with them. The memory bandwidth is 273GB/s making it 3x slower than an M3 Ultra (819GB/s) for batch_size=1 inference. But it has 4x more FLOPS (100 TFLOPS compared to 26 TFLOPS). So we thought, what if we could combine the DGX Spark & M3 Ultra, and make use of both the massive compute on the DGX Spark and the massive memory-bandwidth on the M3 Ultra. We came up with a way to split inference across both devices and achieve a speedup of up to 4x for long prompts compared to the M3 Ultra on its own. Full details in the blog post linked below.

281,225 views • 9 months ago
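
The intuition behind that split can be illustrated with a toy cost model: prompt processing (prefill) is treated as compute-bound and token generation (decode) as memory-bandwidth-bound. This is a sketch under loud assumptions, not the method from the blog post. It runs all prefill on the DGX Spark and all decode on the M3 Ultra, uses the common approximation of about 2 FLOPs per parameter per token, ignores the cost of moving the KV cache between devices, and uses a hypothetical 8B-parameter, 8-bit model. The device figures are the ones quoted in the post; all function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    name: str
    tflops: float
    bandwidth_gb_s: float


SPARK = Device("DGX Spark", tflops=100.0, bandwidth_gb_s=273.0)
M3_ULTRA = Device("M3 Ultra", tflops=26.0, bandwidth_gb_s=819.0)


def prefill_seconds(dev: Device, params: float, prompt_tokens: int) -> float:
    # Compute-bound: roughly 2 FLOPs per parameter per prompt token.
    return 2 * params * prompt_tokens / (dev.tflops * 1e12)


def decode_seconds(dev: Device, params: float, bits: int, new_tokens: int) -> float:
    # Memory-bound: every generated token reads all weights once.
    return new_tokens * (params * bits / 8) / (dev.bandwidth_gb_s * 1e9)


def split_speedup(params: float, bits: int, prompt_tokens: int, new_tokens: int) -> float:
    """Speedup of (Spark prefill + M3 Ultra decode) over the M3 Ultra alone."""
    decode = decode_seconds(M3_ULTRA, params, bits, new_tokens)
    baseline = prefill_seconds(M3_ULTRA, params, prompt_tokens) + decode
    split = prefill_seconds(SPARK, params, prompt_tokens) + decode
    return baseline / split


for prompt_tokens in (100, 10_000, 100_000):
    s = split_speedup(params=8e9, bits=8, prompt_tokens=prompt_tokens, new_tokens=100)
    print(f"prompt of {prompt_tokens} tokens: ~{s:.2f}x")
```

In this model the speedup approaches the FLOPS ratio (100 / 26, a little under 4x) as the prompt gets longer, and stays close to 1x for short prompts where decode dominates.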

Running Qwen3.6 35B (vision) on 2 x M5 Max MacBook Pro with RDMA over Thunderbolt 5. It describes the image and identifies Apple Park correctly, but misidentifies John Ternus as Jeff Williams. Near instant response with prefix caching.

99,474 views • 3 months ago

M4 Mac AI Coding Cluster. Uses EXO Labs to run LLMs (here Qwen 2.5 Coder 32B at 18 tok/sec) distributed across 4 M4 Mac Minis (Thunderbolt 5 80Gbps) and a MacBook Pro M4 Max. Local alternative to Cursor (benchmark comparison soon).

517,579 views • 1 year ago

NVIDIA DGX Spark. World’s smallest AI supercomputer with 128GB memory

312,131 views • 1 year ago

Running GLM-4.7-Flash with OpenCode locally on M4 Max MacBook Pro. 4-bit model runs at 82 tok/sec. Prefill will get ~4x faster with M5 Max MacBook Pro (~28 Jan). EXO will also support disaggregating prefill and decode across devices, e.g. DGX Spark.

128,404 views • 5 months ago

Llama 3 running locally on iPhone with MLX. Built by exo intern team Mohamed Baioumy. h/t Awni Hannun MLX & Prince Canuma for the port.

295,820 views • 2 years ago

Llama 3.1 70b beamed to my iPhone from my exo intern home AI cluster of 2 MacBooks and 1 Mac Studio. My own private GPT-4 assistant at home / on the go.

243,431 views • 2 years ago

Running Llama-3-70B at home with exo intern. Combines the compute of all these devices to make one big GPU:
- iPhone 15 Pro Max
- iPad Pro M4
- Galaxy S24 Ultra
- MacBook Pro M2 and M3 Pro
- 2 x MSI NVIDIA GeForce RTX 4090 SUPRIM

Code is open source 👇

197,476 views • 2 years ago

Running GLM-4.7-Flash on 4 x M4 Pro Mac Minis using EXO Labs. Uses tensor parallelism with RDMA over Thunderbolt & MLX backend (h/t Awni Hannun). Runs at 100 tok/sec. We're working on optimizing this at EXO Labs. Aiming to hit ~200 tok/sec on this setup soon.

62,218 views • 6 months ago

Running GLM 4.7 Flash (8-bit) with Tensor Parallel / RDMA on 2 M4 Pro Mac Minis at 60 tok/sec. mlx-lm 0.30.5 features huge speedups for GLM 4.7 Flash for long context (h/t N8 Programs & Awni Hannun). M5 Pro (~28 Jan) will have ~4x faster prefill and ~1.3x faster decode.

56,555 views • 5 months ago

"Mac Minis for example are a very good fit" - Andrej Karpathy

Andrej Karpathy shouted out my work on EXO Labs in his keynote at Y Combinator AI SUS! Here's the breakdown:

Right now most AI workloads run in the cloud where requests from different users are continuously batched together. These workloads are FLOPS-bound and favor hardware with the best unit economics of $ per FLOP, i.e. enterprise GPUs. The personal computing revolution will shift these workloads to personal devices with lower batch sizes (mostly batch_size=1). batch_size=1 inference is memory-bound, because all of the model parameters need to be loaded into the GPU every time a token is generated.

Apple Silicon with its Unified Memory architecture has a lot of memory and memory bandwidth per $ compared to other hardware:
- M4 Pro Mac Mini, 24GB @ 273GB/s, $58.33/GB, $5.13/GB/s
- H100, 80GB @ 3350GB/s, $625/GB, $14.93/GB/s

The unit economics of Apple Silicon are becoming more compelling with every release of the Mac. The future of AI inference looks more like open weights models (OpenAI open weights model soon™) run at low batch_size on personal devices.

103,566 views • 1 year ago
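
The per-GB and per-GB/s figures in that comparison can be reproduced with simple division. The post does not state the prices it used, so the $1,400 and $50,000 values below are assumptions back-derived from the quoted ratios ($58.33 x 24GB and $625 x 80GB), and `unit_economics` is a hypothetical helper.

```python
def unit_economics(price_usd: float, memory_gb: float, bandwidth_gb_s: float) -> tuple[float, float]:
    """Return (dollars per GB of memory, dollars per GB/s of memory bandwidth)."""
    return round(price_usd / memory_gb, 2), round(price_usd / bandwidth_gb_s, 2)


# Prices are assumptions inferred from the ratios quoted in the post.
hardware = {
    "M4 Pro Mac Mini": (1_400, 24, 273),
    "H100": (50_000, 80, 3350),
}

for name, (price, mem_gb, bw_gb_s) in hardware.items():
    per_gb, per_gb_s = unit_economics(price, mem_gb, bw_gb_s)
    print(f"{name}: ${per_gb}/GB, ${per_gb_s}/GB/s")
```

With those assumed prices, the Mac Mini comes out roughly 10x cheaper per GB of memory and roughly 3x cheaper per GB/s of bandwidth, matching the figures in the post.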

Thanks for the mention TBPN! We're working on making this run at >40 tok/sec on 2 Mac Studios (the full, unquantized Kimi K2.5). I'm already running Kimi K2.5 myself locally with Claude Code and it's impressive - very close to Opus level.

40,471 views • 5 months ago

Speed running my home AI cluster running distributed inference across 2 MacBooks and 2 Mac Minis. exo intern displays a real-time network topology as devices discover each other over the local network. Code is open source 👇

110,105 views • 2 years ago

Latest Fireship video featured my run of DeepSeek R1 on M4 Mac Minis. Apple Silicon dominates in memory/bw unit economics, ideal for huge MoE models like R1 at batch_size=1 (the real-world use-case). SOTA AI may come from China, but it will run on American hardware.

Alex Cheema - e/acc

79,775 views • 1 year ago
