Alex Cheema's banner
Alex Cheema's profile picture

Alex Cheema

@alexocheema47,956 subscribers

building @exolabs | prev @UniOfOxford We're hiring: https://t.co/UlkApFndnH

Shorts

Running Kimi K2.5 on my desk. Runs at 24 tok/sec with 2 x 512GB M3 Ultra Mac Studios connected with Thunderbolt 5 (RDMA) using EXO Labs / MLX backend. Yes, it can run clawdbot.

Running Kimi K2.5 on my desk. Runs at 24 tok/sec with 2 x 512GB M3 Ultra Mac Studios connected with Thunderbolt 5 (RDMA) using EXO Labs / MLX backend. Yes, it can run clawdbot.

3,032,836 次观看

M4 Mac Mini AI Cluster Uses EXO Labs with Thunderbolt 5 interconnect (80Gbps) to run LLMs distributed across 4 M4 Pro Mac Minis. The cluster is small (iPhone for reference). It’s running Nemotron 70B at 8 tok/sec and scales to Llama 405B (benchmarks soon).

M4 Mac Mini AI Cluster Uses EXO Labs with Thunderbolt 5 interconnect (80Gbps) to run LLMs distributed across 4 M4 Pro Mac Minis. The cluster is small (iPhone for reference). It’s running Nemotron 70B at 8 tok/sec and scales to Llama 405B (benchmarks soon).

3,515,852 次观看

AGI at home Running DeepSeek R1 across my 7 M4 Pro Mac Minis and 1 M4 Max MacBook Pro. Total unified memory = 496GB. Uses EXO Labs distributed inference with 4-bit quantization. Next goal is fp8 (requires >700GB)

AGI at home Running DeepSeek R1 across my 7 M4 Pro Mac Minis and 1 M4 Max MacBook Pro. Total unified memory = 496GB. Uses EXO Labs distributed inference with 4-bit quantization. Next goal is fp8 (requires >700GB)

1,934,415 次观看

Running DeepSeek R1 on my desk Uses EXO Labs with Thunderbolt 5 interconnect (80Gbps) to run the full (671B, 8-bit) DeepSeek R1 distributed across 2 M3 Ultra 512GB Mac Studios (1TB total Unified Memory). Runs at 11 tok/sec. Theoretical max is ~20 tok/sec.

Running DeepSeek R1 on my desk Uses EXO Labs with Thunderbolt 5 interconnect (80Gbps) to run the full (671B, 8-bit) DeepSeek R1 distributed across 2 M3 Ultra 512GB Mac Studios (1TB total Unified Memory). Runs at 11 tok/sec. Theoretical max is ~20 tok/sec.

992,060 次观看

35.9 tok/sec on Windows 98 🤯 This is a 260K LLM with Llama-architecture. We also tried out larger models. Results in the blog post.

35.9 tok/sec on Windows 98 🤯 This is a 260K LLM with Llama-architecture. We also tried out larger models. Results in the blog post.

569,282 次观看

How much faster is the new MacBook Pro for AI inference? M4 Max is 27% faster with 72 tok/sec compared to 56 tok/sec of the M3 Max with MLX running Gemma 2 9B (4bit). The 27% speedup is the same with Llama-3.2-1b, Llama-3.2-3b and others. Next up: EXO Labs M4 cluster.

How much faster is the new MacBook Pro for AI inference? M4 Max is 27% faster with 72 tok/sec compared to 56 tok/sec of the M3 Max with MLX running Gemma 2 9B (4bit). The 27% speedup is the same with Llama-3.2-1b, Llama-3.2-3b and others. Next up: EXO Labs M4 cluster.

527,894 次观看

Running the full GLM 4.7 (8-bit) on 2 x 512GB M3 Ultra Mac Studios Runs at 19.8 tok/sec with EXO Labs MLX RDMA backend (h/t Awni Hannun) & tensor parallel

Running the full GLM 4.7 (8-bit) on 2 x 512GB M3 Ultra Mac Studios Runs at 19.8 tok/sec with EXO Labs MLX RDMA backend (h/t Awni Hannun) & tensor parallel

156,630 次观看

"It's on my todo to build my own stack of apple minis and run Llama on my own little cluster on my desk." - Garry Tan Quick AI homelab tutorial: - Install EXO Labs on each mini (open source, steps in README) - Make sure all the minis are on the same WiFi network or for faster/more reliable results, connected over Thunderbolt/Ethernet. Devices will auto-discover each other and auto-shard the LLMs. - Now you can chat to your cluster with your preferred LLM GUI (the tiny corp tinychat ships with exo)

"It's on my todo to build my own stack of apple minis and run Llama on my own little cluster on my desk." - Garry Tan Quick AI homelab tutorial: - Install EXO Labs on each mini (open source, steps in README) - Make sure all the minis are on the same WiFi network or for faster/more reliable results, connected over Thunderbolt/Ethernet. Devices will auto-discover each other and auto-shard the LLMs. - Now you can chat to your cluster with your preferred LLM GUI (the tiny corp tinychat ships with exo)

243,248 次观看

Preparing to run the new Llama 3.3 70B on my mac cluster. Downloading model shards onto 3 x M4 Pro Mac Mini and 1 x M3 Max MacBook Pro. AI cluster is connected by Gigabit ethernet switch with EXO Labs

Preparing to run the new Llama 3.3 70B on my mac cluster. Downloading model shards onto 3 x M4 Pro Mac Mini and 1 x M3 Max MacBook Pro. AI cluster is connected by Gigabit ethernet switch with EXO Labs

181,559 次观看

M4 Mac Mini AI cluster running Llama 3.3 70B Preliminary results show Gigabit ethernet switch (4 tok/sec) is slower than Thunderbolt-5 with EXO Labs Benchmarks including Thunderbolt-5 coming soonTM.

M4 Mac Mini AI cluster running Llama 3.3 70B Preliminary results show Gigabit ethernet switch (4 tok/sec) is slower than Thunderbolt-5 with EXO Labs Benchmarks including Thunderbolt-5 coming soonTM.

73,090 次观看

Burning an LLM onto a CD. Guess why? Best answer gets a mac mini cake.

Burning an LLM onto a CD. Guess why? Best answer gets a mac mini cake.

17,612 次观看

Real-time distributed inference monitoring is live on exo intern home AI cluster It comes with out of the box support for PrometheusMonitoring metrics and Grafana dashboards Helps diagnose performance bottlenecks in your exo cluster

Real-time distributed inference monitoring is live on exo intern home AI cluster It comes with out of the box support for PrometheusMonitoring metrics and Grafana dashboards Helps diagnose performance bottlenecks in your exo cluster

17,798 次观看

Videos

没有更多内容可加载