
Anemll
@anemll • 4,113 subscribers
ANEMLL (pronounced like "animal"): Artificial Neural Engine Machine Learning Library, an open-source project

Running a 400B model on iPhone! 0.6 t/s. Credit: Dan Woods @alexintosh Daniel Isaac Anemll
Anemll • 291,014 views • 2 months ago

~6.5-6.7 t/s for GLM 5.1 on M5 Max 128GB. Added “Dense” model export, so model load now takes only 5 s! Experts are streamed from SSD, so they are not pre-loaded. Added a direct SSD->Slot memory path, removed prefetch... Many dead-end experiments. See Export a “dense-only GGUF” and “Fast path” in tools/flashmob-sidecar/README.md. WIP branch for Flash-MoE-SSD.
Anemll • 29,421 views • 1 month ago

I’ve been asked whether an external SSD works. Here is an M4 Pro 24GB running MinMax 2.7 @ 7.7 fps, Unsloth AI quant IQ2_XXS @ 73GB, MOE_TOPK=4, --moe-slot-bank 48. It's using an ACASIS USB4v2 80 Gbps enclosure with a “budget” T710 1TB Gen5 SSD over a TB5 connection. I’m also testing different enclosures and SSDs. QD1/Q1T1 random-access performance seems to be the most critical factor. Llama.cpp experimental fork/branch with MinMax-2.7: Quants and benchmarks are here: ACASIS Official
Anemll • 18,597 views • 1 month ago

WIP: First attempt to speed up prefill for Flash-MoE. The original repo ran prefill token by token, without streamed experts. Added: batched linear attention + batched full attention (Flash Attention style) with custom Metal kernels. Without experts: 6.2x faster prefill (11 -> 68 tok/s). With experts at full-attn layers only: 1.9x faster (11 -> 20.5 tok/s), with the same output quality. Qwen3.5-397B, 4-bit, 209GB, M5 Max 128GB. 1/3
Anemll • 19,547 views • 2 months ago