Anemll

@anemll • 4,335 subscribers

ANEMLL (pronounced like "animal"): Artificial Neural Engine Machine Learning Library, an open-source project

Shorts

Qwen 3.5 0.8B with Gated DeltaNet attention is running on the Apple Neural Engine at ~56 t/s in LUT6 quantization, with some room for optimization left. It uses CoreML, Swift and IOSurface on an M4 Pro. It will slow down as we increase context, but not by much. I think the private API opens the way to integrating the ANE with GPU/MLX and possibly some MoE.

13,589 views
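For readers unfamiliar with the term, here is a minimal sketch of what lookup-table quantization does, assuming "LUT6" means each weight is stored as a 6-bit index into a 64-entry table of shared values. The function names (`build_lut`, `dequantize`) and the plain 1-D k-means are illustrative assumptions, not ANEMLL's actual conversion code.

```python
import random

def nearest(lut, w):
    """Index of the table entry closest to weight w."""
    return min(range(len(lut)), key=lambda j: abs(lut[j] - w))

def build_lut(weights, bits=6, iters=5):
    """Cluster weights into 2**bits shared values with 1-D k-means.

    Returns (lut, indices): the table and one table index per weight.
    """
    k = 1 << bits  # 6 bits -> 64 table entries
    ordered = sorted(weights)
    n = len(ordered)
    # Start the centroids at evenly spaced quantiles of the weights.
    lut = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(iters):
        sums = [0.0] * k
        counts = [0] * k
        for w in weights:
            j = nearest(lut, w)
            sums[j] += w
            counts[j] += 1
        # Move each centroid to the mean of its members; keep empty ones.
        lut = [sums[j] / counts[j] if counts[j] else lut[j] for j in range(k)]
    indices = [nearest(lut, w) for w in weights]
    return lut, indices

def dequantize(lut, indices):
    """Rebuild approximate weights from the table and the indices."""
    return [lut[i] for i in indices]

if __name__ == "__main__":
    random.seed(0)
    weights = [random.gauss(0.0, 1.0) for _ in range(2000)]
    lut, indices = build_lut(weights)
    restored = dequantize(lut, indices)
    err = sum(abs(a - b) for a, b in zip(weights, restored)) / len(weights)
    print(f"{len(lut)} table entries, mean abs error {err:.4f}")
```

Only the 6-bit indices and the small table need to be stored, which is where the memory saving comes from; production converters typically build tables per tensor or per group of channels rather than over one flat list.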

Videos

Running a 400B model on iPhone! 0.6 t/s. Credit: Dan Woods @alexintosh Daniel Isaac Anemll

292,728 views • 4 months ago

Here is gemma-4-26B-A4B-it on an A17 Pro chip with 8GB of memory (MacBook Neo): ~7 t/s running on AMX (the GPU is slower on A17). Gemma 4's expert is 2.3x larger than Qwen's. See Qwen 35B below.

142,797 views • 3 months ago

Initial DeepSeek v4 and v4 Flash support added to anemll-flash-llama.cpp! 🚀 An M5 Max 128GB can run the full DeepSeek-V4 with 1.6T params in original FP8/FP4 weights from SSD without requantization! Includes: one-click scripts to convert HF safetensors to dense + MoE sidecar (no GGUF needed); inference and server examples; prefill and decoding benchmarks. Branch and docs here: This is WIP, and I plan to add cleanup and more testing this week. Feedback welcome!

32,085 views • 2 months ago

Added a few tweaks to the alexintosh iOS port. ~7 t/s on iPhone 17 Pro. Also tested on Neo; looks like the A19 Pro SSD is way faster! The same prompt…

40,461 views • 4 months ago

Apple’s “LLM in a Flash” is definitely worth checking out. Going to 2-bit for the shared-expert MLP means disk I/O is no longer dominant. 14–15 tok/s from SSD is still wild for a ~400B MoE model streamed from storage. Qwen3.5-397B-A17B Credit: Dan Woods

35,664 views • 4 months ago

~6.5-6.7 t/s for GLM 5.1 on M5 Max 128GB. Added “Dense” model export; model load now takes only 5 s! Experts are streamed from SSD, so we do not preload them. Added a direct SSD->Slot memory path, removed prefetch... Many dead-end experiments. See Export a “dense-only GGUF” and “Fast path” in tools/flashmob-sidecar/README.md. WIP branch for Flash-MoE-SSD

29,421 views • 3 months ago

New mactop with ANE readout for M5 on macOS 27: confirmed! Just upgraded an M5 Max; it needs something different to get ANE :( Looking into it. If you have an M3U upgraded to 27, please test. Ivan Fioravanti ᯅ A simple command like this triggers ANE: fm respond --model system --stream 'make a game of Space invaders in PyGame'

12,608 views • 1 month ago

Ported OpenClaw🦞 to run on iOS, tvOS, and visionOS. No server needed. Still need to use an LLM provider though… Next step: local models… Hope Apple approves the TestFlight build. Will clean up the repo tomorrow to share.

33,840 views • 5 months ago

I’ve been asked whether an external SSD works. Here is an M4 Pro 24GB running MinMax 2.7 @ 7.7 fps, Unsloth AI quant IQ2_XXS @ 73GB, MOE_TOPK=4, --moe-slot-bank 48. It's using an ACASIS USB4v2 80 Gbps enclosure with a “budget” T710 1TB Gen5 SSD over a TB5 connection. I’m also testing different enclosures and SSDs. It seems QD1/Q1T1 random access is the most critical. Llama.cpp experimental fork/branch with MinMax-2.7: Quants and benchmarks are here: ACASIS Official

18,597 views • 3 months ago
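To make the QD1/Q1T1 remark concrete, here is a rough sketch of a queue-depth-1 random-read test: one thread issues one read at a time and waits for it before issuing the next. The function name `qd1_random_read` and its parameters are made up for illustration; it uses only Python's standard `os.pread` (Unix-like systems) and does not bypass the OS page cache, so results on a recently read file will look far better than the drive really is.

```python
import os
import random
import time

def qd1_random_read(path, block_size=4096, reads=1000, seed=0):
    """Read `reads` random aligned blocks, one at a time (queue depth 1)."""
    rng = random.Random(seed)
    blocks = max(1, os.path.getsize(path) // block_size)
    fd = os.open(path, os.O_RDONLY)
    try:
        total = 0
        start = time.perf_counter()
        for _ in range(reads):
            # Each read completes before the next is issued: QD1, one thread.
            offset = rng.randrange(blocks) * block_size
            total += len(os.pread(fd, block_size, offset))
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    iops = reads / elapsed if elapsed > 0 else float("inf")
    return {"reads": reads, "bytes": total, "seconds": elapsed, "iops": iops}

if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:
        result = qd1_random_read(sys.argv[1])
        print(f"{result['iops']:.0f} IOPS, "
              f"{result['seconds'] / result['reads'] * 1e6:.1f} us per read")
```

Run it against a large file on the drive under test. For real measurements a dedicated tool with direct I/O is the better choice; the sketch only shows why per-read latency, rather than peak sequential bandwidth, bounds this access pattern.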

WIP: First attempt to speed up prefill for Flash-MoE. Original repo did token-by-token without streamed experts. Added: Batched linear attention + batched full attention (Flash Attention style) with custom Metal kernels. Without experts: 6.2x faster prefill (11 -> 68 tok/s) With experts at full-attn layers only: 1.9x faster (11 -> 20.5 tok/s) — same output quality Qwen3.5-397B, 4-bit, 209GB, M5 Max 128GB 1/3

19,572 views • 3 months ago
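The speedup figures in the prefill post follow directly from the quoted throughputs. A small sketch of the arithmetic, where the 1,000-token prompt is a hypothetical size chosen only to show what the difference means in wall-clock time:

```python
def speedup(baseline_tps, new_tps):
    """Ratio of new throughput to baseline throughput."""
    return new_tps / baseline_tps

def prefill_seconds(prompt_tokens, tps):
    """Time to process a prompt at a given prefill throughput."""
    return prompt_tokens / tps

if __name__ == "__main__":
    baseline = 11.0  # tok/s, token-by-token prefill (from the post)
    cases = {
        "without experts": 68.0,                    # tok/s (from the post)
        "experts at full-attn layers only": 20.5,   # tok/s (from the post)
    }
    prompt_tokens = 1000  # hypothetical prompt length
    for name, tps in cases.items():
        print(f"{name}: {speedup(baseline, tps):.1f}x, "
              f"{prefill_seconds(prompt_tokens, tps):.1f} s "
              f"vs {prefill_seconds(prompt_tokens, baseline):.1f} s")
```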
