
left curve dev
@leftcurvedev_ • 6,210 subscribers
low iq, high vram — sharing local ai and coding stuff

Anyone with an 8GB or 12GB VRAM setup needs to understand that "-ncmoe" is the key flag for boosting performance in llama.cpp.

Here are my results for Qwen3.6 35B A3B with a 64k q8_0 context on an 8GB RTX 3070 Ti:

⚪️ no flag → 8.7 tok/s, RAM: 13.6GB & VRAM: 7.8GB
🔴 -ncmoe 35 → 27.5 tok/s, RAM: 12.1GB & VRAM: 4.3GB
🟢 -ncmoe 30 → 32.5 tok/s, RAM: 12GB & VRAM: 5.6GB
🔵 -ncmoe 25 → 40.9 tok/s, RAM: 12GB & VRAM: 6.9GB

Please note that the RAM and VRAM figures are the total usage of a Windows PC with the model running. This is my friend's setup: 8GB VRAM and 16GB RAM. You can boost performance by switching to Linux, just something to keep in mind.

Basically, this flag keeps the MoE experts of the first X layers on your CPU + RAM instead of eating all your VRAM straight away. It's a smart hybrid offload that lets you run bigger models without OOM while keeping the rest on your GPU for speed.

As the data shows, there's a sweet spot. Lowering the value from 35 to 25 raises speed by about 50% (27.5 → 40.9 tok/s) because more of the expert layers sit on your GPU (look at the VRAM usage). The key is to play around with the number and fit as much as possible into VRAM, while leaving 800MB to 1GB of headroom to avoid VRAM stress.

↓ server flags below
left curve dev • 163,680 views • 1 month ago
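
For anyone who wants to redo the arithmetic on the -ncmoe results above, here is a minimal Python sketch. The numbers are copied from the post; the helper names (`percent_gain`, `fastest_within_budget`) and the choice of a 1GB headroom are illustrative assumptions, not part of llama.cpp. Keep in mind the VRAM figures are total system usage, so treat the result as a rough guide only.

```python
# Figures copied from the post: (-ncmoe value, tok/s, total VRAM in GB).
# None stands for the run without the flag.
RUNS = [
    (None, 8.7, 7.8),
    (35, 27.5, 4.3),
    (30, 32.5, 5.6),
    (25, 40.9, 6.9),
]

CARD_VRAM_GB = 8.0  # RTX 3070 Ti from the post
HEADROOM_GB = 1.0   # assumption: upper end of the headroom the post suggests


def percent_gain(old_tps, new_tps):
    """Relative speed change in percent."""
    return (new_tps / old_tps - 1.0) * 100.0


def fastest_within_budget(runs, card_gb, headroom_gb):
    """Return the fastest run whose VRAM use leaves the requested headroom."""
    budget = card_gb - headroom_gb
    fitting = [run for run in runs if run[2] <= budget]
    return max(fitting, key=lambda run: run[1]) if fitting else None


speeds = {n: tps for n, tps, _ in RUNS}
print(f"-ncmoe 35 -> 25: {percent_gain(speeds[35], speeds[25]):+.1f}%")

best = fastest_within_budget(RUNS, CARD_VRAM_GB, HEADROOM_GB)
print("fastest run within the VRAM budget:", best)
```

Swap in your own measurements and card size; the selection rule is simply "fastest run that still leaves the headroom you want".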

I got about 1.7x the speed while only using +1GB VRAM with the new MTP update in llama.cpp 🤯

You need to add these flags to start using it:

--spec-type draft-mtp \
--spec-draft-p-min 0.75 \
--spec-draft-n-max 2

My results with Qwen3.6 27B on a single RTX 5080 ↓

⚪️ no flag (without MTP) → 54.3 tok/s with 13.26GB VRAM
🔵 --spec-draft-n-max 2 → 90.7 tok/s with 14.29GB VRAM
🔴 --spec-draft-n-max 2 --spec-draft-p-min 0.75 → 93.9 tok/s with 14.30GB VRAM
🟢 --spec-draft-n-max 6 --spec-draft-p-min 0.75 → 93.9 tok/s with 14.87GB VRAM

Increasing to 6 draft tokens didn't help on my setup for some reason: same speed, about 0.6GB more VRAM. I made sure to test with a low context length to have enough headroom and eliminate the risk of VRAM stress.

From my understanding:

1) The speed gains are very task-dependent. You need to test across a wide range of tasks to get a realistic idea of the benefits.
2) We're already running heavily quantized GGUF models (Q3, Q4, Q6, etc.), so we already benefit from strong speed/performance thanks to the reduced size. That's why some people are seeing little to no improvement compared to MLX or other quantized versions.

The progress over the past few days has been insane, to say the least. However, MTP consumes noticeably more VRAM. For me, 16GB just isn't enough to use MTP and still run a good context size. Time to upgrade lads, 24GB+ users are eating GOOD today 🔥

Full setup below ↓
left curve dev • 28,039 views • 22 days ago
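
To put the MTP numbers above side by side, here is a small Python sketch that computes the speedup over the baseline and the extra VRAM for each run. The figures come straight from the post; the `compare` helper and the short run labels are illustrative assumptions.

```python
# Figures copied from the post: (label, tok/s, VRAM in GB).
BASELINE = ("no MTP", 54.3, 13.26)
MTP_RUNS = [
    ("n-max 2", 90.7, 14.29),
    ("n-max 2, p-min 0.75", 93.9, 14.30),
    ("n-max 6, p-min 0.75", 93.9, 14.87),
]


def compare(baseline, run):
    """Return (label, speedup factor, extra VRAM in GB) versus the baseline."""
    _, base_tps, base_vram = baseline
    label, tps, vram = run
    return label, round(tps / base_tps, 2), round(vram - base_vram, 2)


for run in MTP_RUNS:
    label, speedup, extra_vram = compare(BASELINE, run)
    print(f"{label}: {speedup}x speed, +{extra_vram}GB VRAM")
```

On these numbers, going from 2 to 6 draft tokens costs more VRAM without any extra speed, which matches what the post reports.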

I know some bros want to see what happens when we push the value even further, so here we go:

⚪️ -ncmoe 25 → 41.8 tok/s, RAM: 12GB & VRAM: 6.9GB
🔴 -ncmoe 23 → 43.8 tok/s, RAM: 12.2GB & VRAM: 7.4GB
🟢 -ncmoe 21 → 38.6 tok/s, RAM: 12.4GB & VRAM: 7.8GB
🔵 -ncmoe 19 → 19.8 tok/s, RAM: 13.8GB & VRAM: 7.8GB

As you can see, there's a sweet spot with the VRAM usage. Once VRAM tops out (7.8GB at both 21 and 19), lowering the value further only hurts: at 19, speed drops by more than half while RAM usage climbs. Play around and monitor to find the right value for your setup. You can use the llama.cpp web UI to monitor speeds easily.

Sweet spot seems to be 23 to 25 for 8GB VRAM ✅

40+ tok/s for Qwen3.6 35B with a 64k q8_0 context on an 8GB card is very impressive using just base llama.cpp, and we didn't even try Turboquant, MTP or Dflash yet! I'll focus on these next 👀

(Server flags and setup in the quoted tweet)
left curve dev • 18,049 views • 1 month ago
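
If you want to run this kind of sweep yourself, a small Python helper can generate one command per -ncmoe value. This is only a sketch: `model.gguf` is a placeholder path, `sweep_commands` is a hypothetical helper, and the rest of the server flags from the actual setup are not reproduced here, so pass your own through `extra_args`.

```python
import shlex


def sweep_commands(model_path, values, extra_args=()):
    """Build one llama-server command line per -ncmoe value."""
    commands = []
    for n in values:
        # -m selects the model file; -ncmoe keeps the MoE experts of the
        # first n layers on the CPU. Everything else comes from extra_args.
        argv = ["llama-server", "-m", model_path, "-ncmoe", str(n), *extra_args]
        commands.append(shlex.join(argv))
    return commands


# Placeholder model path; replace it with your own GGUF file.
for command in sweep_commands("model.gguf", [25, 23, 21, 19]):
    print(command)
```

Run each command in turn, note tok/s and VRAM for each value, and stop lowering the number once VRAM is full and speed starts to drop.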

🥊 Time for a new fight

Qwen3.5 397B A17B vs Qwen3.6 27B

🌸 "Cherry Blossom" (↓ prompt below)

Using OpenRouter for 397B
Running 27B locally on a single RTX 5080

Wow! 🤯 I'm sure this one is going to divide opinions. 397B feels cinematic with the flare and shadows it added, while 27B is just beautiful: it has a completely different color palette, and the leaves are falling from the branches as asked.

It's really tough to decide which model is better. I was about to call it a tie, but then I remembered that 397B needs 180GB VRAM to run locally 😅

Don't get me wrong, the model is amazing, but right now the VRAM it requires doesn't feel justified. It's simply not 10× better than 27B.

We'll have to dig deeper with more prompts to be sure. Let's see what each one has in store for us. What do you think?
left curve dev • 21,075 views • 1 month ago