left curve dev

@leftcurvedev_ • 6,251 subscribers

low iq, high vram — sharing local ai and coding stuff

Shorts

local ai haters in disbelief right now

172,261 views

So right now, basically anyone with a 16GB VRAM card can go on Hugging Face and download a model which BEATS Claude Sonnet 4.5, all running locally 🤯 LOOK AT THE WATER PARTICLES?? What are they doing in these labs, man? What is this Unsloth AI UD quantization magic?? I'm running the lowest Q3 here and it oneshots everything 😭 Bros, what is happening?

90,204 views

Qwen3.6 35B A3B vs Claude Sonnet 4.5. Made them fight on the same prompt. 🌸 "Cherry blossom" challenge. Both models took exactly 41s to deliver, no retries. Oh boy 🤯 Crazier part is that I used Unsloth AI's UD-IQ3_XXS gguf to make it fit in 16GB VRAM...

69,512 views

🥊 Qwen3.6 35B A3B vs Qwen3.6 27B. Made them fight on the same prompt. 🌊 Ocean Waves, a canvas challenge. 35B took 39.7s (at 142 tok/s), 27B took 111.6s (at 50 tok/s). Both models are really good. The fact that the 35B produced such a strong result in under 40 seconds is seriously impressive. But when you look at 27B's output... it's actually so much better. It added clouds, beautiful foam particles and splashing effects, a perfectly straight horizon, more realistic blinking stars, and a nice sun reflection on the water. 🏆 Qwen3.6 27B: slower, but the result slaps. Absolutely amazing that both models run smoothly on 16GB VRAM. Used Unsloth AI GGUFs here, with the Q3_K_S quant to make them fit on the hardware 🔥

60,705 views

Qwen3.6 35B A3B vs Claude Sonnet 4.5. 🎸 Guitar Hero, another HTML canvas challenge. Went pretty hard on this one ngl, there are a lot of elements and visual effects to stitch together. Best result I could get out of 5 runs on both. Qwen set the stage on fire 🤘🔥 ↓ prompt below

50,024 views

Videos

0:31

2026 is the year of local AI

left curve dev

230,706 views • 2 months ago

0:29

it's going to get harder to justify a $200 claude sub in the next few months

left curve dev

229,555 views • 2 months ago

0:37

Anyone with an 8GB or 12GB VRAM setup needs to understand that "-ncmoe" is the key flag to boost performance on llama.cpp.

Here are my results for Qwen3.6 35B A3B, with 64k q8_0 context on an 8GB RTX 3070Ti:

⚪️ no flag → 8.7 tok/s, RAM: 13.6GB & VRAM: 7.8GB
🔴 -ncmoe 35 → 27.5 tok/s, RAM: 12.1GB & VRAM: 4.3GB
🟢 -ncmoe 30 → 32.5 tok/s, RAM: 12GB & VRAM: 5.6GB
🔵 -ncmoe 25 → 40.9 tok/s, RAM: 12GB & VRAM: 6.9GB

Please note the RAM and VRAM usage you see is the total usage of a Windows PC with the model running. My friend's setup: 8GB VRAM and 16GB RAM. You can boost performance by switching to Linux, just something to keep in mind.

Basically, this flag keeps the MoE experts in the first X layers on your CPU + RAM, instead of eating all your VRAM straight away. It's a smart hybrid offload approach that lets you run bigger models without OOM while keeping the rest on your GPU for speed.

As we can see in the data, there's a sweet spot. When we lower it from 35 to 25, speed goes up by about 50% (27.5 → 40.9 tok/s) because more layers sit on your GPU (look at the VRAM usage). The key here is to play around with the number and fit as much as possible in your VRAM. The goal is to leave 800MB to 1GB of headroom to avoid stress.

↓ server flags below

left curve dev

166,063 views • 2 months ago
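The "sweet spot" reasoning in the -ncmoe post can be sketched in a few lines of Python. This is only an illustration built on the four measurements quoted in the post: the `pick_setting` and `speedup_pct` helpers and the 1GB headroom value are assumptions made for the example, not part of llama.cpp, and the VRAM figures are whole-system totals as the post notes.

```python
# Measurements copied from the post: (-ncmoe value or None, tok/s, total VRAM in GB).
RESULTS = [
    (None, 8.7, 7.8),   # no flag
    (35, 27.5, 4.3),
    (30, 32.5, 5.6),
    (25, 40.9, 6.9),
]


def pick_setting(results, vram_total_gb, headroom_gb):
    """Return the fastest measured run that still leaves the requested headroom.

    Hypothetical helper: it only ranks runs you have already measured.
    """
    budget = vram_total_gb - headroom_gb
    fitting = [run for run in results if run[2] <= budget]
    return max(fitting, key=lambda run: run[1]) if fitting else None


def speedup_pct(slow_tok_s, fast_tok_s):
    """Relative speed gain in percent, rounded to one decimal."""
    return round((fast_tok_s / slow_tok_s - 1) * 100, 1)


# 8GB card, keeping about 1GB free as the post suggests.
best = pick_setting(RESULTS, vram_total_gb=8.0, headroom_gb=1.0)
gain_35_to_25 = speedup_pct(27.5, 40.9)

print("best measured setting:", best)
print("gain from -ncmoe 35 to -ncmoe 25:", gain_35_to_25, "%")
```

With these numbers, the no-flag run is rejected because 7.8GB leaves almost no headroom, and -ncmoe 25 wins among the runs that fit. The same loop works for a 12GB card once you have your own measurements to feed in.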

I nearly 2x'd the speed while only using +1GB VRAM with the new MTP update in llama.cpp 🤯

You need to add these flags to start using it:

--spec-type draft-mtp \
--spec-draft-p-min 0.75 \
--spec-draft-n-max 2

My results with Qwen3.6 27B on a single RTX 5080 ↓

⚪️ no flag (without MTP) → 54.3 tok/s with 13.26GB VRAM
🔵 --spec-draft-n-max 2 → 90.7 tok/s with 14.29GB VRAM
🔴 --spec-draft-n-max 2 --spec-draft-p-min 0.75 → 93.9 tok/s with 14.30GB VRAM
🟢 --spec-draft-n-max 6 --spec-draft-p-min 0.75 → 93.9 tok/s with 14.87GB VRAM

Increasing to 6 draft tokens didn't help on my setup for some reason. I made sure to test with a low context length to have enough headroom and eliminate the risk of VRAM stress.

From my understanding:

1) The speed gains are very task-dependent. You need to test across a wide range of tasks to get a realistic idea of the benefits.
2) We’re already running heavily quantized GGUF models (Q3, Q4, Q6, etc.), so we already benefit from strong speed/performance thanks to the reduced size. That’s why some people are seeing little to no improvement compared to MLX or other quantized versions.

The progress over the past few days has been insane, to say the least. However, MTP now consumes significantly more VRAM. Personally, I find 16GB just isn't enough to use MTP and run it with a good context size. Time to upgrade lads, 24GB+ users are eating GOOD today 🔥

Full setup below ↓
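To put the MTP numbers side by side, here is a small Python sketch that turns the four runs from the post into a speedup factor and an extra-VRAM figure against the no-flag baseline. The flag strings are used only as dictionary labels copied from the post, and the `compare` helper is a hypothetical name for this example. Nothing here calls llama.cpp.

```python
# Baseline and MTP runs copied from the post: (tok/s, VRAM in GB).
BASELINE = (54.3, 13.26)
RUNS = {
    "--spec-draft-n-max 2": (90.7, 14.29),
    "--spec-draft-n-max 2 --spec-draft-p-min 0.75": (93.9, 14.30),
    "--spec-draft-n-max 6 --spec-draft-p-min 0.75": (93.9, 14.87),
}


def compare(baseline, run):
    """Return (speedup factor, extra VRAM in GB) versus the baseline run."""
    base_speed, base_vram = baseline
    speed, vram = run
    return round(speed / base_speed, 2), round(vram - base_vram, 2)


SUMMARY = {flags: compare(BASELINE, run) for flags, run in RUNS.items()}

for flags, (speedup, extra_vram) in SUMMARY.items():
    print(f"{flags}: {speedup}x speed, +{extra_vram}GB VRAM")
```

On these figures the best runs land at roughly 1.7x the baseline for about 1GB of extra VRAM, and raising the draft count from 2 to 6 costs a further 0.57GB with no measured gain. Since the post stresses that results are task-dependent, treat this as a template for your own measurements, not a general result.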