Video wird geladen...

Video konnte nicht geladen werden

Zur Startseite

What compression looks like on vLLM. Same Gemma 4 31B. Red Hat AI's quantized version runs at nearly 2x tokens/sec, half the memory, 99%+ accuracy retained. Open source. Quantized with LLM Compressor. Links in comments. 🙏 Sawyer Bowerman for the 2-minute demo.

34,136 Aufrufe • vor 2 Monaten •via X (Twitter)

0 Kommentare

Keine Kommentare verfügbar

Kommentare vom Original-Post werden hier angezeigt

Ähnliche Videos

a new 8GB VRAM GPU dense Local LLM leader was born yesterday runs on: RTX 4060 / RTX 3070 / RTX 2080. any 8GB card Qwen 3.5 9B (dense) was the go to for 6-8GB VRAM builds. Gemma 4 12B QAT (dense) just changed that. same llama.cpp + cuda 13.2. i7 12700H. 16GB RAM. same -ngl 99 flags. same 48k context. unsloth gemma-4-12b-it-Q4_K_M.gguf → 15 tok/sec @ 48k ctx unsloth gemma-4-12B-it-qat-UD-Q4_K_XL.gguf → 32 tok/sec @ 48k ctx → 26 tok/sec @ 64k ctx 64k context is a big deal. Hermes 3 agent requires 64k minimum to run. you're now getting full hermes compatible context on a budget consumer GPU at 26 tok/sec locally. 2.1x faster on identical hardware. and here's the part that breaks your brain: the QAT-UD-Q4_K_XL is actually SMALLER than the Q4_K_M "XL" why? QAT = Quantization Aware Training Google didn't train the model first and compress it later they trained it to be quantized from day one the weights already know how to survive low precision that's why you get more quality per byte llamacpp flags: -m gemma-4-12B-it-qat-UD-Q4_K_XL.gguf -cnv -ngl 99 -c 48000 -v fits in 8GB VRAM clean. no API. no cloud. no subscription. and this isn't even the MTP variant yet Gemma-4-E2B QAT runs on 3GB RAM, E4B on 5GB, 12B on 7GB, 26-A4B on 15GB and 31B on 18GB. I have benchmarked the 26b and 31b qat as well on a single RTX 4090, checkout the comments for details. If you have a 6GB or 8GB VRAM GPU, post your numbers. more benchmarks and configs coming soon

Alok

259,993 Aufrufe • vor 24 Tagen