🔥 Thrilled to have worked with Google AI Developers on day-0 MLX support for Gemma 3 QAT!

QAT optimizes models during training by simulating low-precision operations, delivering similar performance to FP16 and dramatic memory savings when quantised:

• Gemma 3 27B: 54GB → 14.1GB (74% reduction)
• Gemma 3...

11,374 views • 1 year ago • via X (Twitter)
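
The memory figures quoted in the post can be checked with a few lines of arithmetic. The sketch below is only an illustration and rests on assumptions that are not in the post: the model is taken to have exactly 27 billion parameters, FP16 is taken as 2 bytes per weight, and 1 GB is taken as 10^9 bytes. The helper names are hypothetical.

```python
def fp16_size_gb(params_billions: float) -> float:
    """Approximate weight size in GB at 2 bytes per parameter (FP16).

    Assumes 1 GB = 1e9 bytes, so billions of parameters map directly to GB.
    """
    return params_billions * 2.0


def reduction_percent(before_gb: float, after_gb: float) -> float:
    """Relative size reduction, in percent."""
    return (before_gb - after_gb) / before_gb * 100.0


def effective_bits_per_weight(size_gb: float, params_billions: float) -> float:
    """Average number of bits stored per parameter for a given file size."""
    return size_gb * 8.0 / params_billions


if __name__ == "__main__":
    before = fp16_size_gb(27)  # assumed parameter count: 27 billion
    after = 14.1               # quantised size quoted in the post, in GB
    print(f"FP16 size: {before:.1f} GB")
    print(f"Reduction: {reduction_percent(before, after):.0f}%")
    print(f"Effective bits per weight: {effective_bits_per_weight(after, 27):.2f}")
```

Under these assumptions the numbers line up with the post: 54 GB at FP16 and a reduction of roughly 74%. The average comes out slightly above 4 bits per weight, which would be consistent with a 4-bit format that also stores some extra metadata, although the post does not say which format was used.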

10 comments

Prince Canuma • 1 year ago

Thanks @osanseviero @reach_vb and the teams behind this amazing release ❤️

Mag Mario Jembrih • 1 year ago

@googleaidevs not seeing it in LM Studio, will it show up there too?

Prince Canuma • 1 year ago

@googleaidevs Probably an update? @yagilb

Q • 1 year ago

@googleaidevs Thank you for the efforts. The Kimi thinking was also great and fast, but it had some registration problems that needed to be bypassed with custom code; if I'm not wrong, they still exist. But cheers for the efforts and speed.

Prince Canuma • 1 year ago

@googleaidevs Could you share more about it? Perhaps open an issue

Ljubomir Josifovski • 1 year ago

Excellent stuff! About double the speed of a prior model of similar size, from recollection. 🤩 Thanks for that! 🙏 Would you know if we should expect Speculative Decoding in @lmstudio to work? I got the 27b, then downloaded the 1b and 4b versions too. I'm trying to get them to show up in the "Speculative Decoding" "Draft Model" "Select a compatible draft model" dropdown. So far no luck, none of them show up in the dropdown in @lmstudio. (on m2 mbp 96gb ram) (pic of the exact model versions below)

Prince Canuma • 1 year ago

@googleaidevs @lmstudio Not yet for VLMs, only if you use them as text models.

Jikku Jose • 1 year ago

@googleaidevs Wish there was an intermediate model between 27B & 12B, lots of cards in that range!

Joe Burnett • 1 year ago

@googleaidevs I can’t wait to check it out!

Related videos

a new 8GB VRAM GPU dense Local LLM leader was born yesterday

runs on: RTX 4060 / RTX 3070 / RTX 2080. any 8GB card

Qwen 3.5 9B (dense) was the go-to for 6-8GB VRAM builds. Gemma 4 12B QAT (dense) just changed that.

same llama.cpp + cuda 13.2. i7 12700H. 16GB RAM. same -ngl 99 flags. same 48k context.

unsloth gemma-4-12b-it-Q4_K_M.gguf → 15 tok/sec @ 48k ctx
unsloth gemma-4-12B-it-qat-UD-Q4_K_XL.gguf → 32 tok/sec @ 48k ctx → 26 tok/sec @ 64k ctx

64k context is a big deal. Hermes 3 agent requires 64k minimum to run. you're now getting full hermes compatible context on a budget consumer GPU at 26 tok/sec locally.

2.1x faster on identical hardware.

and here's the part that breaks your brain: the QAT-UD-Q4_K_XL is actually SMALLER than the Q4_K_M "XL"

why? QAT = Quantization Aware Training

Google didn't train the model first and compress it later
they trained it to be quantized from day one
the weights already know how to survive low precision
that's why you get more quality per byte

llama.cpp flags:
-m gemma-4-12B-it-qat-UD-Q4_K_XL.gguf -cnv -ngl 99 -c 48000 -v

fits in 8GB VRAM clean. no API. no cloud. no subscription.

and this isn't even the MTP variant yet

Gemma-4-E2B QAT runs on 3GB RAM, E4B on 5GB, 12B on 7GB, 26-A4B on 15GB and 31B on 18GB.

I have benchmarked the 26b and 31b qat as well on a single RTX 4090, check out the comments for details.

If you have a 6GB or 8GB VRAM GPU, post your numbers. more benchmarks and configs coming soon

Alok

259,993 views • 25 days ago
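
The idea that the weights are "trained to be quantized" is usually implemented with a quantize-then-dequantize round trip, often called fake quantization, applied during the forward pass. The sketch below shows only that round trip, in pure Python. It is an illustration under assumptions, not Google's recipe: it uses symmetric 4-bit quantization with a single scale for the whole list, and the function name `fake_quantize` is hypothetical.

```python
def fake_quantize(weights, bits=4):
    """Round weights to a low-precision grid, then map them back to floats.

    Assumption: symmetric quantization with one scale for the whole list.
    Real schemes typically use per-group or per-channel scales.
    """
    qmax = 2 ** (bits - 1) - 1            # 7 for signed 4-bit
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)              # nothing to scale
    scale = max_abs / qmax                # step size of the grid
    result = []
    for w in weights:
        q = round(w / scale)              # nearest integer level
        q = max(-qmax, min(qmax, q))      # clamp to the representable range
        result.append(q * scale)          # back to floating point
    return result


if __name__ == "__main__":
    weights = [0.7, -0.35, 0.1, 0.02, -0.7]
    quantized = fake_quantize(weights, bits=4)
    for original, rounded in zip(weights, quantized):
        print(f"{original:+.3f} -> {rounded:+.3f} (error {abs(original - rounded):.3f})")
```

In quantization-aware training, the forward pass sees the rounded values while the optimizer keeps updating the full-precision weights, so the model learns weights that lose little accuracy once they are actually stored in low precision.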