Loading video...

Video Failed to Load

Go Home

Sentra just killed Google Research's TurboQuant. SpectralQuant — 5.95× KV cache compression on Mistral 7B at +7.5% perplexity overhead. TurboQuant at the same compression: +22%. 3× less degradation. 15-second calibration. One per-model, then drop-in for any HuggingFace LLM, ViT, ESM, AlphaFold Evoformer, or VideoMAE. Check out the findings and...

59,538 views • 1 month ago •via X (Twitter)

0 Comments

No comments available

Comments from the original post will appear here

Related Videos

Google's Gemma 4 26B A4B QAT hits 25+ tokens/sec and 320+ tokens/sec prefill on 8 GB VRAM (RTX 4060) + 16 GB RAM using TurboQuant Prefill just went from 200 → 320+ tok/s on the same 8GB card. 1.6x, no new hardware, no new quant, just a KV cache trick stacked on top of the Gemma 4 26B MoE setup from a few days ago. A few days ago I posted Gemma 4 26B A4B hitting 28 tok/s decode on 8GB VRAM using native MTP. prefill was stuck around 200 tok/s. fair callout by the community. So today I tested something I'd already been meaning to try: TheTom/llama-cpp-turboquant, the TurboQuant KV cache fork by Tom Turney (Tom Turney). (github link in the comments) thanks to him, the fork just got resynced to mainline, so MTP + TurboQuant now run together cleanly (I didnt see any meaningful gains by using MTP with this setup though but you can try). The flags (No MTP): -m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf -cnv -c 64000 --cache-type-k q8_0 --cache-type-v turbo3 Results on the same RTX 4060 8GB, tested with a 27k token prompt at 64k context loaded: Prefill: 200 tok/s → 320+ tok/s Decode: stayed above 25 tok/s (without MTP) Why it works: TurboQuant uses walsh hadamard rotation + polar quantization on the KV cache. keys are sensitive to compression, values aren't much, so it splits the difference: K stays at q8_0, V drops to turbo3 (~3 bits). bonus from the memory savings: same 8GB card can now stretch to 100-120k context with minimal decode penalty. It should now be snappier with any agent harness such as hermes agent without compromise on intelligence. If you're already running Gemma 4 on a small card, this stacks on top for free. Try --cache-type-k q8_0 --cache-type-v turbo3 on your setup and report back what your prefill/decode split looks like. unsloth model gguf and llama.cpp turboquant fork links in the comments. what's your prefill number before vs after?

Alok

119,821 views • 15 days ago

andrej karpathy spent two hours teaching one thing: tokens are the atom of llms. tokenization is at the heart of every llm weirdness you've ever debugged. [watch the 15-min clip below. then run the 7-day playbook] ↓ save this before everyone copies it learn how the tokenizer works. understand how your llm actually consumes input. then run the engineering roadmap that took one production agent from $4,800/mo to $620/mo in 7 days. 87% reduction. no model swap. no framework migration. no quality drop on the eval set. token cost in 2026 is an engineering discipline. every line of your system prompt is rent you pay forever. what was eating the budget: → a single forgotten cron job ate 47% of one team's bill. they turned it off on a tuesday and the bill dropped before they wrote any optimization code. → anthropic ships a 90% discount on cache reads. one config line, cache_control ephemeral, break-even after one hit. most teams cache the volatile parts of the prompt and watch their hit rate sit at 12%. → one production agent went from 14,500 tokens of context overhead per turn to 850. a 94% drop. output quality held within 2% of the uncompressed baseline. → 60% of agent calls are haiku-tier work running on opus rates. classify the task first. pick the model second. → retry loops are the silent killer. no MAX_STEPS bound, one bad search query, $14 burned in a single session. one team traced 38% of their bill to this single pattern. karpathy gave you the atom. the playbook below gives you the harness. watch the lecture. read the playbook ↓

Rohit

73,031 views • 1 month ago