Sentra just killed Google Research's TurboQuant. SpectralQuant: 5.95× KV cache compression on Mistral 7B at +7.5% perplexity overhead. TurboQuant at the same compression: +22%. Roughly 3× less degradation. 15-second calibration, once per model, then drop-in for any HuggingFace LLM, ViT, ESM, AlphaFold Evoformer, or VideoMAE. Check out the findings and...

Ashwin Gopinath

6,315 subscribers

59,538 views • 1 month ago •via X (Twitter)

News & Politics Science & Technology

Related Videos

i just beat Google DeepMind's turboquant. introducing Shard: 10x KV cache compression on Llama-3.1-8B. zero quality loss
- 10x @ 8K context, 11.2x @ 32K
- NIAH recall 1.000 across 4K-32K
- LongBench Δ ≈ 0 vs FP16
turboquant tops out at 4-6x at the same quality. we doubled it. read more: Kirri

Krish

154,692 views • 1 month ago

I implemented Google Research's TurboQuant as a CUDA-native compression engine on Blackwell B200. 5x KV cache compression on Qwen 2.5-1.5B, near-lossless attention scores, generating live from compressed memory. 5 custom cuTile CUDA kernels ft:
- fused attention (with QJL corrections)
- online softmax
- on-chip cache decompression
- pipelined TMA loads
Try it out:
s/o Bryce, the CUDA Colonel and the cuTile team at NVIDIA for lending me Blackwell GPU access :) cc sunny madra Gavin

ani

807,103 views • 3 months ago
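
The "online softmax" kernel named in the post above refers to a well-known single-pass formulation of softmax-weighted sums. A minimal CPU sketch in plain Python follows. It is not the author's cuTile kernel, and the function names are hypothetical; it only illustrates how a running maximum and a running denominator let attention be accumulated one key at a time without overflow.

```python
import math


def online_softmax_weighted_sum(scores, values):
    """Single pass over (score, value) pairs; returns softmax(scores) @ values."""
    running_max = float("-inf")
    denom = 0.0
    acc = [0.0] * len(values[0])
    for s, v in zip(scores, values):
        new_max = max(running_max, s)
        # Rescale everything accumulated so far to the new reference maximum.
        rescale = math.exp(running_max - new_max)
        w = math.exp(s - new_max)
        denom = denom * rescale + w
        acc = [a * rescale + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]


def two_pass_reference(scores, values):
    """Conventional softmax: find the max first, then normalize."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / z for j in range(dim)]


if __name__ == "__main__":
    scores = [1.0, 3.0, -2.0, 0.5]
    values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0]]
    print(online_softmax_weighted_sum(scores, values))
    print(two_pass_reference(scores, values))
```

Both functions should agree up to floating-point rounding; the single-pass version is the form that fuses naturally into a streaming attention loop.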

Your local AI just got up to 5x more memory. Same model. Same device. Nearly zero accuracy loss. QVAC SDK 0.12.0 integrates TurboQuant - Google Research's latest memory optimisation algorithm. What is TurboQuant? The KV cache is the memory your model uses to track a conversation. As context grows, it fills up fast. 32K tokens. 64K. Game over. TurboQuant compresses it up to 5x with no accuracy loss. What does it unlock for you? Your app had a 16K token ceiling? It's now 96K. On the same device. Just update the QVAC SDK to get up to 5x more efficiency. No code changes. All from one SDK. The TurboQuant integration unlocks sovereign intelligence for more people, on more devices. Learn more →

QVAC

15,799,195 views • 1 month ago

Yesterday we announced that the QVAC SDK update unlocked up to 5x more context on your device thanks to TurboQuant. Today, we’ll go through how we got there. TurboQuant (Google Research, ICLR 2026) is a two-stage KV-cache compression algorithm.
Stage 1 - PolarQuant: convert KV vectors from Cartesian (x, y, z...) to polar coordinates. Angles compress predictably down to 3-4 bits.
Stage 2 - QJL: 1-bit Johnson-Lindenstrauss correction. Cleans up residual error.
Total: ~4-5 bits per value. No retraining. No calibration.
QVAC ported it to Vulkan inside qvac-fabric-llm.cpp. Currently, TurboQuant is supported only on AMD & NVIDIA GPUs; support for iOS, Android & Apple Silicon is coming next. Full algorithm walkthrough + benchmarks + code examples →

QVAC

14,467,728 views • 1 month ago
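
The 1-bit Johnson-Lindenstrauss idea behind Stage 2 in the post above can be illustrated with a sign-of-random-projection estimator. The sketch below is a simplified illustration, not QVAC's Vulkan port or Google's implementation: it applies the estimator directly to a key vector rather than to a quantization residual, and every name in it (`gaussian_matrix`, `encode_key`, `estimate_dot`) is hypothetical. It stores one sign bit per projection plus the key's norm, and recovers an unbiased estimate of the query-key dot product.

```python
import math
import random


def gaussian_matrix(m, d, seed=0):
    """m x d matrix of i.i.d. standard normal entries (the random projection)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]


def project(S, x):
    """Matrix-vector product S @ x."""
    return [sum(s * xj for s, xj in zip(row, x)) for row in S]


def encode_key(S, k):
    """Keep only the sign of each projection (1 bit each) plus the key's norm."""
    signs = [1 if p >= 0 else -1 for p in project(S, k)]
    norm = math.sqrt(sum(x * x for x in k))
    return signs, norm


def estimate_dot(S, q, signs, norm):
    """Estimate <q, k> from a full-precision query and the 1-bit key sketch.

    For Gaussian rows s, E[(s.q) * sign(s.k)] = sqrt(2/pi) * <q, k> / ||k||,
    so multiplying the average by sqrt(pi/2) * ||k|| removes the bias.
    """
    m = len(S)
    corr = sum(p * s for p, s in zip(project(S, q), signs))
    return math.sqrt(math.pi / 2) * norm * corr / m


if __name__ == "__main__":
    d, m = 16, 4000
    S = gaussian_matrix(m, d, seed=1)
    rng = random.Random(2)
    q = [rng.uniform(-1, 1) for _ in range(d)]
    k = [rng.uniform(-1, 1) for _ in range(d)]
    signs, norm = encode_key(S, k)
    print("exact   :", sum(a * b for a, b in zip(q, k)))
    print("estimate:", estimate_dot(S, q, signs, norm))
```

The estimate's error shrinks roughly as one over the square root of the number of projections, which is why it suits a cheap correction term rather than a standalone quantizer.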

Google's Gemma 4 26B A4B QAT hits 25+ tokens/sec and 320+ tokens/sec prefill on 8 GB VRAM (RTX 4060) + 16 GB RAM using TurboQuant
Prefill just went from 200 → 320+ tok/s on the same 8GB card. 1.6x, no new hardware, no new quant, just a KV cache trick stacked on top of the Gemma 4 26B MoE setup from a few days ago.
A few days ago I posted Gemma 4 26B A4B hitting 28 tok/s decode on 8GB VRAM using native MTP. Prefill was stuck around 200 tok/s. Fair callout by the community.
So today I tested something I'd already been meaning to try: TheTom/llama-cpp-turboquant, the TurboQuant KV cache fork by Tom Turney (github link in the comments). Thanks to him, the fork just got resynced to mainline, so MTP + TurboQuant now run together cleanly (I didn't see any meaningful gains from using MTP with this setup, but you can try).
The flags (No MTP):
-m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf -cnv -c 64000 --cache-type-k q8_0 --cache-type-v turbo3
Results on the same RTX 4060 8GB, tested with a 27k token prompt at 64k context loaded:
Prefill: 200 tok/s → 320+ tok/s
Decode: stayed above 25 tok/s (without MTP)
Why it works: TurboQuant uses Walsh-Hadamard rotation + polar quantization on the KV cache. Keys are sensitive to compression, values much less so, so it splits the difference: K stays at q8_0, V drops to turbo3 (~3 bits).
Bonus from the memory savings: the same 8GB card can now stretch to 100-120k context with minimal decode penalty. It should now be snappier with any agent harness such as hermes agent, without compromising on intelligence.
If you're already running Gemma 4 on a small card, this stacks on top for free. Try --cache-type-k q8_0 --cache-type-v turbo3 on your setup and report back what your prefill/decode split looks like. unsloth model gguf and llama.cpp turboquant fork links in the comments. What's your prefill number before vs after?

Alok

119,821 views • 15 days ago
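
The Walsh-Hadamard rotation mentioned in the post above can be sketched in a few lines. This is a generic orthonormal fast Walsh-Hadamard transform in plain Python, not code from the llama.cpp fork; the function name `fwht` is hypothetical, and the sketch leaves out the quantizer and anything else a real implementation adds. It shows the property that makes the rotation useful before low-bit quantization: a single outlier is spread evenly across all coordinates, and applying the transform again restores the input.

```python
import math


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(x)
    h = 1
    while h < n:
        # Butterfly step: combine elements that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]


if __name__ == "__main__":
    spiky = [8.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
    rotated = fwht(spiky)
    print("rotated :", rotated)        # energy spread over all 8 coordinates
    print("restored:", fwht(rotated))  # the orthonormal transform is its own inverse
```

Because the rotated vector has a much smaller dynamic range than the original, a fixed low-bit grid wastes fewer levels on a single large coordinate.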

We’re thrilled to open-source TriAttention! 🚀 🦞 Deploy OpenClaw (32B LLM) on a single 24GB RTX 4090 locally 💻Full code open-source & vLLM-ready for one-click deployment ⚡️ 2.5× faster inference speed & 10.7× less KV cache memory usage TriAttention is a novel KV cache compression method built on rigorous trigonometric analysis in the Pre‑RoPE space for efficient LLM long reasoning. Github Repo: Paper Link: Homepage:

Yukang Chen

197,268 views • 2 months ago

I used NotebookLM to study Google's new breakthrough, TurboQuant, using its Video Overview feature. It's the best learning tool in the world at the moment. TurboQuant: Redefining AI Efficiency with Extreme Compression. Google Research has introduced TurboQuant, a suite of advanced algorithms designed to dramatically compress the data used by large language models and search engines. By utilizing specialized techniques like PolarQuant and Quantized Johnson-Lindenstrauss, the system transforms complex information into a compact "shorthand" that requires significantly less memory. This innovation specifically addresses the key-value cache bottleneck, allowing AI models to process massive amounts of text faster without losing accuracy. Testing demonstrates that these methods can shrink data size by six times while actually increasing operational speed on high-end hardware. Ultimately, these theoretical advancements enable more efficient semantic search and high-performance AI applications at a global scale.

Emily

14,648 views • 3 months ago

3Blue1Brown’s new video explains why every LLM is actually a compression machine. everyone describes pre-training as “next token prediction” but that’s just the surface-level objective. in reality it is a means to making the most efficient text compressor. prediction and compression are two sides of the same coin. when you train the model to predict the next token you’re not just teaching it to guess the next word but how to best encode the human knowledge it sees. better compression means better abstraction means better reasoning. at some point, compression stops looking like storage or a database (as some like to call it on X) and looks like an approximation of understanding.

ℏεsam

119,233 views • 24 days ago

I can't delay the launch any longer! Introducing URL Monitor... 🎉 Track your URLs in Google, monitor potential issues and index new pages in bulk. Check out the video below (or visit the site and view it on the homepage for more details on how it works)

Ian Nuttall

63,051 views • 2 years ago

So Google TurboQuant is basically Pied Piper and just hit a Weissman Score of 5.2

K A L E O

476,874 views • 3 months ago

With Chatbot UI you can use 100+ models in the same chat experience. Watch me use Mixtral 8x7b via Groq and then switch to the new Claude 3 Opus. Any model. One interface.

Mckay Wrigley

104,921 views • 2 years ago

I added KV caching and INT8 KV quantization to our transformer inference, improving throughput by 35x. All of this was done from scratch in Rust + CUDA, on top of a homemade ML framework.
On a 4-token prompt with 252 generated tokens:
- Original: 0.76 tok/s
- KV cache fp32: 27.21 tok/s
- KV cache int8 (quantized): 27.29 tok/s
Try it out yourself here:
In practice:
- KV caching gave us about a 35x end-to-end speedup
- INT8 KV cache kept roughly the same speed as fp32 but cut KV cache memory by 3.78x. FP32 cache used 4.5 MB in this run while the INT8 cache used only 1.19 MB
This simple change to inference created a huge impact on performance. To learn more about the KV cache and other optimizations like this, check out the blog at

Reese Chong

52,588 views • 2 months ago
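
The INT8 KV quantization described in the post above can be sketched as symmetric per-row quantization. This is plain Python rather than the author's Rust + CUDA code, and the helper names and the row length of 64 are assumptions made only for illustration. It also shows why the memory saving lands somewhat below the ideal 4x over fp32: each row carries its own scale factor.

```python
def quantize_int8(row):
    """Symmetric per-row quantization: int8 codes plus one float scale."""
    max_abs = max(abs(x) for x in row)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(x / scale))) for x in row]
    return codes, scale


def dequantize_int8(codes, scale):
    """Map int8 codes back to approximate float values."""
    return [c * scale for c in codes]


def cache_bytes(num_values, row_len, bytes_per_value, scale_bytes=0):
    """Bytes for num_values entries plus one scale per row of row_len entries."""
    rows = num_values // row_len
    return num_values * bytes_per_value + rows * scale_bytes


if __name__ == "__main__":
    row = [0.5, -1.27, 0.0, 1.0]
    codes, scale = quantize_int8(row)
    print("codes   :", codes)
    print("restored:", dequantize_int8(codes, scale))

    # Hypothetical cache: 100 rows of 64 values, one fp32 scale per row.
    fp32 = cache_bytes(64 * 100, 64, 4)
    int8 = cache_bytes(64 * 100, 64, 1, scale_bytes=4)
    print("fp32 / int8 memory ratio:", fp32 / int8)
```

With this assumed layout the ratio is 256/68, a little under 4x; the exact figure in any real run depends on how many values share each scale.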

LLM inference speed with vs. without KV caching: (learn how and why it works below)

Avi Chawla

395,064 views • 3 months ago

LLM inference speed with vs. without KV caching: (learn how and why it works below)

Daily Dose of Data Science

59,218 views • 2 months ago

What compression looks like on vLLM. Same Gemma 4 31B. Red Hat AI's quantized version runs at nearly 2x tokens/sec, half the memory, 99%+ accuracy retained. Open source. Quantized with LLM Compressor. Links in comments. 🙏 Sawyer Bowerman for the 2-minute demo.

Red Hat AI

34,136 views • 2 months ago

Gareth Soloway called the Micron Technology $MU correction to the dot. The interview aired on March 18, right at the top, when Gareth revealed a short position on the Micron stock, and since then it has fallen 25%. Some report that #Google's TurboQuant algos, which reduce the memory consumption of #AI models, may be the cause

David Lin

26,067 views • 3 months ago

andrej karpathy spent two hours teaching one thing: tokens are the atom of llms. tokenization is at the heart of every llm weirdness you've ever debugged.
[watch the 15-min clip below. then run the 7-day playbook] ↓
save this before everyone copies it
learn how the tokenizer works. understand how your llm actually consumes input. then run the engineering roadmap that took one production agent from $4,800/mo to $620/mo in 7 days. 87% reduction. no model swap. no framework migration. no quality drop on the eval set.
token cost in 2026 is an engineering discipline. every line of your system prompt is rent you pay forever.
what was eating the budget:
→ a single forgotten cron job ate 47% of one team's bill. they turned it off on a tuesday and the bill dropped before they wrote any optimization code.
→ anthropic ships a 90% discount on cache reads. one config line, cache_control ephemeral, break-even after one hit. most teams cache the volatile parts of the prompt and watch their hit rate sit at 12%.
→ one production agent went from 14,500 tokens of context overhead per turn to 850. a 94% drop. output quality held within 2% of the uncompressed baseline.
→ 60% of agent calls are haiku-tier work running on opus rates. classify the task first. pick the model second.
→ retry loops are the silent killer. no MAX_STEPS bound, one bad search query, $14 burned in a single session. one team traced 38% of their bill to this single pattern.
karpathy gave you the atom. the playbook below gives you the harness. watch the lecture. read the playbook ↓

Rohit

73,031 views • 1 month ago

🚨 | Lewis Hamilton on Mercedes’ compression ratio: “What’s clear is that they didn’t show the engine power through any of the practice because of the whole talk on the compression issue. They’ve done a really solid job with their engine, which we have as well but I’m trying to understand why it's 2 tenths or more just through power per sector... And so, if it is the compression thing, I wanna understand why the FIA haven't done anything [or] what’s being done to rectify it.”

deni

959,964 views • 3 months ago

Perplexity Finance is now available on the Perplexity iOS and Android apps. Just search for any stock ticker or type "Finance" in the mobile app.

Perplexity Finance

235,927 views • 9 months ago

Mistral OCR 3 now returns confidence scores, at page level or word level. Know exactly how sure the model is about what it extracted. Available now via API and to test in Mistral Studio.

Mistral AI for Developers

29,944 views • 2 months ago