Video wird geladen...

Video konnte nicht geladen werden

Beim Laden dieses Videos ist ein Problem aufgetreten. Dies könnte an einem vorübergehenden Netzwerkproblem liegen oder das Video ist möglicherweise nicht verfügbar.

i just beat Google DeepMind's turboquant introducing Shard. 10x KV cache compression on Llama-3.1-8B. zero quality loss - 10x @ 8K context, 11.2x @ 32K - NIAH recall 1.000 across 4K-32K - LongBench Δ ≈ 0 vs FP16 turboquant tops out at 4-6x at the same quality. we doubled... show more

Krish

1,836 subscribers

154,488 Aufrufe • vor 1 Monat •via X (Twitter)

Bildung Wissenschaft & Technologie

Anya Rossi• Live Now

Private livecam show

0 Kommentare

Keine Kommentare verfügbar

Kommentare vom Original-Post werden hier angezeigt

Ähnliche Videos

Sentra just killed Google Research's TurboQuant. SpectralQuant — 5.95× KV cache compression on Mistral 7B at +7.5% perplexity overhead. TurboQuant at the same compression: +22%. 3× less degradation. 15-second calibration. One per-model, then drop-in for any HuggingFace LLM, ViT, ESM, AlphaFold Evoformer, or VideoMAE. Check out the findings and how the mechanism works below. ↓

Sentra just killed Google Research's TurboQuant. SpectralQuant — 5.95× KV cache compression on Mistral 7B at +7.5% perplexity overhead. TurboQuant at the same compression: +22%. 3× less degradation. 15-second calibration. One per-model, then drop-in for any HuggingFace LLM, ViT, ESM, AlphaFold Evoformer, or VideoMAE. Check out the findings and how the mechanism works below. ↓

Ashwin Gopinath

59,026 Aufrufe • vor 1 Monat

Your local AI just got up to 5x more memory. Same model. Same device. Nearly zero accuracy loss. QVAC SDK 0.12.0 integrates TurboQuant - Google Research's latest memory optimisation algorithm. What is TurboQuant? The KV cache is the memory your model uses to track a conversation. As context grows, it fills up fast. 32K tokens. 64K. Game over. TurboQuant compresses it up to 5x with no accuracy loss. What does it unlock for you? Your app had a 16K token ceiling? It's now 96K. On the same device. Just update the QVAC SDK to get up to 5x more efficiency. No code changes. All from one SDK. The TurboQuant integration unlocks sovereign intelligence for more people, on more devices. Learn more →

Your local AI just got up to 5x more memory. Same model. Same device. Nearly zero accuracy loss. QVAC SDK 0.12.0 integrates TurboQuant - Google Research's latest memory optimisation algorithm. What is TurboQuant? The KV cache is the memory your model uses to track a conversation. As context grows, it fills up fast. 32K tokens. 64K. Game over. TurboQuant compresses it up to 5x with no accuracy loss. What does it unlock for you? Your app had a 16K token ceiling? It's now 96K. On the same device. Just update the QVAC SDK to get up to 5x more efficiency. No code changes. All from one SDK. The TurboQuant integration unlocks sovereign intelligence for more people, on more devices. Learn more →

QVAC

15,798,688 Aufrufe • vor 25 Tagen

Yesterday we announced that the QVAC SDK update unlocked up to 5x more context on your device thanks to TurboQuant. Today, we’ll go through how we got there. TurboQuant (Google Research, ICLR 2026) is a two-stage KV-cache compression algorithm. Stage 1 - PolarQuant: convert KV vectors from Cartesian (x, y, z...) to polar coordinates. Angles compress predictably down to 3-4 bits. Stage 2 - QJL: 1-bit Johnson-Lindenstrauss correction. Cleans up residual error. Total: ~4-5 bits per value. No retraining. No calibration. QVAC ported it to Vulkan inside qvac-fabric-llm.cpp. Currently, TurboQuant is supported only for AMD & NVIDIA GPUs, support for iOS, Android & Apple Silicon coming next. Full algorithm walkthrough + benchmarks + code examples →

Yesterday we announced that the QVAC SDK update unlocked up to 5x more context on your device thanks to TurboQuant. Today, we’ll go through how we got there. TurboQuant (Google Research, ICLR 2026) is a two-stage KV-cache compression algorithm. Stage 1 - PolarQuant: convert KV vectors from Cartesian (x, y, z...) to polar coordinates. Angles compress predictably down to 3-4 bits. Stage 2 - QJL: 1-bit Johnson-Lindenstrauss correction. Cleans up residual error. Total: ~4-5 bits per value. No retraining. No calibration. QVAC ported it to Vulkan inside qvac-fabric-llm.cpp. Currently, TurboQuant is supported only for AMD & NVIDIA GPUs, support for iOS, Android & Apple Silicon coming next. Full algorithm walkthrough + benchmarks + code examples →

QVAC

14,467,728 Aufrufe • vor 24 Tagen

I implemented Google Research's TurboQuant as a CUDA-native compression engine on Blackwell B200. 5x KV cache compression on Qwen 2.5-1.5B, near-loseless attention scores, generating live from compressed memory. 5 custom cuTile CUDA kernels ft: - fused attention (with QJL corrections) - online softmax -on-chip cache decompression - pipelined TMA loads Try it out: s/o Bryce, the CUDA Colonel and the cuTile team at NVIDIA for lending me Blackwell GPU access :) cc sunny madra Gavin

I implemented Google Research's TurboQuant as a CUDA-native compression engine on Blackwell B200. 5x KV cache compression on Qwen 2.5-1.5B, near-loseless attention scores, generating live from compressed memory. 5 custom cuTile CUDA kernels ft: - fused attention (with QJL corrections) - online softmax -on-chip cache decompression - pipelined TMA loads Try it out: s/o Bryce, the CUDA Colonel and the cuTile team at NVIDIA for lending me Blackwell GPU access :) cc sunny madra Gavin

ani

806,592 Aufrufe • vor 2 Monaten

Google's Gemma 4 26B A4B QAT hits 25+ tokens/sec and 320+ tokens/sec prefill on 8 GB VRAM (RTX 4060) + 16 GB RAM using TurboQuant Prefill just went from 200 → 320+ tok/s on the same 8GB card. 1.6x, no new hardware, no new quant, just a KV cache trick stacked on top of the Gemma 4 26B MoE setup from a few days ago. A few days ago I posted Gemma 4 26B A4B hitting 28 tok/s decode on 8GB VRAM using native MTP. prefill was stuck around 200 tok/s. fair callout by the community. So today I tested something I'd already been meaning to try: TheTom/llama-cpp-turboquant, the TurboQuant KV cache fork by Tom Turney (Tom Turney). (github link in the comments) thanks to him, the fork just got resynced to mainline, so MTP + TurboQuant now run together cleanly (I didnt see any meaningful gains by using MTP with this setup though but you can try). The flags (No MTP): -m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf -cnv -c 64000 --cache-type-k q8_0 --cache-type-v turbo3 Results on the same RTX 4060 8GB, tested with a 27k token prompt at 64k context loaded: Prefill: 200 tok/s → 320+ tok/s Decode: stayed above 25 tok/s (without MTP) Why it works: TurboQuant uses walsh hadamard rotation + polar quantization on the KV cache. keys are sensitive to compression, values aren't much, so it splits the difference: K stays at q8_0, V drops to turbo3 (~3 bits). bonus from the memory savings: same 8GB card can now stretch to 100-120k context with minimal decode penalty. It should now be snappier with any agent harness such as hermes agent without compromise on intelligence. If you're already running Gemma 4 on a small card, this stacks on top for free. Try --cache-type-k q8_0 --cache-type-v turbo3 on your setup and report back what your prefill/decode split looks like. unsloth model gguf and llama.cpp turboquant fork links in the comments. what's your prefill number before vs after?

Google's Gemma 4 26B A4B QAT hits 25+ tokens/sec and 320+ tokens/sec prefill on 8 GB VRAM (RTX 4060) + 16 GB RAM using TurboQuant Prefill just went from 200 → 320+ tok/s on the same 8GB card. 1.6x, no new hardware, no new quant, just a KV cache trick stacked on top of the Gemma 4 26B MoE setup from a few days ago. A few days ago I posted Gemma 4 26B A4B hitting 28 tok/s decode on 8GB VRAM using native MTP. prefill was stuck around 200 tok/s. fair callout by the community. So today I tested something I'd already been meaning to try: TheTom/llama-cpp-turboquant, the TurboQuant KV cache fork by Tom Turney (Tom Turney). (github link in the comments) thanks to him, the fork just got resynced to mainline, so MTP + TurboQuant now run together cleanly (I didnt see any meaningful gains by using MTP with this setup though but you can try). The flags (No MTP): -m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf -cnv -c 64000 --cache-type-k q8_0 --cache-type-v turbo3 Results on the same RTX 4060 8GB, tested with a 27k token prompt at 64k context loaded: Prefill: 200 tok/s → 320+ tok/s Decode: stayed above 25 tok/s (without MTP) Why it works: TurboQuant uses walsh hadamard rotation + polar quantization on the KV cache. keys are sensitive to compression, values aren't much, so it splits the difference: K stays at q8_0, V drops to turbo3 (~3 bits). bonus from the memory savings: same 8GB card can now stretch to 100-120k context with minimal decode penalty. It should now be snappier with any agent harness such as hermes agent without compromise on intelligence. If you're already running Gemma 4 on a small card, this stacks on top for free. Try --cache-type-k q8_0 --cache-type-v turbo3 on your setup and report back what your prefill/decode split looks like. unsloth model gguf and llama.cpp turboquant fork links in the comments. what's your prefill number before vs after?

Alok

117,304 Aufrufe • vor 8 Tagen

I connected GPT-4 to my car 🚗 The extra context (8k vs 4k) is very useful for complex use cases - can’t wait to play around with 32k! Greg Brockman

I connected GPT-4 to my car 🚗 The extra context (8k vs 4k) is very useful for complex use cases - can’t wait to play around with 32k! Greg Brockman

Qasim Munye

152,359 Aufrufe • vor 3 Jahren

$Introducing `llama-405b-to-8b` ✍️ Get the quality of Llama 3.1 405B, at a fraction of the cost and latency. Give one example of your task, and 405B will teach 8B (~30x cheaper!!) how to do the task perfectly. And it's open-source:$

Introducing `llama-405b-to-8b` ✍️ Get the quality of Llama 3.1 405B, at a fraction of the cost and latency. Give one example of your task, and 405B will teach 8B (~30x cheaper!!) how to do the task perfectly. And it's open-source:

Matt Shumer

278,384 Aufrufe • vor 1 Jahr

DeepSeek-V4 dropped. 1M context. 10x smaller KV cache. First open model where the context window and the agentic post-training meet.

DeepSeek-V4 dropped. 1M context. 10x smaller KV cache. First open model where the context window and the agentic post-training meet.

Ben Burtenshaw

49,900 Aufrufe • vor 2 Monaten

nvidia's 3B mamba destroyed alibaba's 3B deltanet on the same RTX 3090. only 24 days between releases. same active parameters, same VRAM tier, completely different architectures. nemotron cascade 2: 187 tok/s. flat from 4K to 625K context. zero speed loss. flags: -ngl 99 -np 1. that's it. no context flags, no KV cache tricks. auto-allocates 625K. qwen 3.5 35B-A3B: 112 tok/s. flat from 4K to 262K context. zero speed loss. flags: -ngl 99 -np 1 -c 262144 --cache-type-k q8_0 --cache-type-v q8_0. needed KV cache quantization to fit 262K. both models held a flat line across every context level. both architectures are context-independent. but nvidia's mamba2 is 67% faster at generating tokens on the exact same hardware and needs fewer flags to get there. same node, same GPU, same everything. the only variable is the model. gold medal math olympiad winner running at 187 tokens per second on single RTX 3090 a card from 6 years ago. nvidia cooked.

nvidia's 3B mamba destroyed alibaba's 3B deltanet on the same RTX 3090. only 24 days between releases. same active parameters, same VRAM tier, completely different architectures. nemotron cascade 2: 187 tok/s. flat from 4K to 625K context. zero speed loss. flags: -ngl 99 -np 1. that's it. no context flags, no KV cache tricks. auto-allocates 625K. qwen 3.5 35B-A3B: 112 tok/s. flat from 4K to 262K context. zero speed loss. flags: -ngl 99 -np 1 -c 262144 --cache-type-k q8_0 --cache-type-v q8_0. needed KV cache quantization to fit 262K. both models held a flat line across every context level. both architectures are context-independent. but nvidia's mamba2 is 67% faster at generating tokens on the exact same hardware and needs fewer flags to get there. same node, same GPU, same everything. the only variable is the model. gold medal math olympiad winner running at 187 tokens per second on single RTX 3090 a card from 6 years ago. nvidia cooked.

Sudo su

186,388 Aufrufe • vor 3 Monaten

So Google TurboQuant is basically Pied Piper and just hit a Weismann Score of 5.2

So Google TurboQuant is basically Pied Piper and just hit a Weismann Score of 5.2

K A L E O

476,858 Aufrufe • vor 3 Monaten

Multimodal Ichigo Llama 3.1 - Real Time Voice AI 🔥 > WhisperSpeech X Llama 3.1 8B > Trained on 50K hours of speech (7 languages) > Continually trained on 45hrs 10x A1000s > MLS -> WhisperVQ tokens -> Llama 3.1 > Instruction tuned on 1.89M samples > 70% speech, 20% transcription, 10% text > Apache 2.0 licensed ⚡ Architecture: > WhisperSpeech/ VQ for Semantic Tokens > Llama 3.1 8B Instruct for Text backbone > Early fusion (Chameleon) I'm super bullish on Homebrew to Menlo and early fusion, audio and text, multimodal models! (P.S. Play with the demo on Hugging Face)

Multimodal Ichigo Llama 3.1 - Real Time Voice AI 🔥 > WhisperSpeech X Llama 3.1 8B > Trained on 50K hours of speech (7 languages) > Continually trained on 45hrs 10x A1000s > MLS -> WhisperVQ tokens -> Llama 3.1 > Instruction tuned on 1.89M samples > 70% speech, 20% transcription, 10% text > Apache 2.0 licensed ⚡ Architecture: > WhisperSpeech/ VQ for Semantic Tokens > Llama 3.1 8B Instruct for Text backbone > Early fusion (Chameleon) I'm super bullish on Homebrew to Menlo and early fusion, audio and text, multimodal models! (P.S. Play with the demo on Hugging Face)

Vaibhav (VB) Srivastav

82,126 Aufrufe • vor 1 Jahr

Google acaba de publicar un nuevo algoritmo de compresión, se llama +TurboQuant (MasTurboQuant) ¿A quién le habrán copiado la idea?

Google acaba de publicar un nuevo algoritmo de compresión, se llama +TurboQuant (MasTurboQuant) ¿A quién le habrán copiado la idea?

El Programador Senior

156,790 Aufrufe • vor 3 Monaten

Nvidia just revealed Vera Rubin. Ships H2 2026. The numbers are wild: → 10x more performance per watt vs Blackwell → 10x cheaper inference token cost → 4x fewer GPUs to train the same MoE model Energy was the biggest bottleneck in AI. Nvidia just made it 10x cheaper.

Nvidia just revealed Vera Rubin. Ships H2 2026. The numbers are wild: → 10x more performance per watt vs Blackwell → 10x cheaper inference token cost → 4x fewer GPUs to train the same MoE model Energy was the biggest bottleneck in AI. Nvidia just made it 10x cheaper.

Min Choi

285,155 Aufrufe • vor 4 Monaten

I used NotebookLM to study Google new breakthrough with TurboQuant and used Video overview to study the subject, best learning tool in the world at the moment. TurboQuant: Redefining AI Efficiency with Extreme Compression Google Research has introduced TurboQuant, a suite of advanced algorithms designed to dramatically compress the data used by large language models and search engines. By utilizing specialized techniques like PolarQuant and Quantized Johnson-Lindenstrauss, the system transforms complex information into a compact "shorthand" that requires significantly less memory. This innovation specifically addresses the key-value cache bottleneck, allowing AI models to process massive amounts of text faster without losing accuracy. Testing demonstrates that these methods can shrink data size by six times while actually increasing operational speed on high-end hardware. Ultimately, these theoretical advancements enable more efficient semantic search and high-performance AI applications at a global scale.

I used NotebookLM to study Google new breakthrough with TurboQuant and used Video overview to study the subject, best learning tool in the world at the moment. TurboQuant: Redefining AI Efficiency with Extreme Compression Google Research has introduced TurboQuant, a suite of advanced algorithms designed to dramatically compress the data used by large language models and search engines. By utilizing specialized techniques like PolarQuant and Quantized Johnson-Lindenstrauss, the system transforms complex information into a compact "shorthand" that requires significantly less memory. This innovation specifically addresses the key-value cache bottleneck, allowing AI models to process massive amounts of text faster without losing accuracy. Testing demonstrates that these methods can shrink data size by six times while actually increasing operational speed on high-end hardware. Ultimately, these theoretical advancements enable more efficient semantic search and high-performance AI applications at a global scale.

Emily

14,648 Aufrufe • vor 3 Monaten

Qwen3-Next (thinking & non-thinking) are now live in BF16 at Hyperbolic! Qwen3-Next is a huge efficiency leap: - 80B MoE with just 3B active params - 10x cheaper to train vs Qwen3-32B - 10x inference throughput for >32K tokens Proud to be a launch partner with Qwen - kudos to this amazing team for keeping pushing open-source AI forward. We’re the first to serve Qwen3-Next on Hugging Face. Give it a try!

Qwen3-Next (thinking & non-thinking) are now live in BF16 at Hyperbolic! Qwen3-Next is a huge efficiency leap: - 80B MoE with just 3B active params - 10x cheaper to train vs Qwen3-32B - 10x inference throughput for >32K tokens Proud to be a launch partner with Qwen - kudos to this amazing team for keeping pushing open-source AI forward. We’re the first to serve Qwen3-Next on Hugging Face. Give it a try!

Yuchen Jin

84,342 Aufrufe • vor 9 Monaten

I added KV caching and INT8 KV quantization to our transformer inference, improving throughput by 35x. All of this was done from scratch in Rust + CUDA, on top of a homemade ML framework. On a 4-token prompt with 252 generated tokens: - Original: 0.76 tok/s - KV cache fp32: 27.21 tok/s - KV cache int8 (quantized): 27.29 tok/s Try it out yourself here: In practice: - KV caching gave us about a 35x end-to-end speedup - INT8 KV cache kept roughly the same speed as fp32 but cut KV cache memory by 3.78x FP32 cache used 4.5 MB in this run while the INT8 cache used only 1.19 MB This simple change to inference created a huge impact on performance. To learn more about the KV cache and other optimizations like this, check out the blog at

I added KV caching and INT8 KV quantization to our transformer inference, improving throughput by 35x. All of this was done from scratch in Rust + CUDA, on top of a homemade ML framework. On a 4-token prompt with 252 generated tokens: - Original: 0.76 tok/s - KV cache fp32: 27.21 tok/s - KV cache int8 (quantized): 27.29 tok/s Try it out yourself here: In practice: - KV caching gave us about a 35x end-to-end speedup - INT8 KV cache kept roughly the same speed as fp32 but cut KV cache memory by 3.78x FP32 cache used 4.5 MB in this run while the INT8 cache used only 1.19 MB This simple change to inference created a huge impact on performance. To learn more about the KV cache and other optimizations like this, check out the blog at

Reese Chong

52,588 Aufrufe • vor 2 Monaten

Jamie Gittens injury, Estevao on leave, and Garnacho just being Garnacho. I need to see Jesse Derry around the Chelsea first team more, he's top quality. Mheuka also 10x better than Delap.

Jamie Gittens injury, Estevao on leave, and Garnacho just being Garnacho. I need to see Jesse Derry around the Chelsea first team more, he's top quality. Mheuka also 10x better than Delap.

JnR

59,599 Aufrufe • vor 4 Monaten

LFM2-Audio just dropped! It's a 1.5B model that understands and generates both text and audio Inference 10x faster + quality on par with models 10x larger Available today on Hugging Face and our playground 🥳

LFM2-Audio just dropped! It's a 1.5B model that understands and generates both text and audio Inference 10x faster + quality on par with models 10x larger Available today on Hugging Face and our playground 🥳

Maxime Labonne

84,952 Aufrufe • vor 8 Monaten

Introducing Kuse 2.0 6x traffic. 10x demo requests. Kuse 1.0 blew up Loved. Hated. Debated. We listened, and improved it at God speed Kuse 2.0 is not just an “AI Dropbox” It’s an AI OS that helps you achieve more by understanding you better - Smarter context - Powerful data & file management - Flexible intent expression, from words to sketches Now in closed beta. Apply your code today ↓

Introducing Kuse 2.0 6x traffic. 10x demo requests. Kuse 1.0 blew up Loved. Hated. Debated. We listened, and improved it at God speed Kuse 2.0 is not just an “AI Dropbox” It’s an AI OS that helps you achieve more by understanding you better - Smarter context - Powerful data & file management - Flexible intent expression, from words to sketches Now in closed beta. Apply your code today ↓

Kuse

853,790 Aufrufe • vor 8 Monaten