But the KV cache is created for each transformer layer. By sending each layer’s KV cache after it’s computed, we overlap communication with computation. We stream the KV cache and hide the network delay. We achieve a 4x speedup in prefill & 3x in decode, with 0 network delay.

EXO Labs

42,045 subscribers

22,604 views • 8 months ago • via X (Twitter)

Education Science & Technology
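The layer-by-layer streaming described in the post can be illustrated with a small timing model. This is only a sketch under stated assumptions, not EXO Labs' implementation: the function names, the fixed per-layer compute and transfer times, and the single serial network link are all hypothetical. It compares sending the whole KV cache after prefill with sending each layer's KV cache as soon as that layer is computed.

```python
def blocking_time(num_layers, compute_ms, transfer_ms):
    """Compute every layer first, then send the whole KV cache."""
    return num_layers * compute_ms + num_layers * transfer_ms


def streamed_time(num_layers, compute_ms, transfer_ms):
    """Send layer i's KV cache while layer i+1 is being computed.

    Assumes one serial link: a transfer starts once its layer is
    computed AND the previous transfer has finished.
    """
    compute_done = 0.0  # time at which the current layer finishes computing
    link_free = 0.0     # time at which the network link becomes idle
    for _ in range(num_layers):
        compute_done += compute_ms
        send_start = max(compute_done, link_free)
        link_free = send_start + transfer_ms
    return link_free


if __name__ == "__main__":
    # Hypothetical numbers: 32 layers, 10 ms compute and 5 ms transfer per layer.
    print(blocking_time(32, 10, 5), streamed_time(32, 10, 5))
```

In this model, when a layer's transfer is no slower than its compute, only the final layer's transfer remains exposed; when the link is slower than compute, the link becomes the bottleneck and the delay can no longer be fully hidden.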

Related videos

🎥 Video generation is hitting the memory wall. As videos get longer, the KV cache quietly explodes — and long-horizon consistency starts to break. We built Quant VideoGen: a training-free KV cache compression method for auto-regressive video diffusion. Instead of storing every KV in high precision, QVG exploits video’s spatiotemporal redundancy with semantic-aware smoothing + progressive residual quantization. 🚀 Up to 7× KV memory reduction ⚡ <4% overhead ✅ Strong long-video quality 🕹️ Deploy HYWorldPlay on your own RTX 5090 locally KV compression is becoming a core scaling primitive — not just for LLMs, but for video generation too. Paper: Code: (1/5)

Haocheng Xi

64,520 views • 2 months ago
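The post above mentions progressive residual quantization without giving details. The following is a generic sketch of residual quantization in plain Python, not the QVG method itself: the uniform symmetric quantizer, the function names, and the sample values are all assumptions made for illustration. Each stage quantizes whatever error the previous stages left behind, so the approximation tightens as stages are added.

```python
def quantize(values, bits):
    """Uniform symmetric quantization; returns the dequantized values."""
    levels = 2 ** (bits - 1) - 1            # e.g. 1 level each side for 2 bits
    scale = max(abs(v) for v in values) / levels or 1.0
    return [round(v / scale) * scale for v in values]


def residual_quantize(values, bits_per_stage):
    """Quantize in stages; each stage encodes the remaining residual."""
    approx = [0.0] * len(values)
    residual = list(values)
    for bits in bits_per_stage:
        stage = quantize(residual, bits)
        approx = [a + s for a, s in zip(approx, stage)]
        residual = [v - a for v, a in zip(values, approx)]
    return approx


def max_error(values, approx):
    return max(abs(v - a) for v, a in zip(values, approx))


if __name__ == "__main__":
    kv = [0.9, -0.35, 0.1, 0.62]            # hypothetical KV values
    print(max_error(kv, residual_quantize(kv, [2])))
    print(max_error(kv, residual_quantize(kv, [2, 2])))
```

The sketch ignores the semantic-aware smoothing step named in the post and the storage cost of per-stage scales.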

KV cache yada turboquant yada yada compression something something $DRAM $MU $SNDK

TheLAPurchaser

19,786 views • 2 months ago

pip install spectralquant ✂️ Up to 6.62x KV cache compression for LLMs and transformers. Same model. Faster outputs. Smaller KV cache. Try now (2 mins): - KV cache integration via Hugging Face's DynamicCache - Three presets: 5.95x (paper), 6.55x (validated), 6.68x (edge) - Mistral 7B / Qwen 2.5 7B / Llama 3.1 8B verified - Pure PyTorch + future CUDA kernel support - Auto-calibration from a bundled corpus 📰 Paper: 💻 Code, quickstart, and benchmarks: #LLM #Inference #PyTorch #OpenSource #MachineLearning #LLM #KVCache #Inference

ani

16,583 views • 1 month ago

🚀 Self-speculation brings 6.75x real speedup for LLM generation with SGLang inference! Same model drafts future tokens in Diffusion mode → then verifies them in AR (causal) mode. One model and one KV cache. Just different attention masks. Thanks to perfect alignment, we get 2× longer acceptance lengths than MTP techniques (Eagle-3, MTP, dFlash). We run 2 forward passes… but the 2× higher acceptance means we break even - and with zero overhead from extra drafter, KV cache, or LM head that comes with MTP - those are not free. Last week we released Nemotron-Labs-Diffusion + Tri-mode LLMs! We did continued pre-training on Ministral-3 models by switching attention patterns (block causal bidirectional). Result: one model that runs AR mode, Diffusion mode, and Self-Speculation. Diffusion mode already shows high benchmark accuracy - excited to see what happens when someone beats left-to-right acceptance! 🔥 Github: Paper: SGLang inference: Try the models on HF:

Pavlo Molchanov

66,354 views • 1 month ago

Gemma 4 12B QAT (dense) achieves 1000+ tokens/sec prefill on 8GB VRAM with 120k context

Gemma 4 12B QAT (dense), TurboQuant (Without MTP), RTX 4060 8GB VRAM:
Prefill: 1000+ tok/s (42% increase)
Decode: 25+ tok/s (25% increase)
Context: 120k (150% increase)

prefill was 700 tok/sec and decode 20 tok/sec with only 48k context without turbo quant (older test with mtp, link in the comments)

llama.cpp TurboQuant flags:
-m gemma-4-12B-it-qat-UD-Q4_K_XL.gguf -c 120000 --cache-type-k q8_0 --cache-type-v turbo3 -ngl 99 --port 8080

tested with a 27k prompt, 120k context loaded. -ngl 99 here isn't a typo, full 12B dense, every layer on GPU, on an 8GB card. that's the part worth sitting with.

The model has vision, audio input, thinking/reasoning and fits your 8GB card. TurboQuant's KV cache savings are what free up the room to do that at 120k context.

side by side with yesterday: 26B A4B MoE got 320+ tok/s prefill. this dense 12B is clearing 1000+

rig: RTX 4060 8GB · i7H · 16GB RAM

same two flags as yesterday, different model size: --cache-type-k q8_0 --cache-type-v turbo3

thanks to TheTom/llama-cpp-turboquant, TurboQuant fork of llama.cpp by Tom Turney, to make this work. unsloth's model quant huggingface and the llama.cpp fork github link in the comments

Do you prefer a dense or a MoE for your 8GB card?

Alok

34,500 views • 14 days ago

Epic is working on a way for you to quickly clear the games cache manually! Once you clear the cache it does point out that some games will take longer to load after cache has been cleared as it needs to download/cache again.

Hybrid

18,239 views • 1 year ago

Robots can now reconstruct 3D scenes in real time from a single RGB camera. [📍 Projects page + paper] No depth sensor. No retraining. 30 FPS. Researchers at the Imperial College London introduced KV-Tracker, a training-free method that makes heavy models like π³ and Depth Anything 3 fast enough for real-time tracking. The idea is simple. These models use global self-attention, which is powerful but computationally expensive. KV-Tracker caches the key and value pairs from selected keyframes and reuses them for new frames. That cache becomes an implicit scene representation. Result: • Up to 30 FPS • 10 to 15x speedup • Accurate 6-DoF tracking on benchmarks like TUM RGB-D and 7-Scenes • Works with monocular RGB only It also supports object-level tracking with masks and allows saving the KV-cache for later reuse. For robotics, this reduces hardware constraints and moves real-time 3D perception closer to practical deployment. Credit to Marwan Taher (Marwan Taher) at Imperial’s Dyson Robotics Lab and many others who contributed to this! 📍 Save projects page + paper for later: Video: ——- if it matters in AI or Robotics you'll read it here first:

Ilir Aliu

53,866 views • 2 months ago

My Saturday project: make HEY load faster in Ladybird 👋🚀 Got it down from ~1 s to ~400 ms with two changes: - Our HTTP cache now supports heuristic freshness lifetime (per RFCs 9110 & 9111) - We delay rendering until we have all CSS imports (no more FOUC!)

Andreas Kling

97,646 views • 7 months ago
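The heuristic freshness lifetime mentioned in the post above can be sketched roughly as follows. RFC 9111 allows a cache to estimate freshness when a response carries no explicit expiration, and notes that a typical choice is 10% of the time since Last-Modified. The code below is a minimal illustration of that typical choice, not Ladybird's implementation: the function name, the plain dict of headers, and the fixed divisor of 10 are assumptions.

```python
from datetime import timedelta
from email.utils import parsedate_to_datetime


def heuristic_freshness_lifetime(headers, divisor=10):
    """Estimate a freshness lifetime as (Date - Last-Modified) / divisor.

    Only meant for responses without explicit freshness information;
    returns zero when a header is missing or the dates are inconsistent.
    """
    if "Date" not in headers or "Last-Modified" not in headers:
        return timedelta(0)
    date = parsedate_to_datetime(headers["Date"])
    last_modified = parsedate_to_datetime(headers["Last-Modified"])
    unchanged_for = date - last_modified
    if unchanged_for <= timedelta(0):
        return timedelta(0)
    return unchanged_for / divisor


if __name__ == "__main__":
    # Hypothetical response headers for illustration.
    response = {
        "Date": "Wed, 21 Oct 2015 07:28:00 GMT",
        "Last-Modified": "Sun, 11 Oct 2015 07:28:00 GMT",
    }
    print(heuristic_freshness_lifetime(response))
```

A real cache would check Cache-Control and Expires first and fall back to a heuristic only when neither gives an explicit lifetime; that check is left out here.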

gemma-4-12B-agentic-fable5-composer2.5 V2 is out. the agentic upgrade to the model trained on Fable 5's reasoning. Running it now with TurboQuant llama.cpp on a single RTX 4060 (8 GB VRAM) at 30 tokens/second with full 25000 context and reasoning:

# The benchmarks

v2 is built for coding + agentic work. writing code, running commands, using tools, debugging, multi step technical tasks. The clearest signal is tau2 bench telecom, an agentic tool use benchmark whose diagnose → fix → verify loop mirrors real terminal/debugging work:

tau2 bench telecom numbers:
base Gemma 4 12B: ~15%
this finetune: ~55%. (Self reported)

thats a huge jump

# TheTom/llama-cpp-turboquant flags:

llama-server.exe -m gemma4-v2-Q4_K_M.gguf -ngl 99 -c 25000 --cache-type-k q8_0 --cache-type-v turbo3 --port 8080

Flag breakdown:
-ngl 99 → full GPU offload
-c 25000 → 25K context
--cache-type-k q8_0 --cache-type-v turbo3 → mixed-precision KV cache: K at 8-bit, V at ~3-bit via TurboQuant (Walsh Hadamard rotated polar quant, Google's own KV-compression research). Not even merged into mainline llama.cpp. running it off a fork.

No API. No cloud. Just llama.cpp. well, a fork of it and any 6gb+ GPU.

If you tried yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF, check this out and share your experience with the models

Alok

143,589 views • 11 days ago
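To get a feel for why the mixed-precision KV cache in the flag breakdown above frees up memory, here is a back-of-the-envelope size estimate. It is a rough sketch, not a measurement: the layer count, KV head count, and head dimension are hypothetical placeholders rather than the real shape of this model, and the bit widths are the nominal 8 and ~3 bits from the post, ignoring per-block scale overhead and any sliding-window layers.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, k_bits, v_bits):
    """Approximate KV cache size: one K and one V vector per layer per token."""
    elements_per_side = n_layers * n_kv_heads * head_dim * n_ctx
    total_bits = elements_per_side * (k_bits + v_bits)
    return total_bits / 8


if __name__ == "__main__":
    # Hypothetical model shape, chosen only for illustration.
    shape = dict(n_layers=48, n_kv_heads=8, head_dim=128, n_ctx=25000)
    f16 = kv_cache_bytes(**shape, k_bits=16, v_bits=16)
    mixed = kv_cache_bytes(**shape, k_bits=8, v_bits=3)
    print(f"f16: {f16 / 2**30:.2f} GiB")
    print(f"K 8-bit / V 3-bit: {mixed / 2**30:.2f} GiB")
    print(f"ratio: {f16 / mixed:.2f}x")
```

Whatever the model shape, the ratio in this simplified model is (16 + 16) / (8 + 3), or about 2.9x, because the shape cancels out.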

🌐 The #Streamr Network 1.0 mainnet is GO for launch in 48 hours! 🚀 After 6+ years in research & development, get ready for a fully decentralized, feature-complete data broadcasting network, powered by its users. We hope you’re as excited as we are! 🧡

Streamr Network

17,923 views • 2 years ago

Sorry for the delay of pictures and videos from #Itzulia2024, but we had to deal with an unauthorised interview 😂

Soudal Quick-Step Pro Cycling Team

173,176 views • 2 years ago

A Russian Fibre-optic FPV drone struck a transformer at the 110 kV electrical substation in the town of Stary Saltiv, Kharkiv Oblast, resulting in a fire breaking out. Coordinates: 50.06661, 36.76997

AMK Mapping 🇳🇿

26,003 views • 9 days ago

We’re preparing to ship thousands of humanoid robots Here, we are showing 4x humanoid robots, each powered by its own Helix neural network

Brett Adcock

1,186,839 views • 1 year ago

#WATCH | Delhi | After attending the dinner hosted by Rajya Sabha LoP Mallikarjun Kharge for the INDIA bloc leaders, Shiv Sena (UBT) MP Priyanka Chaturvedi says, "We socialised with each other keeping politics aside, and talked about family matters. We met each other in a good atmosphere. The unity of the INDIA Alliance is in front of you all."

ANI

58,515 views • 10 months ago

One million particles created using GitHub Spark, using the KV data to configure and remember the properties #GitHubUniverse #WebGL

Terkel 𓀒

30,709 views • 1 year ago

🔥 Mariupol, attack on PS 220 kV Azovskaya, there's a blackout in the city.

MAKS 25 🇺🇦👀

13,233 views • 5 months ago

In a separate incident, the IDF says the paratroopers located a cache of Hezbollah weapons.

Emanuel (Mannie) Fabian

12,224 views • 2 months ago

the flashback on the new Cache

NAVI

285,702 views • 1 year ago

FMPONE COOKED WITH CACHE HOLYYY

dima_wallhacks

343,271 views • 1 year ago

#VishwakSen's birthday celebrations on the sets of #Funky, joined by Anudeep KV and #KayaduLohar.

Gulte

252,911 views • 1 year ago