pip install spectralquant ✂️

Up to 6.62x KV cache compression for LLMs and transformers. Same model. Faster outputs. Smaller KV cache.

Try now (2 mins):
- KV cache integration via Hugging Face's DynamicCache
- Three presets: 5.95x (paper), 6.55x (validated), 6.68x (edge)
- Mistral 7B / Qwen 2.5 7B...
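
As a rough illustration of what these ratios mean in memory terms, the sketch below estimates KV cache size for a Mistral 7B-style configuration and applies the three preset ratios quoted above. It does not use or depict the spectralquant API. The layer count, KV head count, head dimension, fp16 storage, and the 8,192-token context are assumptions chosen for the example.

```python
def kv_cache_bytes(tokens, layers=32, kv_heads=8, head_dim=128, bytes_per_value=2):
    """Estimate KV cache size in bytes for one sequence.

    The defaults are assumed values for a Mistral 7B-style model stored in fp16.
    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Preset ratios as quoted in the post; the context length is an assumption.
PRESETS = {"paper": 5.95, "validated": 6.55, "edge": 6.68}
CONTEXT_TOKENS = 8192

baseline = kv_cache_bytes(CONTEXT_TOKENS)
print(f"baseline: {baseline / 2**20:.0f} MiB")
for name, ratio in PRESETS.items():
    print(f"{name} ({ratio}x): {baseline / ratio / 2**20:.0f} MiB")
```

Under these assumed settings the uncompressed cache costs 128 KiB per token, so a compression ratio translates directly into how much more context fits in a fixed memory budget.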

ani

6,793 subscribers

16,560 views • 21 days ago • via X (Twitter)

Related videos

My favorite editing model, FLUX.2 [klein] 9B, just got 2x faster: Meet FLUX.2 [klein] 9B-KV 😍💨 > Using KV-Cache Optimization to reduce computation & speed up inference by up to 2.5 times for multi-reference editing love how well it edits "around" the bullets

Linoy Tsaban

30,134 views • 3 months ago

🎥 Video generation is hitting the memory wall. As videos get longer, the KV cache quietly explodes — and long-horizon consistency starts to break. We built Quant VideoGen: a training-free KV cache compression method for auto-regressive video diffusion. Instead of storing every KV in high precision, QVG exploits video’s spatiotemporal redundancy with semantic-aware smoothing + progressive residual quantization. 🚀 Up to 7× KV memory reduction ⚡ <4% overhead ✅ Strong long-video quality 🕹️ Deploy HYWorldPlay on your own RTX 5090 locally KV compression is becoming a core scaling primitive — not just for LLMs, but for video generation too. Paper: Code: (1/5)
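
A minimal sketch of the general idea behind progressive residual quantization, assuming a plain uniform quantizer: each stage quantizes whatever error the previous stages left behind. This is a generic illustration, not QVG's implementation. The function names, bit width, and sample values are hypothetical, and the semantic-aware smoothing step is not modeled.

```python
def quantize(values, bits):
    """Uniform symmetric quantization; returns the dequantized values."""
    levels = 2 ** (bits - 1) - 1
    # Fall back to a scale of 1.0 when every value is zero.
    scale = max(abs(v) for v in values) / levels or 1.0
    return [round(v / scale) * scale for v in values]


def residual_quantize(values, bits, stages):
    """Approximate `values` by summing `stages` low-bit quantized residuals."""
    approx = [0.0] * len(values)
    for _ in range(stages):
        # The residual is the part of the signal not yet captured.
        residual = [v - a for v, a in zip(values, approx)]
        step = quantize(residual, bits)
        approx = [a + s for a, s in zip(approx, step)]
    return approx


sample = [0.9, -0.35, 0.12, 0.5]  # hypothetical KV entries
for stages in (1, 2, 3):
    approx = residual_quantize(sample, bits=4, stages=stages)
    error = max(abs(v - a) for v, a in zip(sample, approx))
    print(f"{stages} stage(s): max error {error:.6f}")
```

Each extra stage costs more bits per value but shrinks the reconstruction error, which is the trade-off a residual scheme lets you tune.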

Haocheng Xi

64,278 views • 1 month ago

Fireworks' blazing-fast LLM inference is now available on Poe! Today, we’re excited to bring the power of the new Mistral 7B Instruct model to Poe, powered by our lightning-fast Fireworks inference platform. You can now have conversations with the Mistral 7B bot and even build your own bots on top of it. Try it here:

Fireworks AI

93,334 views • 2 years ago

KV cache yada turboquant yada yada compression something something $DRAM $MU $SNDK

TheLAPurchaser

19,786 views • 2 months ago

But the KV cache is created for each transformer layer. By sending each layer’s KV cache after it’s computed, we overlap communication with computation. We stream the KV cache and hide the network delay. We achieve a 4x speedup in prefill & 3x in decode, with 0 network delay.
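
The overlap described above can be sketched with a background sender thread: as soon as one layer's KV cache exists it is queued for transfer, and the main thread moves on to the next layer. This is a toy illustration, not EXO's implementation; `compute_layer` and `send` are hypothetical callables standing in for a transformer layer and a network transfer.

```python
import queue
import threading


def run_prefill(num_layers, compute_layer, send):
    """Compute layers in order while a background thread ships each layer's KV."""
    outbox = queue.Queue()

    def sender():
        # Drain the queue until the sentinel (None) arrives.
        while True:
            item = outbox.get()
            if item is None:
                break
            layer, kv = item
            send(layer, kv)

    worker = threading.Thread(target=sender)
    worker.start()

    hidden = 0
    for layer in range(num_layers):
        hidden, kv = compute_layer(layer, hidden)
        # Hand off this layer's KV right away; the next layer starts
        # computing while the transfer is still in flight.
        outbox.put((layer, kv))

    outbox.put(None)  # signal that no more layers are coming
    worker.join()
    return hidden


received = []
final = run_prefill(
    num_layers=4,
    compute_layer=lambda layer, hidden: (hidden + 1, f"kv-{layer}"),
    send=lambda layer, kv: received.append((layer, kv)),
)
print(final, received)
```

A single FIFO queue keeps the layers in order on the wire, so the receiver can start consuming layer 0 before the sender has finished computing the last layer.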

EXO Labs

22,604 views • 8 months ago

🚀 Self-speculation brings 6.75x real speedup for LLM generation with SGLang inference! Same model drafts future tokens in Diffusion mode → then verifies them in AR (causal) mode. One model and one KV cache. Just different attention masks. Thanks to perfect alignment, we get 2× longer acceptance lengths than MTP techniques (Eagle-3, MTP, dFlash). We run 2 forward passes… but the 2× higher acceptance means we break even - and with zero overhead from extra drafter, KV cache, or LM head that comes with MTP - those are not free. Last week we released Nemotron-Labs-Diffusion + Tri-mode LLMs! We did continued pre-training on Ministral-3 models by switching attention patterns (block causal bidirectional). Result: one model that runs AR mode, Diffusion mode, and Self-Speculation. Diffusion mode already shows high benchmark accuracy - excited to see what happens when someone beats left-to-right acceptance! 🔥 Github: Paper: SGLang inference: Try the models on HF:

Pavlo Molchanov

66,270 views • 24 days ago

`transformers` + `torchao` quantization + `torch.compile` for faster inference speed and less memory usage 🔥 Demo of "meta-llama/Meta-Llama-3.1-8B-Instruct" quantized in 4-bit weight-only :

Marc Sun

24,515 views • 1 year ago

We are excited to launch our two models Pharia-1-LLM-7B-control and Pharia-1-LLM-7B-control-aligned. Both models and the code used to train them are now publicly available and open-sourced for non-commercial research and educational use. Read our model blog post here: Learn more about our open-source codebase Scaling: #writtenbyalephalpha

Aleph Alpha

44,326 views • 1 year ago

Happy to OSS gpt-fast, a fast and hackable implementation of transformer inference in <1000 lines of native PyTorch with support for quantization, speculative decoding, TP, Nvidia/AMD support, and more! Code: Blog: (1/12)

Horace He

476,873 views • 2 years ago

Robots can now reconstruct 3D scenes in real time from a single RGB camera. [📍 Projects page + paper] No depth sensor. No retraining. 30 FPS.

Researchers at Imperial College London introduced KV-Tracker, a training-free method that makes heavy models like π³ and Depth Anything 3 fast enough for real-time tracking. The idea is simple. These models use global self-attention, which is powerful but computationally expensive. KV-Tracker caches the key and value pairs from selected keyframes and reuses them for new frames. That cache becomes an implicit scene representation.

Result:
• Up to 30 FPS
• 10 to 15x speedup
• Accurate 6-DoF tracking on benchmarks like TUM RGB-D and 7-Scenes
• Works with monocular RGB only

It also supports object-level tracking with masks and allows saving the KV cache for later reuse. For robotics, this reduces hardware constraints and moves real-time 3D perception closer to practical deployment.

Credit to Marwan Taher at Imperial’s Dyson Robotics Lab and the many others who contributed to this!

📍 Save projects page + paper for later: Video:

If it matters in AI or robotics, you'll read it here first:
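
A minimal sketch of the caching idea, assuming single-head scaled dot-product attention in plain Python: keys and values from keyframes are stored once, and each new frame's query attends over the stored pairs without recomputing them. The class and method names are hypothetical, and this is not KV-Tracker's code.

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention of one query over stored keys and values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    # Subtract the max score before exponentiating for numerical stability.
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


class KeyframeCache:
    """Holds key/value pairs computed once per keyframe (hypothetical helper)."""

    def __init__(self):
        self.keys = []
        self.values = []

    def add_keyframe(self, keys, values):
        # Keyframe tokens are appended once and never recomputed.
        self.keys.extend(keys)
        self.values.extend(values)

    def query(self, query):
        # A new frame only pays for its own query, not for the keyframes.
        return attend(query, self.keys, self.values)


cache = KeyframeCache()
cache.add_keyframe(keys=[[1.0, 0.0]], values=[[2.0, 4.0]])
cache.add_keyframe(keys=[[0.0, 1.0]], values=[[6.0, 8.0]])
print(cache.query([1.0, 0.0]))
```

The saving comes from where the work lands: per-frame cost scales with the new frame's tokens times the cached tokens, rather than re-running attention over every frame jointly.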

Ilir Aliu

53,825 views • 2 months ago

Introducing DeepThought-8B: Transparent reasoning model built on LLaMA-3.1 with test-time compute scaling. - JSON-structured thought chains & controllable inference paths. - ~16GB VRAM, competitive w/ 70B models. - Open model weights, and inference scripts.

Ruliad

219,315 views • 1 year ago

Llama 2: Now on Hugging Chat 🤗🦙 Try out the 70B Chat model for free with super fast inference, web search, and powered by open-source tools! 👉

Hugging Face

403,555 views • 2 years ago

Pi now supports Hugging Face Inference 💫 If you have a HF account -> try Kimi K-2.5: scary smart model running 5x faster than what you're used to... Honestly feels like the future of working with agents.

Victor M

13,853 views • 4 months ago

Very proud to share that we just released Luce KVFlash. Run your preferred model inside Lucebox at 256k context without thinking about the KV cache or OOM, with up to 2.9x faster decoding at long context. Taking inspiration from OS paging and using our speculative prefill method (Luce PFlash), we managed to make KV VRAM usage almost constant by dynamically offloading what is not needed. Open source must win now more than ever.

mrciffa

23,999 views • 7 days ago

parakeet.cpp: native C++/ggml inference for NVIDIA AI Developer's Parakeet, one of the best speech-to-text models out there, from the LocalAI team. Every Parakeet model (TDT/CTC/RNNT/hybrid + cache-aware streaming), byte-for-byte identical output to NeMo, now running anywhere with no Python and even a bit faster, on CPU and GPU. Quantized GGUF on Hugging Face 🤗 Huge thanks to Georgi Gerganov for ggml and to NVIDIA AI Developer for releasing Parakeet! 🧵

Ettore Di Giacinto

55,426 views • 22 days ago

Power of CoreML + Ultralytics YOLO11 on Apple devices! 85 FPS 🥳 This comparison highlights how exporting your PyTorch model to CoreML can significantly boost real-time inference by leveraging Apple’s optimized hardware stack. More details👇 #MachineLearning #research

Muhammad Rizwan Munawar

85,378 views • 9 months ago

Bro said "let's try the Cache wallbang" and this happened 😭 (via u/HorizonBC)

Ozzny

438,108 views • 1 month ago

MLX Swift LLM example works with:
- Mistral / Llama
- Phi-2
- Qwen 1.5
- Starcoder 2

Quick-start: Qwen 1.5 0.5B runs pretty fast in 16-bit on my iPhone 14, no quantization needed:

Awni Hannun

30,441 views • 2 years ago

Open-source low-code tool for building customized LLM orchestration flows and AI agents.

Unwind AI

26,745 views • 1 year ago

You can now run inference directly on the Llama 4 Hugging Face model page – powered by Together AI!

Together AI

21,489 views • 1 year ago