ani

@anirudhbv_ce • 6,923 subscribers

19, incoming @nvidia / ml @sentra_app (a16z) | prev. ml @shopify @voiceflow | mlh top 50 | uwaterloo ce '30

Shorts

pip install spectralquant ✂️

Up to 6.62x KV cache compression for LLMs and transformers. Same model. Faster outputs. Smaller KV cache.

Try now (2 mins):
- KV cache integration via Hugging Face's DynamicCache
- Three presets: 5.95x (paper), 6.55x (validated), 6.68x (edge)
- Mistral 7B / Qwen 2.5 7B / Llama 3.1 8B verified
- Pure PyTorch + future CUDA kernel support
- Auto-calibration from a bundled corpus

📰 Paper:
💻 Code, quickstart, and benchmarks:

#LLM #Inference #PyTorch #OpenSource #MachineLearning #KVCache


17,235 views
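To put the preset ratios in the post above into memory terms, here is a rough back-of-the-envelope sketch. It does not use the spectralquant API. The helper names are hypothetical, and the model shape (32 layers, 8 KV heads, head dimension 128, fp16) is an assumed Llama-3.1-8B-style configuration, not something stated in the post.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Uncompressed KV cache size in bytes (hypothetical helper).

    The leading factor of 2 accounts for storing both keys and values
    in every layer.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


def compressed_bytes(raw_bytes, ratio):
    """Size after compression at a given ratio (e.g. 5.95 for 5.95x)."""
    return raw_bytes / ratio


if __name__ == "__main__":
    # Assumed shape: 32 layers, 8 KV heads, head_dim 128, fp16, 8192 tokens.
    raw = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                         seq_len=8192)
    # Preset ratios as quoted in the post.
    for name, ratio in [("paper", 5.95), ("validated", 6.55), ("edge", 6.68)]:
        mib = compressed_bytes(raw, ratio) / 2**20
        print(f"{name}: {raw / 2**20:.0f} MiB -> {mib:.1f} MiB")
```

Under these assumptions the uncompressed cache for a single 8192-token sequence is 1 GiB, so each preset would bring it down to well under 200 MiB.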

Videos


I implemented Google Research's TurboQuant as a CUDA-native compression engine on Blackwell B200. 5x KV cache compression on Qwen 2.5-1.5B, near-lossless attention scores, generating live from compressed memory.

5 custom cuTile CUDA kernels ft:
- fused attention (with QJL corrections)
- online softmax
- on-chip cache decompression
- pipelined TMA loads

Try it out:

s/o Bryce, the CUDA Colonel and the cuTile team at NVIDIA for lending me Blackwell GPU access :)

cc sunny madra Gavin


809,610 views • 3 months ago
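For readers unfamiliar with the "online softmax" item in the kernel list above, here is a minimal pure-Python sketch of the general technique. It is not the author's cuTile kernel. The function name and block size are illustrative assumptions. The idea is to process scores block by block while keeping a running maximum and a running sum, rescaling the sum whenever the maximum grows, so the normalizer never requires a separate full pass over all scores up front.

```python
import math


def online_softmax(scores, block_size=4):
    """Softmax computed with a streaming (online) normalizer.

    Illustrative sketch only: scores are consumed in blocks, and the
    running sum is rescaled each time a larger maximum is seen.
    """
    running_max = float("-inf")
    running_sum = 0.0
    for start in range(0, len(scores), block_size):
        block = scores[start:start + block_size]
        new_max = max(running_max, max(block))
        # Rescale the sum accumulated so far to the new maximum, then
        # add this block's contributions relative to the same maximum.
        running_sum = (running_sum * math.exp(running_max - new_max)
                       + sum(math.exp(s - new_max) for s in block))
        running_max = new_max
    # Final pass: normalize every score with the streamed statistics.
    return [math.exp(s - running_max) / running_sum for s in scores]


if __name__ == "__main__":
    probs = online_softmax([1.0, 3.0, -2.0, 0.5, 7.0, 2.0], block_size=2)
    print(probs, sum(probs))
```

In a fused attention kernel the same rescaling is typically applied to the partial output accumulator as well, which is what lets each block of keys and values be read only once.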
