
ani
@anirudhbv_ce • 6,795 subscribers
19, incoming @nvidia / ml @sentra_app (a16z) | prev. ml @shopify @voiceflow | mlh top 50 | uwaterloo ce '30
Videos

I implemented Google Research's TurboQuant as a CUDA-native compression engine on Blackwell B200. 5x KV cache compression on Qwen 2.5-1.5B, near-lossless attention scores, generating live from compressed memory.

5 custom cuTile CUDA kernels ft:
- fused attention (with QJL corrections)
- online softmax
- on-chip cache decompression
- pipelined TMA loads

Try it out:

s/o Bryce, the CUDA Colonel and the cuTile team at NVIDIA for lending me Blackwell GPU access :) cc sunny madra Gavin
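For readers unfamiliar with the "online softmax" item in the kernel list, here is a minimal pure-Python sketch of the general technique: a single streaming pass keeps a running maximum and a rescaled running sum, so the normalizer is available without a separate max pass. This is an illustration only, not the post's cuTile kernel; the function name `online_softmax` and the assumption of finite input scores are mine.

```python
import math

def online_softmax(scores):
    """Softmax with the max and normalizer computed in one streaming pass.

    Hypothetical helper for illustration; assumes all scores are finite.
    """
    running_max = float("-inf")
    running_sum = 0.0
    for s in scores:
        new_max = max(running_max, s)
        # Rescale the sum accumulated so far to the new maximum,
        # then add the contribution of the current score.
        running_sum = running_sum * math.exp(running_max - new_max) + math.exp(s - new_max)
        running_max = new_max
    # Final normalization uses the streamed max and sum.
    return [math.exp(s - running_max) / running_sum for s in scores]

probs = online_softmax([1.0, 2.0, 3.0])
```

In a fused attention kernel the same rescaling trick is typically applied per tile of keys, which is what lets the scores be consumed as they are produced instead of being stored in full.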
ani • 805,934 views • 2 months ago