正在加载视频...

视频加载失败

加载此视频时出现问题。这可能是由于临时网络问题，或视频可能不可用。

I implemented Google Research's TurboQuant as a CUDA-native compression engine on Blackwell B200. 5x KV cache compression on Qwen 2.5-1.5B, near-loseless attention scores, generating live from compressed memory. 5 custom cuTile CUDA kernels ft: - fused attention (with QJL corrections) - online softmax -on-chip cache decompression - pipelined TMA... show more

ani

6,923 subscribers

809,610 次观看 • 3 个月前 •via X (Twitter)

Anya Rossi• Live Now

Private livecam show

0 条评论

暂无评论

原始帖子的评论将显示在这里

相关视频

I trained a 12M parameter LLM on my own ML framework using a Rust backend and CUDA kernels for flash attention, AdamW, and more. Wrote the full transformer architecture, and BPE tokenizer from scratch. The framework features: - Custom CUDA kernels (Flash Attention, fused LayerNorm, fused GELU) for 3x increased throughput - Automatic WebGPU fallback for non-NVIDIA devices - TypeScript API with Rust compute backend - One npm install to get started, prebuilt binaries for every platform Try out the model for yourself: Built with Reese Chong. Check out the repos and blog if you want to learn more. Shoutout to Modal for the compute credits allowing me to train on 2 A100 GPUs without going broke cc sunny madra Gavin

I trained a 12M parameter LLM on my own ML framework using a Rust backend and CUDA kernels for flash attention, AdamW, and more. Wrote the full transformer architecture, and BPE tokenizer from scratch. The framework features: - Custom CUDA kernels (Flash Attention, fused LayerNorm, fused GELU) for 3x increased throughput - Automatic WebGPU fallback for non-NVIDIA devices - TypeScript API with Rust compute backend - One npm install to get started, prebuilt binaries for every platform Try out the model for yourself: Built with Reese Chong. Check out the repos and blog if you want to learn more. Shoutout to Modal for the compute credits allowing me to train on 2 A100 GPUs without going broke cc sunny madra Gavin

Aadi Kulshrestha

813,943 次观看 • 3 个月前

I added KV caching and INT8 KV quantization to our transformer inference, improving throughput by 35x. All of this was done from scratch in Rust + CUDA, on top of a homemade ML framework. On a 4-token prompt with 252 generated tokens: - Original: 0.76 tok/s - KV cache fp32: 27.21 tok/s - KV cache int8 (quantized): 27.29 tok/s Try it out yourself here: In practice: - KV caching gave us about a 35x end-to-end speedup - INT8 KV cache kept roughly the same speed as fp32 but cut KV cache memory by 3.78x FP32 cache used 4.5 MB in this run while the INT8 cache used only 1.19 MB This simple change to inference created a huge impact on performance. To learn more about the KV cache and other optimizations like this, check out the blog at

I added KV caching and INT8 KV quantization to our transformer inference, improving throughput by 35x. All of this was done from scratch in Rust + CUDA, on top of a homemade ML framework. On a 4-token prompt with 252 generated tokens: - Original: 0.76 tok/s - KV cache fp32: 27.21 tok/s - KV cache int8 (quantized): 27.29 tok/s Try it out yourself here: In practice: - KV caching gave us about a 35x end-to-end speedup - INT8 KV cache kept roughly the same speed as fp32 but cut KV cache memory by 3.78x FP32 cache used 4.5 MB in this run while the INT8 cache used only 1.19 MB This simple change to inference created a huge impact on performance. To learn more about the KV cache and other optimizations like this, check out the blog at

Reese Chong

52,588 次观看 • 3 个月前

A tricky LLM interview question: You're serving a reasoning model on vLLM, and it keeps running out of GPU memory on long traces. So you add KV cache compression and evict 90% of the cached tokens. VRAM usage stays as is and GPU still runs out of memory. Why? (answer below) Evicting 90% of the KV cache can free almost none of the memory it was using. This sounds counterintuitive, but it follows directly from how production servers store the cache today. The KV cache grows with every token a model generates. Each token appends its key and value vectors across every layer, and nothing is freed while generation continues. This is the dominant memory cost for reasoning models. If a 32K-token CoT caches ~32K tokens of KV vectors, a Qwen3-32B with 4-bit weights will run out-of-memory around 24K tokens on a 24GB GPU. One obvious solution is to keep the important tokens and drop the rest, since attention is sparse enough to allow it. But this does not solve the memory problem yet. The reason is paged attention, which is the memory manager behind vLLM and most production servers. Under the hood, it splits GPU memory into fixed physical blocks, each one holds the KV for about 16 tokens. This block returns to the allocator only when every slot inside it is empty. Since the eviction logic selects tokens by importance, and such tokens are scattered across blocks... ...so despite eviction, almost every block is left with at least some survivor tokens. For instance, if the logic evicts 14k of 16k tokens across 1,000 blocks, most likely every block will still have a token. This means the allocator frees almost nothing. Placing the new tokens into those freed slots is not ideal because it breaks the cache's layout. Say token 16,001 arrives, and it's placed in the slot the 40th token used to hold. The cache now reads position 38, then 16,001, then 41, so the cache is no longer in token order. Attention can still compute the right answer from that, but only if every slot now carries a separate note recording which position it actually holds. This introduces another bookkeeping cost that an in-order layout inherently avoids. So the cache is logically 90% smaller and still physically the same size. Many compression results miss this because they measure on pre-allocated contiguous tensors rather than a paged server. There's another problem. Eviction methods pick which tokens to keep by looking at the attention scores themselves (as expected). But fast attention kernels used in production, like FlashAttention, never save those scores. They compute attention in small pieces and throw the full score grid away as they go, which is also why they're fast. So the exact signal eviction methods need isn't available in memory. The workaround is to fall back to eager attention and build the full matrix, which gives up the speed FlashAttention was there to provide. NVIDIA published a method called TriAttention to solve both these problems. It never needs attention scores. Instead, it scores tokens from the geometry of the model's key and query vectors before RoPE is applied, where those vectors sit in stable clusters. For the memory problem, it runs a compaction pass every 128 decoded tokens. The surviving tokens slide forward to close the holes eviction creates, so whole blocks empty out and return to the allocator while the cache stays in token order. On long reasoning traces, the approach matches full-attention accuracy while decoding 2.5x faster and using 10.7x less KV memory. KV cache compression is a big infrastructure problem. The number that decides whether it works is the count of freed blocks, not the count of evicted tokens. You can find the NVIDIA write-up here: I wrote a first-principles breakdown of how the KV cache works. It walks through why the model stores keys and values at all, why the cache grows with every token, and a comparison of LLM generation speed with and without KV caching. Read it below.

A tricky LLM interview question: You're serving a reasoning model on vLLM, and it keeps running out of GPU memory on long traces. So you add KV cache compression and evict 90% of the cached tokens. VRAM usage stays as is and GPU still runs out of memory. Why? (answer below) Evicting 90% of the KV cache can free almost none of the memory it was using. This sounds counterintuitive, but it follows directly from how production servers store the cache today. The KV cache grows with every token a model generates. Each token appends its key and value vectors across every layer, and nothing is freed while generation continues. This is the dominant memory cost for reasoning models. If a 32K-token CoT caches ~32K tokens of KV vectors, a Qwen3-32B with 4-bit weights will run out-of-memory around 24K tokens on a 24GB GPU. One obvious solution is to keep the important tokens and drop the rest, since attention is sparse enough to allow it. But this does not solve the memory problem yet. The reason is paged attention, which is the memory manager behind vLLM and most production servers. Under the hood, it splits GPU memory into fixed physical blocks, each one holds the KV for about 16 tokens. This block returns to the allocator only when every slot inside it is empty. Since the eviction logic selects tokens by importance, and such tokens are scattered across blocks... ...so despite eviction, almost every block is left with at least some survivor tokens. For instance, if the logic evicts 14k of 16k tokens across 1,000 blocks, most likely every block will still have a token. This means the allocator frees almost nothing. Placing the new tokens into those freed slots is not ideal because it breaks the cache's layout. Say token 16,001 arrives, and it's placed in the slot the 40th token used to hold. The cache now reads position 38, then 16,001, then 41, so the cache is no longer in token order. Attention can still compute the right answer from that, but only if every slot now carries a separate note recording which position it actually holds. This introduces another bookkeeping cost that an in-order layout inherently avoids. So the cache is logically 90% smaller and still physically the same size. Many compression results miss this because they measure on pre-allocated contiguous tensors rather than a paged server. There's another problem. Eviction methods pick which tokens to keep by looking at the attention scores themselves (as expected). But fast attention kernels used in production, like FlashAttention, never save those scores. They compute attention in small pieces and throw the full score grid away as they go, which is also why they're fast. So the exact signal eviction methods need isn't available in memory. The workaround is to fall back to eager attention and build the full matrix, which gives up the speed FlashAttention was there to provide. NVIDIA published a method called TriAttention to solve both these problems. It never needs attention scores. Instead, it scores tokens from the geometry of the model's key and query vectors before RoPE is applied, where those vectors sit in stable clusters. For the memory problem, it runs a compaction pass every 128 decoded tokens. The surviving tokens slide forward to close the holes eviction creates, so whole blocks empty out and return to the allocator while the cache stays in token order. On long reasoning traces, the approach matches full-attention accuracy while decoding 2.5x faster and using 10.7x less KV memory. KV cache compression is a big infrastructure problem. The number that decides whether it works is the count of freed blocks, not the count of evicted tokens. You can find the NVIDIA write-up here: I wrote a first-principles breakdown of how the KV cache works. It walks through why the model stores keys and values at all, why the cache grows with every token, and a comparison of LLM generation speed with and without KV caching. Read it below.

Avi Chawla

269,406 次观看 • 1 个月前

A tricky LLM interview question: You're serving a reasoning model on vLLM, and it keeps running out of GPU memory on long traces. So you add KV cache compression and evict 90% of the cached tokens. VRAM usage stays as is and GPU still runs out of memory. Why? (answer below) Evicting 90% of the KV cache can free almost none of the memory it was using. This sounds counterintuitive, but it follows directly from how production servers store the cache today. The KV cache grows with every token a model generates. Each token appends its key and value vectors across every layer, and nothing is freed while generation continues. This is the dominant memory cost for reasoning models. If a 32K-token CoT caches ~32K tokens of KV vectors, a Qwen3-32B with 4-bit weights will run out-of-memory around 24K tokens on a 24GB GPU. One obvious solution is to keep the important tokens and drop the rest, since attention is sparse enough to allow it. But this does not solve the memory problem yet. The reason is paged attention, which is the memory manager behind vLLM and most production servers. Under the hood, it splits GPU memory into fixed physical blocks, each one holds the KV for about 16 tokens. This block returns to the allocator only when every slot inside it is empty. Since the eviction logic selects tokens by importance, and such tokens are scattered across blocks... ...so despite eviction, almost every block is left with at least some survivor tokens. For instance, if the logic evicts 14k of 16k tokens across 1,000 blocks, most likely every block will still have a token. This means the allocator frees almost nothing. Placing the new tokens into those freed slots is not ideal because it breaks the cache's layout. Say token 16,001 arrives, and it's placed in the slot the 40th token used to hold. The cache now reads position 38, then 16,001, then 41, so the cache is no longer in token order. Attention can still compute the right answer from that, but only if every slot now carries a separate note recording which position it actually holds. This introduces another bookkeeping cost that an in-order layout inherently avoids. So the cache is logically 90% smaller and still physically the same size. Many compression results miss this because they measure on pre-allocated contiguous tensors rather than a paged server. There's another problem. Eviction methods pick which tokens to keep by looking at the attention scores themselves (as expected). But fast attention kernels used in production, like FlashAttention, never save those scores. They compute attention in small pieces and throw the full score grid away as they go, which is also why they're fast. So the exact signal eviction methods need isn't available in memory. The workaround is to fall back to eager attention and build the full matrix, which gives up the speed FlashAttention was there to provide. NVIDIA published a method called TriAttention to solve both these problems. It never needs attention scores. Instead, it scores tokens from the geometry of the model's key and query vectors before RoPE is applied, where those vectors sit in stable clusters. For the memory problem, it runs a compaction pass every 128 decoded tokens. The surviving tokens slide forward to close the holes eviction creates, so whole blocks empty out and return to the allocator while the cache stays in token order. On long reasoning traces, the approach matches full-attention accuracy while decoding 2.5x faster and using 10.7x less KV memory. KV cache compression is a big infrastructure problem. The number that decides whether it works is the count of freed blocks, not the count of evicted tokens. You can find the NVIDIA write-up here: I wrote a first-principles breakdown of how the KV cache works. It walks through why the model stores keys and values at all, why the cache grows with every token, and a comparison of LLM generation speed with and without KV caching. Read it below.

A tricky LLM interview question: You're serving a reasoning model on vLLM, and it keeps running out of GPU memory on long traces. So you add KV cache compression and evict 90% of the cached tokens. VRAM usage stays as is and GPU still runs out of memory. Why? (answer below) Evicting 90% of the KV cache can free almost none of the memory it was using. This sounds counterintuitive, but it follows directly from how production servers store the cache today. The KV cache grows with every token a model generates. Each token appends its key and value vectors across every layer, and nothing is freed while generation continues. This is the dominant memory cost for reasoning models. If a 32K-token CoT caches ~32K tokens of KV vectors, a Qwen3-32B with 4-bit weights will run out-of-memory around 24K tokens on a 24GB GPU. One obvious solution is to keep the important tokens and drop the rest, since attention is sparse enough to allow it. But this does not solve the memory problem yet. The reason is paged attention, which is the memory manager behind vLLM and most production servers. Under the hood, it splits GPU memory into fixed physical blocks, each one holds the KV for about 16 tokens. This block returns to the allocator only when every slot inside it is empty. Since the eviction logic selects tokens by importance, and such tokens are scattered across blocks... ...so despite eviction, almost every block is left with at least some survivor tokens. For instance, if the logic evicts 14k of 16k tokens across 1,000 blocks, most likely every block will still have a token. This means the allocator frees almost nothing. Placing the new tokens into those freed slots is not ideal because it breaks the cache's layout. Say token 16,001 arrives, and it's placed in the slot the 40th token used to hold. The cache now reads position 38, then 16,001, then 41, so the cache is no longer in token order. Attention can still compute the right answer from that, but only if every slot now carries a separate note recording which position it actually holds. This introduces another bookkeeping cost that an in-order layout inherently avoids. So the cache is logically 90% smaller and still physically the same size. Many compression results miss this because they measure on pre-allocated contiguous tensors rather than a paged server. There's another problem. Eviction methods pick which tokens to keep by looking at the attention scores themselves (as expected). But fast attention kernels used in production, like FlashAttention, never save those scores. They compute attention in small pieces and throw the full score grid away as they go, which is also why they're fast. So the exact signal eviction methods need isn't available in memory. The workaround is to fall back to eager attention and build the full matrix, which gives up the speed FlashAttention was there to provide. NVIDIA published a method called TriAttention to solve both these problems. It never needs attention scores. Instead, it scores tokens from the geometry of the model's key and query vectors before RoPE is applied, where those vectors sit in stable clusters. For the memory problem, it runs a compaction pass every 128 decoded tokens. The surviving tokens slide forward to close the holes eviction creates, so whole blocks empty out and return to the allocator while the cache stays in token order. On long reasoning traces, the approach matches full-attention accuracy while decoding 2.5x faster and using 10.7x less KV memory. KV cache compression is a big infrastructure problem. The number that decides whether it works is the count of freed blocks, not the count of evicted tokens. You can find the NVIDIA write-up here: I wrote a first-principles breakdown of how the KV cache works. It walks through why the model stores keys and values at all, why the cache grows with every token, and a comparison of LLM generation speed with and without KV caching. Read it below.

Akshay 🚀

57,691 次观看 • 1 天前

Google's Gemma 4 26B A4B QAT hits 25+ tokens/sec and 320+ tokens/sec prefill on 8 GB VRAM (RTX 4060) + 16 GB RAM using TurboQuant Prefill just went from 200 → 320+ tok/s on the same 8GB card. 1.6x, no new hardware, no new quant, just a KV cache trick stacked on top of the Gemma 4 26B MoE setup from a few days ago. A few days ago I posted Gemma 4 26B A4B hitting 28 tok/s decode on 8GB VRAM using native MTP. prefill was stuck around 200 tok/s. fair callout by the community. So today I tested something I'd already been meaning to try: TheTom/llama-cpp-turboquant, the TurboQuant KV cache fork by Tom Turney (Tom Turney). (github link in the comments) thanks to him, the fork just got resynced to mainline, so MTP + TurboQuant now run together cleanly (I didnt see any meaningful gains by using MTP with this setup though but you can try). The flags (No MTP): -m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf -cnv -c 64000 --cache-type-k q8_0 --cache-type-v turbo3 Results on the same RTX 4060 8GB, tested with a 27k token prompt at 64k context loaded: Prefill: 200 tok/s → 320+ tok/s Decode: stayed above 25 tok/s (without MTP) Why it works: TurboQuant uses walsh hadamard rotation + polar quantization on the KV cache. keys are sensitive to compression, values aren't much, so it splits the difference: K stays at q8_0, V drops to turbo3 (~3 bits). bonus from the memory savings: same 8GB card can now stretch to 100-120k context with minimal decode penalty. It should now be snappier with any agent harness such as hermes agent without compromise on intelligence. If you're already running Gemma 4 on a small card, this stacks on top for free. Try --cache-type-k q8_0 --cache-type-v turbo3 on your setup and report back what your prefill/decode split looks like. unsloth model gguf and llama.cpp turboquant fork links in the comments. what's your prefill number before vs after?

Google's Gemma 4 26B A4B QAT hits 25+ tokens/sec and 320+ tokens/sec prefill on 8 GB VRAM (RTX 4060) + 16 GB RAM using TurboQuant Prefill just went from 200 → 320+ tok/s on the same 8GB card. 1.6x, no new hardware, no new quant, just a KV cache trick stacked on top of the Gemma 4 26B MoE setup from a few days ago. A few days ago I posted Gemma 4 26B A4B hitting 28 tok/s decode on 8GB VRAM using native MTP. prefill was stuck around 200 tok/s. fair callout by the community. So today I tested something I'd already been meaning to try: TheTom/llama-cpp-turboquant, the TurboQuant KV cache fork by Tom Turney (Tom Turney). (github link in the comments) thanks to him, the fork just got resynced to mainline, so MTP + TurboQuant now run together cleanly (I didnt see any meaningful gains by using MTP with this setup though but you can try). The flags (No MTP): -m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf -cnv -c 64000 --cache-type-k q8_0 --cache-type-v turbo3 Results on the same RTX 4060 8GB, tested with a 27k token prompt at 64k context loaded: Prefill: 200 tok/s → 320+ tok/s Decode: stayed above 25 tok/s (without MTP) Why it works: TurboQuant uses walsh hadamard rotation + polar quantization on the KV cache. keys are sensitive to compression, values aren't much, so it splits the difference: K stays at q8_0, V drops to turbo3 (~3 bits). bonus from the memory savings: same 8GB card can now stretch to 100-120k context with minimal decode penalty. It should now be snappier with any agent harness such as hermes agent without compromise on intelligence. If you're already running Gemma 4 on a small card, this stacks on top for free. Try --cache-type-k q8_0 --cache-type-v turbo3 on your setup and report back what your prefill/decode split looks like. unsloth model gguf and llama.cpp turboquant fork links in the comments. what's your prefill number before vs after?

Alok

119,821 次观看 • 1 个月前

🎉 Congrats to MiniMax (official) on releasing MiniMax M3! Frontier coding and agentic capabilities, native image and video input, computer use, and a 1M-token context window, all in a single open model. At the heart of M3 is MSA, a new sparse attention architecture: instead of attending densely over the full KV cache, each query scores 128-token KV blocks and runs attention only over the top blocks. That is what makes 1M-token context practical to serve. M3 runs in vLLM with day-0 support, verified on NVIDIA and AMD hardware: ✨ MSA sparse attention with dedicated prefill and decode kernels ✨ 1M-token context serving with prefix caching and chunked prefill ✨ BF16 and MXFP8 checkpoints, with MoE backends for both Hopper and Blackwell ✨ Native multimodal input (image + video) ✨ Tool calling, reasoning parsing, and thinking-mode control for agent workloads Day-0 support like this is a true team effort. Grateful to the teams at MiniMax (official), NVIDIA AI, AI at AMD, and Inferact, and to the vLLM community for making it happen. 🙏 Deep dive into the implementation, kernel work, and deployment recipes: 🔗

🎉 Congrats to MiniMax (official) on releasing MiniMax M3! Frontier coding and agentic capabilities, native image and video input, computer use, and a 1M-token context window, all in a single open model. At the heart of M3 is MSA, a new sparse attention architecture: instead of attending densely over the full KV cache, each query scores 128-token KV blocks and runs attention only over the top blocks. That is what makes 1M-token context practical to serve. M3 runs in vLLM with day-0 support, verified on NVIDIA and AMD hardware: ✨ MSA sparse attention with dedicated prefill and decode kernels ✨ 1M-token context serving with prefix caching and chunked prefill ✨ BF16 and MXFP8 checkpoints, with MoE backends for both Hopper and Blackwell ✨ Native multimodal input (image + video) ✨ Tool calling, reasoning parsing, and thinking-mode control for agent workloads Day-0 support like this is a true team effort. Grateful to the teams at MiniMax (official), NVIDIA AI, AI at AMD, and Inferact, and to the vLLM community for making it happen. 🙏 Deep dive into the implementation, kernel work, and deployment recipes: 🔗

vLLM

40,306 次观看 • 1 个月前

Inkling, Thinking Machines' first open model, dropped today: 975B total / 41B active MoE, up to 1M context, reasoning natively over text, images, and audio. Serving and RL support are already live: you can run and shape it on an open stack, starting now. Day 0 support on SGLang SGLang and Miles RadixArk👇 - Inkling's new architecture (ShortConv, attention with relative positional embedding, shared expert sink MoE) is natively implemented and deeply optimized, with prefill full CUDA graph and MXFP8 KV cache - Full parameter and LoRA RL in a customized Megatron backend, train inference consistency via customized kernels, routing replay, and cross-runtime parameter synchronization - DFlash speculative decoding from Modal for low-latency serving Launching now, blog and cookbook in the comments ⬇️

Inkling, Thinking Machines' first open model, dropped today: 975B total / 41B active MoE, up to 1M context, reasoning natively over text, images, and audio. Serving and RL support are already live: you can run and shape it on an open stack, starting now. Day 0 support on SGLang SGLang and Miles RadixArk👇 - Inkling's new architecture (ShortConv, attention with relative positional embedding, shared expert sink MoE) is natively implemented and deeply optimized, with prefill full CUDA graph and MXFP8 KV cache - Full parameter and LoRA RL in a customized Megatron backend, train inference consistency via customized kernels, routing replay, and cross-runtime parameter synchronization - DFlash speculative decoding from Modal for low-latency serving Launching now, blog and cookbook in the comments ⬇️

LMSYS Org

143,172 次观看 • 14 天前

QVAC SDK 0.12.0 is now live, bringing longer context, increased memory optimisation, new modalities, and broader ecosystem support directly to your device. Key Features and Updates: - TurboQuant KV-Cache Quantization: Fit much longer context in the same memory. TurboQuant, an algorithm from Google Research, compresses the KV cache by up to 5x, near-lossless. - Text-to-Video: Generate video from a text prompt, fully local, with the new wan2.1 model in the Diffusion addon - Apple Metal Performance for Flux2-klein: Diffusion on Apple Silicon now matches MLX performance, the native benchmark for Apple GPUs - Robot Control (new VLA addon): A GGML-based Vision-Language-Action addon brings fast, efficient robot control to edge devices - Coding Assistant / Harness Support: QVAC now works with OpenCode and OpenClaw as a local provider. A new @qvac/ai-sdk-provider package automates model registry and provider integration - Cross-Platform Voice: Text-to-speech and Parakeet transcription moved from ONNX to the GGML engine for better CPU and GPU support on macOS, iOS, Windows, Linux, and Android. Parakeet also adds long-term streaming diarization (tracking who spoke when on live audio) - Faster Lightweight Visual Classification: A new GGML-based Classification addon delivers millisecond-level classification, useful where a vision-language model (VLM) would be unnecessarily slow - Under the Hood: Fabric synced to llama.cpp v8828 (from v8189), plus GPU acceleration added to image-upscale models for faster results Full release notes:

QVAC SDK 0.12.0 is now live, bringing longer context, increased memory optimisation, new modalities, and broader ecosystem support directly to your device. Key Features and Updates: - TurboQuant KV-Cache Quantization: Fit much longer context in the same memory. TurboQuant, an algorithm from Google Research, compresses the KV cache by up to 5x, near-lossless. - Text-to-Video: Generate video from a text prompt, fully local, with the new wan2.1 model in the Diffusion addon - Apple Metal Performance for Flux2-klein: Diffusion on Apple Silicon now matches MLX performance, the native benchmark for Apple GPUs - Robot Control (new VLA addon): A GGML-based Vision-Language-Action addon brings fast, efficient robot control to edge devices - Coding Assistant / Harness Support: QVAC now works with OpenCode and OpenClaw as a local provider. A new @qvac/ai-sdk-provider package automates model registry and provider integration - Cross-Platform Voice: Text-to-speech and Parakeet transcription moved from ONNX to the GGML engine for better CPU and GPU support on macOS, iOS, Windows, Linux, and Android. Parakeet also adds long-term streaming diarization (tracking who spoke when on live audio) - Faster Lightweight Visual Classification: A new GGML-based Classification addon delivers millisecond-level classification, useful where a vision-language model (VLM) would be unnecessarily slow - Under the Hood: Fabric synced to llama.cpp v8828 (from v8189), plus GPU acceleration added to image-upscale models for faster results Full release notes:

QVAC

9,932,369 次观看 • 1 个月前

A single RTX 4090 (24 GB VRAM) can run the updated gemma 4 31B (dense) model with a 190,000 context window at 33 tokens/second. The VRAM barrier is dying. Google quietly updated Gemma 4, and Unsloth immediately compiled the new quants. I built llama.cpp from source on Ubuntu 22 to benchmark it. Google's stealth update 2 days ago enabled uniform Flash Attention 4 on Hopper to boost prefill and patched the chat template to improve tool calling. The agentic reasoning gains on the benchmark charts are massive: TB2 (Agents): +4.5% (to 25.8%) Tau2 (Telecom): +10.1% (to 62.7%) Running on Ubuntu 22, CUDA 13.0 with a single NVIDIA GeForce RTX 4090. Here is the exact step by step benchmarking process with a massive 28k tokens prompt and the commands I used to squeeze out maximum context without killing my throughput: # 1. The Baseline (Unquantized KV Cache) I started with full GPU offload (-ngl 99) and pushed the context to 40k. llama.cpp flags: ./build/bin/llama-server -m gemma-4-31B-it-qat-UD-Q4_K_XL.gguf -ngl 99 -c 40000 -fa on --port 8080 -v VRAM: 23.8 GB (maxed out on card) Throughput: Prefill: 2198.81 t/s | Decode: 35.77 t/s (with 28k tokens prompt) # 2. The CPU Split Trap I tried stretching to 80k context by offloading layers to the CPU (-ngl 52). llama.cpp flags: ./build/bin/llama-server -m gemma-4-31B-it-qat-UD-Q4_K_XL.gguf -c 80000 -ngl 52 -fa on --port 8080 -v Throughput: Prefill: 1212.73 t/s | Decode: 5 t/s (with 28k tokens prompt) # 3. The KV Quantization Breakthrough Instead of spilling layers to the CPU, I kept the model fully on card (-ngl 99) but enabled 8-bit KV cache quantization to free up VRAM. flags: ./build/bin/llama-server -m gemma-4-31B-it-qat-UD-Q4_K_XL.gguf -c 100000 --cache-type-k q8_0 --cache-type-v q8_0 -ngl 99 --port 8080 -v VRAM: 23.9 GB Throughput: Prefill: 2139.68 t/s | Decode: 32 t/s (with 28k tokens prompt) Result: 100k tokens of context on a single GPU with practically zero speed loss (and minimal intelligence loss). # 4. The Limit Test (Q4 KV Cache) To find the absolute breaking point, I dropped the KV cache to 4 bit (q4_0) and set -c 190000. flags: ./build/bin/llama-server -m gemma-4-31B-it-qat-UD-Q4_K_XL.gguf -c 190000 --cache-type-k q4_0 --cache-type-v q4_0 -ngl 99 --port 8080 -v VRAM: 23.8 GB Throughput: Prefill: 2206.66 t/s | Decode: 33 t/s (with 28k tokens prompt) (Note: Pushing it to 220k required dropping to -ngl 58 again, which immediately penalized decode down to 17 t/s). # The Tradeoff: For Max Reasoning: Keep your KV cache unquantized (f16). You get pristine reasoning but hit a strict 40k context ceiling. For Massive Document Retrieval: If you need to feed the model giant codebases, use --cache-type-k q4_0. Getting 190k context at 33 tokens/second on a consumer desktop with a 31b dense model is a cheat code. If you’re rocking a single 3090 or 4090 and slept on Gemma 4 earlier, this update is your cue to dust off the terminal. Hugging Face links to the Unsloth QAT quants are in the replies below.

A single RTX 4090 (24 GB VRAM) can run the updated gemma 4 31B (dense) model with a 190,000 context window at 33 tokens/second. The VRAM barrier is dying. Google quietly updated Gemma 4, and Unsloth immediately compiled the new quants. I built llama.cpp from source on Ubuntu 22 to benchmark it. Google's stealth update 2 days ago enabled uniform Flash Attention 4 on Hopper to boost prefill and patched the chat template to improve tool calling. The agentic reasoning gains on the benchmark charts are massive: TB2 (Agents): +4.5% (to 25.8%) Tau2 (Telecom): +10.1% (to 62.7%) Running on Ubuntu 22, CUDA 13.0 with a single NVIDIA GeForce RTX 4090. Here is the exact step by step benchmarking process with a massive 28k tokens prompt and the commands I used to squeeze out maximum context without killing my throughput: # 1. The Baseline (Unquantized KV Cache) I started with full GPU offload (-ngl 99) and pushed the context to 40k. llama.cpp flags: ./build/bin/llama-server -m gemma-4-31B-it-qat-UD-Q4_K_XL.gguf -ngl 99 -c 40000 -fa on --port 8080 -v VRAM: 23.8 GB (maxed out on card) Throughput: Prefill: 2198.81 t/s | Decode: 35.77 t/s (with 28k tokens prompt) # 2. The CPU Split Trap I tried stretching to 80k context by offloading layers to the CPU (-ngl 52). llama.cpp flags: ./build/bin/llama-server -m gemma-4-31B-it-qat-UD-Q4_K_XL.gguf -c 80000 -ngl 52 -fa on --port 8080 -v Throughput: Prefill: 1212.73 t/s | Decode: 5 t/s (with 28k tokens prompt) # 3. The KV Quantization Breakthrough Instead of spilling layers to the CPU, I kept the model fully on card (-ngl 99) but enabled 8-bit KV cache quantization to free up VRAM. flags: ./build/bin/llama-server -m gemma-4-31B-it-qat-UD-Q4_K_XL.gguf -c 100000 --cache-type-k q8_0 --cache-type-v q8_0 -ngl 99 --port 8080 -v VRAM: 23.9 GB Throughput: Prefill: 2139.68 t/s | Decode: 32 t/s (with 28k tokens prompt) Result: 100k tokens of context on a single GPU with practically zero speed loss (and minimal intelligence loss). # 4. The Limit Test (Q4 KV Cache) To find the absolute breaking point, I dropped the KV cache to 4 bit (q4_0) and set -c 190000. flags: ./build/bin/llama-server -m gemma-4-31B-it-qat-UD-Q4_K_XL.gguf -c 190000 --cache-type-k q4_0 --cache-type-v q4_0 -ngl 99 --port 8080 -v VRAM: 23.8 GB Throughput: Prefill: 2206.66 t/s | Decode: 33 t/s (with 28k tokens prompt) (Note: Pushing it to 220k required dropping to -ngl 58 again, which immediately penalized decode down to 17 t/s). # The Tradeoff: For Max Reasoning: Keep your KV cache unquantized (f16). You get pristine reasoning but hit a strict 40k context ceiling. For Massive Document Retrieval: If you need to feed the model giant codebases, use --cache-type-k q4_0. Getting 190k context at 33 tokens/second on a consumer desktop with a 31b dense model is a cheat code. If you’re rocking a single 3090 or 4090 and slept on Gemma 4 earlier, this update is your cue to dust off the terminal. Hugging Face links to the Unsloth QAT quants are in the replies below.

Alok

75,482 次观看 • 11 天前

I've migrated the old Mast3r-SLAM example I had made last year to the latest version of Rerun and made a bunch of improvements! I wanted to spend some time with agents to modernize it. Here's an example of me walking around with my iPhone and getting a dense reconstruction at about 10FPS on a 5090. Heres the following improvements I made. Brought it into the monorepo with proper packaging: • Using prefix.dev pixi-build to get rid of all the mast3r/asmk/lietorch vendored code with just a few small patches. This let me remove so 60k lines of code from the repo! • Don't have to build the lietorch code on my machine anymore, which was taking ~10 minutes to compile (and also made it work on blackwell when it previously did not) Rebuilt the Gradio interface: • Fixed incremental updates, .MOV uploads, and stop behavior • Made the CLI + Gradio interface share the same entry point so updates automatically propagate Upgraded the Rerun integration: • Switched to a multiprocessing async logging strategy • Added video/pointmap/confidence logging • Improved blueprint layout and hid noisy entities from 3D view • Biggest perf win was the async background logger - documented about a ~2.5x speedup from decoupling logging from tracking The newest and most interesting part was my attempt to replace the CUDA kernels for Gauss-Newton ray matching with a Modular Mojo backend. As a Python dev, every time I look at CUDA code I basically shy away as it's pretty difficult for me to understand. Mojo let me rewrite the matching logic in a syntax I'm more comfortable with while still getting near-CUDA performance. Mojo is now the default matching backend with CUDA fallback. One major piece that's missing is the custom PyTorch op path, but I'll eventually do that as well. I heavily leaned on Claude Code to do the CUDA → Mojo migration, and I have no doubt it's not the cleanest or most idiomatic, BUT it's way more readable for me and helps me better understand the underlying algorithm. This was a ton of work, and a large part of why I'm doing it is how the monorepo compounds. This becomes an artifact for the next example I want to build with Claude that I can point to, which will make it even faster to implement. The compounding nature of this is really interesting and part of why I'm spending so much time trying to make things nice and readable.

I've migrated the old Mast3r-SLAM example I had made last year to the latest version of Rerun and made a bunch of improvements! I wanted to spend some time with agents to modernize it. Here's an example of me walking around with my iPhone and getting a dense reconstruction at about 10FPS on a 5090. Heres the following improvements I made. Brought it into the monorepo with proper packaging: • Using prefix.dev pixi-build to get rid of all the mast3r/asmk/lietorch vendored code with just a few small patches. This let me remove so 60k lines of code from the repo! • Don't have to build the lietorch code on my machine anymore, which was taking ~10 minutes to compile (and also made it work on blackwell when it previously did not) Rebuilt the Gradio interface: • Fixed incremental updates, .MOV uploads, and stop behavior • Made the CLI + Gradio interface share the same entry point so updates automatically propagate Upgraded the Rerun integration: • Switched to a multiprocessing async logging strategy • Added video/pointmap/confidence logging • Improved blueprint layout and hid noisy entities from 3D view • Biggest perf win was the async background logger - documented about a ~2.5x speedup from decoupling logging from tracking The newest and most interesting part was my attempt to replace the CUDA kernels for Gauss-Newton ray matching with a Modular Mojo backend. As a Python dev, every time I look at CUDA code I basically shy away as it's pretty difficult for me to understand. Mojo let me rewrite the matching logic in a syntax I'm more comfortable with while still getting near-CUDA performance. Mojo is now the default matching backend with CUDA fallback. One major piece that's missing is the custom PyTorch op path, but I'll eventually do that as well. I heavily leaned on Claude Code to do the CUDA → Mojo migration, and I have no doubt it's not the cleanest or most idiomatic, BUT it's way more readable for me and helps me better understand the underlying algorithm. This was a ton of work, and a large part of why I'm doing it is how the monorepo compounds. This becomes an artifact for the next example I want to build with Claude that I can point to, which will make it even faster to implement. The compounding nature of this is really interesting and part of why I'm spending so much time trying to make things nice and readable.

Pablo Vela

42,143 次观看 • 3 个月前

QVAC SDK 0.14.0 is live. This release makes the on-device stack faster on mobile, ships the developer-agent path, and takes local text-to-speech to 31 languages. Main highlights: - OpenCode and OpenClaw. The first official OpenCode plugin, plus a maintained OpenClaw compatibility path, both built on managed mode and qvac serve. Point a coding agent at a local model with far less setup and far fewer surprises. - Brain-computer interface transcription, on the SDK. Take recorded neural signal data and decode it into text, fully on-device, no cloud. Stream it in chunks through a simple API. In 0.14 it runs GPU-accelerated on iOS. - Text to Speech in 31 languages with our Supertonic3 upgrade. VOICE AND SPEECH - Supertonic3 multilingual TTS, 5 languages to 31. - Chatterbox and Supertonic now run on the Android GPU, with lower memory use (especially on iOS), quantized s3gen Chatterbox support, and a fix for Chatterbox occasionally emitting random speech. - Whisper transcription now runs on the iOS GPU. Parakeet runs on the Android GPU, with steadier real-time streaming. VISION AND OCR - VLM multi-tile batching: high-resolution Pan and Scan images are encoded in one pass instead of tile by tile, for faster vision throughput. - OCR on ggml (EasyOCR and DocTR) reaches full speed parity with the onnx path, across Metal, OpenCL, and Vulkan. PLATFORM AND RELIABILITY - Dynamic compute backends on Linux: one build picks the right backend at runtime, and opens the door to ROCm and CUDA support without per-backend builds. - Thinking tokens are kept out of the model context, so reasoning no longer fills the KV cache. SDK 0.14.0 is now leaner and faster to start. Let’s build.

QVAC SDK 0.14.0 is live. This release makes the on-device stack faster on mobile, ships the developer-agent path, and takes local text-to-speech to 31 languages. Main highlights: - OpenCode and OpenClaw. The first official OpenCode plugin, plus a maintained OpenClaw compatibility path, both built on managed mode and qvac serve. Point a coding agent at a local model with far less setup and far fewer surprises. - Brain-computer interface transcription, on the SDK. Take recorded neural signal data and decode it into text, fully on-device, no cloud. Stream it in chunks through a simple API. In 0.14 it runs GPU-accelerated on iOS. - Text to Speech in 31 languages with our Supertonic3 upgrade. VOICE AND SPEECH - Supertonic3 multilingual TTS, 5 languages to 31. - Chatterbox and Supertonic now run on the Android GPU, with lower memory use (especially on iOS), quantized s3gen Chatterbox support, and a fix for Chatterbox occasionally emitting random speech. - Whisper transcription now runs on the iOS GPU. Parakeet runs on the Android GPU, with steadier real-time streaming. VISION AND OCR - VLM multi-tile batching: high-resolution Pan and Scan images are encoded in one pass instead of tile by tile, for faster vision throughput. - OCR on ggml (EasyOCR and DocTR) reaches full speed parity with the onnx path, across Metal, OpenCL, and Vulkan. PLATFORM AND RELIABILITY - Dynamic compute backends on Linux: one build picks the right backend at runtime, and opens the door to ROCm and CUDA support without per-backend builds. - Thinking tokens are kept out of the model context, so reasoning no longer fills the KV cache. SDK 0.14.0 is now leaner and faster to start. Let’s build.

QVAC

23,973,950 次观看 • 1 个月前

The Cost of Intelligence is Heading to Zero | Hyperspace P2P Distributed Cache We present to you our breakthrough cross-domain work across AI, distributed systems, cryptography, game theory to solve the primary structural inefficiency at the heart of AI infrastructure: most inference is redundant. Google has reported that only 15% of daily searches are truly novel. The rest are repeats or close variants. LLM inference inherits this same power-law distribution. Enterprise chatbots see 70-80% of queries fall into a handful of intent categories. System prompts are identical across 100% of requests within an application. The KV attention state for "You are a helpful assistant" has been computed billions of times, on millions of GPUs, identically. And yet every AI lab, every startup, every self-hosted deployment - computes and caches these results independently. There is no shared layer. No global memory. Every provider pays the full compute cost for every query, even when the answer already exists somewhere in the network. This is the problem Hyperspace solves where distributed cache operates at three levels, each catching a different class of redundancy: 1. Response cache Same prompt, same model, same parameters - instant cached response from any node in the network. SHA-256 hash lookup via DHT, with cryptographic cache proofs linking every response to its original inference execution. No trust required. Fetchers re-announce as providers, so popular responses replicate naturally across more nodes. 2. KV prefix cache Same system prompt tokens - skip the most expensive part of inference entirely. Prefill (computing Key-Value attention states) is deterministic: same model plus same tokens always produces identical KV state. The network caches these states using erasure coding and distributes them via the routing network. New questions that share a common prefix resume generation from cached state instead of recomputing from scratch. 3. Routing to cached nodes Instead of transferring KV state across the network for every request, Hyperspace routes the request to the node that already has the state loaded in VRAM. The request goes to the cache, not the cache to the request. Together, these three layers mean that 70-90% of inference requests at network scale never require full GPU computation. This work doesn't exist in isolation. It builds on research from across the industry: SGLang's RadixAttention demonstrated that automatic prefix sharing can yield up to 5x speedup on structured LLM workloads. Moonshot AI's Mooncake built an entire KV-cache-centric disaggregated architecture for production serving at Kimi. Anthropic, OpenAI, and Google all launched prompt caching products in 2024 - priced at 50-90% discounts - because system prompt reuse is so pervasive that it changes the economics of inference. What all of these systems share is a common limitation: they operate within a single organization's infrastructure. SGLang caches prefixes within one server. Mooncake disaggregates KV cache within one datacenter. Anthropic's prompt caching works within one API provider's fleet. None of them can share cached state across organizational boundaries. Hyperspace removes this boundary. The cache is global. A response computed by a node in Tokyo is immediately available to a node in Berlin. A KV prefix state generated for Qwen-32B on one machine is verifiable and reusable by any other machine running the same model. The routing network provides the delivery guarantees, the erasure coding provides the redundancy, and the cache proofs provide the trust. What this means for the cost of intelligence Big AI labs scale linearly: twice the users means twice the GPU spend. Every query is a cost center. Their internal caching helps, but it's siloed - Lab A's cache can't serve Lab B's users, and neither can serve a self-hosted Llama deployment. Hyperspace scales sub-linearly. Every new node that joins the network adds to the global cache. Every inference result enriches the cache for all future requests. The cache hit rate rises with network size because query distributions follow a power law - the most common questions are asked exponentially more often than rare ones. The implication is simple: as the network grows, the effective cost per inference drops. Not linearly. Logarithmically. At 10 million nodes, we estimate 75-90% of all inference requests can be served from cache, eliminating 400,000+ MWh of energy consumption per year and avoiding over 200,000 tons of CO2 emissions. The first person to ask a question pays the compute cost. Everyone after them gets the answer for free, with cryptographic proof that it's authentic. Training is competitive. Inference is shared Open-weight models are converging on quality with closed models. Labs will continue to differentiate on training - data curation, architecture innovation, RLHF tuning. That's where the real intellectual property lives. But inference is a commodity. Two copies of Qwen-32B running the same prompt produce the same KV state and the same response, byte for byte, regardless of whose GPU runs the matrix multiplication. There is no moat in multiplying matrices. The moat is in training the weights. A global distributed cache makes this separation explicit. It doesn't matter who trained the model. Once the weights are open, the inference cost approaches zero at scale - because the network remembers every answer and can prove it's correct. No lab, no matter how well-funded, can match this. They cannot share caches across competitors. They scale linearly. The network scales logarithmically. The marginal cost of intelligence approaches zero. That's the endgame.

The Cost of Intelligence is Heading to Zero | Hyperspace P2P Distributed Cache We present to you our breakthrough cross-domain work across AI, distributed systems, cryptography, game theory to solve the primary structural inefficiency at the heart of AI infrastructure: most inference is redundant. Google has reported that only 15% of daily searches are truly novel. The rest are repeats or close variants. LLM inference inherits this same power-law distribution. Enterprise chatbots see 70-80% of queries fall into a handful of intent categories. System prompts are identical across 100% of requests within an application. The KV attention state for "You are a helpful assistant" has been computed billions of times, on millions of GPUs, identically. And yet every AI lab, every startup, every self-hosted deployment - computes and caches these results independently. There is no shared layer. No global memory. Every provider pays the full compute cost for every query, even when the answer already exists somewhere in the network. This is the problem Hyperspace solves where distributed cache operates at three levels, each catching a different class of redundancy: 1. Response cache Same prompt, same model, same parameters - instant cached response from any node in the network. SHA-256 hash lookup via DHT, with cryptographic cache proofs linking every response to its original inference execution. No trust required. Fetchers re-announce as providers, so popular responses replicate naturally across more nodes. 2. KV prefix cache Same system prompt tokens - skip the most expensive part of inference entirely. Prefill (computing Key-Value attention states) is deterministic: same model plus same tokens always produces identical KV state. The network caches these states using erasure coding and distributes them via the routing network. New questions that share a common prefix resume generation from cached state instead of recomputing from scratch. 3. Routing to cached nodes Instead of transferring KV state across the network for every request, Hyperspace routes the request to the node that already has the state loaded in VRAM. The request goes to the cache, not the cache to the request. Together, these three layers mean that 70-90% of inference requests at network scale never require full GPU computation. This work doesn't exist in isolation. It builds on research from across the industry: SGLang's RadixAttention demonstrated that automatic prefix sharing can yield up to 5x speedup on structured LLM workloads. Moonshot AI's Mooncake built an entire KV-cache-centric disaggregated architecture for production serving at Kimi. Anthropic, OpenAI, and Google all launched prompt caching products in 2024 - priced at 50-90% discounts - because system prompt reuse is so pervasive that it changes the economics of inference. What all of these systems share is a common limitation: they operate within a single organization's infrastructure. SGLang caches prefixes within one server. Mooncake disaggregates KV cache within one datacenter. Anthropic's prompt caching works within one API provider's fleet. None of them can share cached state across organizational boundaries. Hyperspace removes this boundary. The cache is global. A response computed by a node in Tokyo is immediately available to a node in Berlin. A KV prefix state generated for Qwen-32B on one machine is verifiable and reusable by any other machine running the same model. The routing network provides the delivery guarantees, the erasure coding provides the redundancy, and the cache proofs provide the trust. What this means for the cost of intelligence Big AI labs scale linearly: twice the users means twice the GPU spend. Every query is a cost center. Their internal caching helps, but it's siloed - Lab A's cache can't serve Lab B's users, and neither can serve a self-hosted Llama deployment. Hyperspace scales sub-linearly. Every new node that joins the network adds to the global cache. Every inference result enriches the cache for all future requests. The cache hit rate rises with network size because query distributions follow a power law - the most common questions are asked exponentially more often than rare ones. The implication is simple: as the network grows, the effective cost per inference drops. Not linearly. Logarithmically. At 10 million nodes, we estimate 75-90% of all inference requests can be served from cache, eliminating 400,000+ MWh of energy consumption per year and avoiding over 200,000 tons of CO2 emissions. The first person to ask a question pays the compute cost. Everyone after them gets the answer for free, with cryptographic proof that it's authentic. Training is competitive. Inference is shared Open-weight models are converging on quality with closed models. Labs will continue to differentiate on training - data curation, architecture innovation, RLHF tuning. That's where the real intellectual property lives. But inference is a commodity. Two copies of Qwen-32B running the same prompt produce the same KV state and the same response, byte for byte, regardless of whose GPU runs the matrix multiplication. There is no moat in multiplying matrices. The moat is in training the weights. A global distributed cache makes this separation explicit. It doesn't matter who trained the model. Once the weights are open, the inference cost approaches zero at scale - because the network remembers every answer and can prove it's correct. No lab, no matter how well-funded, can match this. They cannot share caches across competitors. They scale linearly. The network scales logarithmically. The marginal cost of intelligence approaches zero. That's the endgame.

Varun

37,362 次观看 • 4 个月前

I gave a talk at GPU MODE workshop last week on llm.c - the origin story of llm.c - being naked in the world without PyTorch and having to re-invent Array, Autograd, Device, Dtype, Compile, Distributed - how to port a PyTorch layer to 1) explicit PyTorch - and then to 2) write the backward pass - 3) port forward & backward pass to C - 4) string all the layers together - achieving one file of C with no dependencies that compiles and runs ~instantly, where all memory is pre-planned and allocated a single time, fully deterministic, portable code that can run on a potato or a von Neumann probe - how most of llm.c was built at 1am-7am in a water villa porch in Maldives and why this is the recommended way to develop software - convert all of it to run in CUDA on GPU in fp32 - port matmul to cuBLAS - port attention to cuDNN flash-attention - introduce bfloat16 mixed precision - introduce many more optimizations and features like kernel fusions, Packed128, stochastic rounding, full determinism - add multi-GPU training, NCCL, sharded optimizer - add multi-node with MPI or file system or socket - reproduce GPT-2 (1.6B) on one 8XH100 node in 24 hours for $672 in llm.c, achieving (at the time) 29% less memory, 19% faster training that PyTorch nightly, and much faster compile & run - how open source development attracts Avengers from the internet - port to training Llama 3 imminent (branch exists) - many other notable forks - last thought: how software abstractions like Python/PyTorch and everything else really exist only because humans are finite in knowledge, IQ and attention, and how with increasing AI capability LLMs may export custom binaries like llm.c for any application directly, tearing apart and refactoring all abstractions as needed. More links in reply

I gave a talk at GPU MODE workshop last week on llm.c - the origin story of llm.c - being naked in the world without PyTorch and having to re-invent Array, Autograd, Device, Dtype, Compile, Distributed - how to port a PyTorch layer to 1) explicit PyTorch - and then to 2) write the backward pass - 3) port forward & backward pass to C - 4) string all the layers together - achieving one file of C with no dependencies that compiles and runs ~instantly, where all memory is pre-planned and allocated a single time, fully deterministic, portable code that can run on a potato or a von Neumann probe - how most of llm.c was built at 1am-7am in a water villa porch in Maldives and why this is the recommended way to develop software - convert all of it to run in CUDA on GPU in fp32 - port matmul to cuBLAS - port attention to cuDNN flash-attention - introduce bfloat16 mixed precision - introduce many more optimizations and features like kernel fusions, Packed128, stochastic rounding, full determinism - add multi-GPU training, NCCL, sharded optimizer - add multi-node with MPI or file system or socket - reproduce GPT-2 (1.6B) on one 8XH100 node in 24 hours for $672 in llm.c, achieving (at the time) 29% less memory, 19% faster training that PyTorch nightly, and much faster compile & run - how open source development attracts Avengers from the internet - port to training Llama 3 imminent (branch exists) - many other notable forks - last thought: how software abstractions like Python/PyTorch and everything else really exist only because humans are finite in knowledge, IQ and attention, and how with increasing AI capability LLMs may export custom binaries like llm.c for any application directly, tearing apart and refactoring all abstractions as needed. More links in reply

Andrej Karpathy

336,280 次观看 • 1 年前

NVIDIA just handed every solo creator and freelancer an unfair advantage. Jensen Huang walked on stage and announced RTX Spark. An ARM-based laptop chip that nobody saw coming. They called it the most power efficient PC chip ever built. 20 cores. Blackwell graphics. 6144 CUDA cores. Up to 128GB of LPDDR5X memory. But forget the spec sheet for a second. Here is what actually matters. RTX Spark is built to run AI models locally. No cloud subscription. No API costs. No waiting on a server somewhere. Everything runs directly on the laptop at full speed. That changes the math completely for anyone using AI to make money. The guy generating 3D assets in Blender with Claude his renders now take minutes instead of hours. More projects per day. More income per week. The girl producing AI kids content for YouTube local rendering means no upload wait times, no generation limits, no monthly fees eating into her margins. The freelancer building websites and automating outreach every AI tool in his stack now runs faster and cheaper than before. 30 laptops from Asus, Dell, Lenovo, MSI and others. Available this fall. For years the barrier was hardware. You needed an expensive setup to run serious AI workflows locally. NVIDIA just put that power inside a thin laptop anyone can carry anywhere. The people who already figured out how to monetize AI are about to move twice as fast. The people who haven’t started yet just ran out of excuses. Save this.

NVIDIA just handed every solo creator and freelancer an unfair advantage. Jensen Huang walked on stage and announced RTX Spark. An ARM-based laptop chip that nobody saw coming. They called it the most power efficient PC chip ever built. 20 cores. Blackwell graphics. 6144 CUDA cores. Up to 128GB of LPDDR5X memory. But forget the spec sheet for a second. Here is what actually matters. RTX Spark is built to run AI models locally. No cloud subscription. No API costs. No waiting on a server somewhere. Everything runs directly on the laptop at full speed. That changes the math completely for anyone using AI to make money. The guy generating 3D assets in Blender with Claude his renders now take minutes instead of hours. More projects per day. More income per week. The girl producing AI kids content for YouTube local rendering means no upload wait times, no generation limits, no monthly fees eating into her margins. The freelancer building websites and automating outreach every AI tool in his stack now runs faster and cheaper than before. 30 laptops from Asus, Dell, Lenovo, MSI and others. Available this fall. For years the barrier was hardware. You needed an expensive setup to run serious AI workflows locally. NVIDIA just put that power inside a thin laptop anyone can carry anywhere. The people who already figured out how to monetize AI are about to move twice as fast. The people who haven’t started yet just ran out of excuses. Save this.

Shelpid.WI3M

27,738 次观看 • 1 个月前

In just one week, Binh Pham and I trained a full-body Unitree G1. Here's a recap: 1. Secured a Unitree G1 humanoid through a LinkedIn post 2. Deployed TWIST2 full-body teleoperation pipelines 3. Adapted TWIST2 for Zed stereo camera & collected full-body teleoperation samples (carried by Binh Pham ) 4. Adapted & fine-tuned NVIDIA Gr00T N1.5 VLA on the TWIST2 public datasets, which I fine-tuned on an 8xNVIDIA H100 Cluster. We picked Gr00T N1.5 as it was trained with Unitree G1 embodiment data. 5. Adapted the TWIST2 codebase to stream in the actions from Gr00T via ZMQ using a co-located NVIDIA H100 for ~200ms inference latency 6. Tested the model in sim, then deployed to the real-world Unitree G1. We streamed a training sample observation to the VLA (as we didn't want to break robot in case real observations were OOD) We were the first team in the world to deploy the full TWIST2 data collection pipeline to the unitree g1 :) Much more work ahead though, which I'll work on as a side-project over the next months: 1. Exploring the various types of 'world models': video backbones, dynamics models, v-jepa-2 models. I believe these will generalize better & train much more data-efficiently than VLM backbones 2. Speeding up inference - I believe low-latency robotics inference will be a big challenge. There are many works in video diffusion which I'd like to test (e.g. SageAttention, SparseAttention, Drifting Models). Perhaps also writing custom CUDA kernels. 3. Economics of inference scaling :) What will be the compute demands as we scale inference up to millions of humanoids? Will it run on edge or on distributed 'co-located' inference clusters? These are questions I'd like to answer. Adapted TWIST2 codebase: Adapted Gr00T-N1.5 codebase: The ETH Robotics Club are doing a cool GTC Golden ticket competition with NVIDIA , so this is my submission :) The DGX Spark compute will get me a long way with initial prototyping & especially working on inference optimization for next-gen Blackwell GPUs #NVIDIAGTC #GOLDENTICKET #ETHRC

In just one week, Binh Pham and I trained a full-body Unitree G1. Here's a recap: 1. Secured a Unitree G1 humanoid through a LinkedIn post 2. Deployed TWIST2 full-body teleoperation pipelines 3. Adapted TWIST2 for Zed stereo camera & collected full-body teleoperation samples (carried by Binh Pham ) 4. Adapted & fine-tuned NVIDIA Gr00T N1.5 VLA on the TWIST2 public datasets, which I fine-tuned on an 8xNVIDIA H100 Cluster. We picked Gr00T N1.5 as it was trained with Unitree G1 embodiment data. 5. Adapted the TWIST2 codebase to stream in the actions from Gr00T via ZMQ using a co-located NVIDIA H100 for ~200ms inference latency 6. Tested the model in sim, then deployed to the real-world Unitree G1. We streamed a training sample observation to the VLA (as we didn't want to break robot in case real observations were OOD) We were the first team in the world to deploy the full TWIST2 data collection pipeline to the unitree g1 :) Much more work ahead though, which I'll work on as a side-project over the next months: 1. Exploring the various types of 'world models': video backbones, dynamics models, v-jepa-2 models. I believe these will generalize better & train much more data-efficiently than VLM backbones 2. Speeding up inference - I believe low-latency robotics inference will be a big challenge. There are many works in video diffusion which I'd like to test (e.g. SageAttention, SparseAttention, Drifting Models). Perhaps also writing custom CUDA kernels. 3. Economics of inference scaling :) What will be the compute demands as we scale inference up to millions of humanoids? Will it run on edge or on distributed 'co-located' inference clusters? These are questions I'd like to answer. Adapted TWIST2 codebase: Adapted Gr00T-N1.5 codebase: The ETH Robotics Club are doing a cool GTC Golden ticket competition with NVIDIA , so this is my submission :) The DGX Spark compute will get me a long way with initial prototyping & especially working on inference optimization for next-gen Blackwell GPUs #NVIDIAGTC #GOLDENTICKET #ETHRC

Arnie Ramesh

14,815 次观看 • 5 个月前

New interview: Reiner Pope, co-founder/CEO of MatX A counterintuitive throughput insight: “Low latency means small batch sizes. That is just Little’s law. Memory occupancy in HBM is proportional to batch size. So you can actually fit longer contexts than you could if the latency were larger. Low latency is not just a usability win, it improves throughput.” We get into: • The hybrid SRAM + HBM bet, and why pipeline parallelism finally works • Why sparse MoE drives MatX to “the most interconnect of any announced product” • Why frontier labs are willing to bet on an AI ASIC startup • Memory-bandwidth-efficient attention, numerics, and what MatX publishes (and what it does not) • Why 95% of model-side news is noise for chip design • The biggest challenges ahead 00:00 “We left Google one week before ChatGPT” 00:24 Intro: who is MatX 01:17 Origin story: leaving Google for LLM chips 02:21 GPT-3 and the “too expensive” problem 04:25 Why buy hardware that is not a GPU 05:52 Overcoming the CUDA moat 08:46 Early investors 09:35 The name MatX 09:59 The chip: matrix multiply + hybrid SRAM/HBM 12:11 Why pipeline parallelism finally works 14:22 Reading papers and Google going dark 15:20 Research agenda: attention and numerics 17:06 Five specs and meeting customers where they are 19:24 Why frontier labs are the natural first customer 20:32 Workloads: training, prefill, decode 22:18 Little’s law and the throughput case for low latency 24:29 Interconnect and MoE topology 26:35 Inside the team: 100 people, full stack 28:32 Agentic AI: 95% noise for hardware 30:35 KV cache sizing in an agentic world 32:11 How MatX uses AI for chip design (Verilog + BlueSpec) 34:23 Go to market: proving credibility under NDA 35:12 Porting effort for frontier labs 36:34 Biggest skepticism: manufacturing at gigawatt scale 37:32 Hiring plug Vikram Sekar

New interview: Reiner Pope, co-founder/CEO of MatX A counterintuitive throughput insight: “Low latency means small batch sizes. That is just Little’s law. Memory occupancy in HBM is proportional to batch size. So you can actually fit longer contexts than you could if the latency were larger. Low latency is not just a usability win, it improves throughput.” We get into: • The hybrid SRAM + HBM bet, and why pipeline parallelism finally works • Why sparse MoE drives MatX to “the most interconnect of any announced product” • Why frontier labs are willing to bet on an AI ASIC startup • Memory-bandwidth-efficient attention, numerics, and what MatX publishes (and what it does not) • Why 95% of model-side news is noise for chip design • The biggest challenges ahead 00:00 “We left Google one week before ChatGPT” 00:24 Intro: who is MatX 01:17 Origin story: leaving Google for LLM chips 02:21 GPT-3 and the “too expensive” problem 04:25 Why buy hardware that is not a GPU 05:52 Overcoming the CUDA moat 08:46 Early investors 09:35 The name MatX 09:59 The chip: matrix multiply + hybrid SRAM/HBM 12:11 Why pipeline parallelism finally works 14:22 Reading papers and Google going dark 15:20 Research agenda: attention and numerics 17:06 Five specs and meeting customers where they are 19:24 Why frontier labs are the natural first customer 20:32 Workloads: training, prefill, decode 22:18 Little’s law and the throughput case for low latency 24:29 Interconnect and MoE topology 26:35 Inside the team: 100 people, full stack 28:32 Agentic AI: 95% noise for hardware 30:35 KV cache sizing in an agentic world 32:11 How MatX uses AI for chip design (Verilog + BlueSpec) 34:23 Go to market: proving credibility under NDA 35:12 Porting effort for frontier labs 36:34 Biggest skepticism: manufacturing at gigawatt scale 37:32 Hiring plug Vikram Sekar

Semi Doped

19,439 次观看 • 3 个月前

EP-355 with Sudhir Chaudhary premieres today at 5 PM IST “No one pays anyone a single rupee without knowing their market value.” Sudhir Chaudhary on being called the highest-paid government employee “The TRP system has single-handedly destroyed journalism.” Sudhir Chaudhary “I apologised the next day, mistakes happen during live news.” Sudhir Chaudhary on the ‘nano chip’ controversy “All the courts have given me a clean chit.” Sudhir Chaudhary on the alleged extortion case “I don’t know how you found out about it…” Sudhir Chaudhary on making a film #ANIPodcast #SmitaPrakash #SudhirChaudhary #Decode #DD #News #Media Tap 'notify me' to get episode alerts:

EP-355 with Sudhir Chaudhary premieres today at 5 PM IST “No one pays anyone a single rupee without knowing their market value.” Sudhir Chaudhary on being called the highest-paid government employee “The TRP system has single-handedly destroyed journalism.” Sudhir Chaudhary “I apologised the next day, mistakes happen during live news.” Sudhir Chaudhary on the ‘nano chip’ controversy “All the courts have given me a clean chit.” Sudhir Chaudhary on the alleged extortion case “I don’t know how you found out about it…” Sudhir Chaudhary on making a film #ANIPodcast #SmitaPrakash #SudhirChaudhary #Decode #DD #News #Media Tap 'notify me' to get episode alerts:

ANI

178,864 次观看 • 9 个月前

#WATCH | On #GSTReforms, Namit Joshi, Chairman of Pharmexcil and Commercial Director at Centrient Pharmaceuticals, says, "I would like to congratulate our PM for bringing in such a massive reform which no one expected. With this reform of reducing the GST on pharmaceutical from 12% to 5%, and on life-saving medicines nil, I think now it is our responsibility, of the industry, to pass on this benefit to the consumers, to the patients. There is a vision of our PM to give access to medicine, the medicine should be quality-driven and affordable. I think that was one of the components that was making the medicine a little bit on the expensive side, and now with this humongous reduction on the tax front, I think the ultimate beneficiary should be the consumer or the patient. It's one of the very positive steps, it's a celebration time for India and the pharmaceutical industry because ultimately the consumer gets benefit out of it..."

#WATCH | On #GSTReforms, Namit Joshi, Chairman of Pharmexcil and Commercial Director at Centrient Pharmaceuticals, says, "I would like to congratulate our PM for bringing in such a massive reform which no one expected. With this reform of reducing the GST on pharmaceutical from 12% to 5%, and on life-saving medicines nil, I think now it is our responsibility, of the industry, to pass on this benefit to the consumers, to the patients. There is a vision of our PM to give access to medicine, the medicine should be quality-driven and affordable. I think that was one of the components that was making the medicine a little bit on the expensive side, and now with this humongous reduction on the tax front, I think the ultimate beneficiary should be the consumer or the patient. It's one of the very positive steps, it's a celebration time for India and the pharmaceutical industry because ultimately the consumer gets benefit out of it..."

ANI

13,940 次观看 • 10 个月前

#WATCH | Chandigarh: On Colonel Pushpinder Singh Bath assaulted by Punjab Police personnel in Patiala, Lt Gen Mohit Wadhwa, Chief of Staff, HQ Western Command says, "...I am addressing you all about the unfortunate incident on the night of 13th March wherein a serving Colonel Pushpinder Singh Bath of the Indian army was assaulted by certain Punjab policeman outside a dhaba at Patiala...The officer was shifted from the Civil hospital to the military hospital and thereafter underwent treatment at Chandimandar at the Command Hospital and is presently recuperating from his injuries...The Punjab Police have regretted undesirable actions on the part of their personnel. They have identified the policemen involved and issued their immediate suspension as well as transfer out of Patiala...An FIR based on the complaint launched by Colonel Pushpinder Singh Bath was registered at the Civil Lines police station...The probe is now being undertaken by the Special Investigation Team under an additional Director General of Police to be completed in the earliest possible time frame...We reiterate the need for a fair and honest investigation in a transparent and very time-bound manner to punish the guilty and restore faith in the system..."

#WATCH | Chandigarh: On Colonel Pushpinder Singh Bath assaulted by Punjab Police personnel in Patiala, Lt Gen Mohit Wadhwa, Chief of Staff, HQ Western Command says, "...I am addressing you all about the unfortunate incident on the night of 13th March wherein a serving Colonel Pushpinder Singh Bath of the Indian army was assaulted by certain Punjab policeman outside a dhaba at Patiala...The officer was shifted from the Civil hospital to the military hospital and thereafter underwent treatment at Chandimandar at the Command Hospital and is presently recuperating from his injuries...The Punjab Police have regretted undesirable actions on the part of their personnel. They have identified the policemen involved and issued their immediate suspension as well as transfer out of Patiala...An FIR based on the complaint launched by Colonel Pushpinder Singh Bath was registered at the Civil Lines police station...The probe is now being undertaken by the Special Investigation Team under an additional Director General of Police to be completed in the earliest possible time frame...We reiterate the need for a fair and honest investigation in a transparent and very time-bound manner to punish the guilty and restore faith in the system..."

ANI

267,634 次观看 • 1 年前

After 8+ years on the Tesla Autopilot team and 3 years at Intel, I started Apex Compute to design a new architecture for efficient AI inference. For the past 9 months, we’ve been building our custom inference accelerator. Today we’re releasing Unified Engine v1. Last June we raised our seed round with Maxitech , DeepFin Research, Soma Capital and an incredible group of angel investors. In less than 9 months, we completed our RTL architecture and brought our first pre-silicon prototype to life on FPGA. Our architecture combines systolic array and vector processing in a single compute engine with multiple architectural optimizations, achieving very high FLOPs utilization. A single engine is super lean and it uses less than 90K LUTs and 1 MB Block RAM. It may also be one of the smallest logic-footprint compute engines developed so far. Our Unified Engine v1 supports: -matrix-matrix multiplication (~95% FLOPs utilization) -softmax (~90% FLOPs utilization) -broadcast and element-wise operations -RMSNorm / LayerNorm -block quantization/dequantization (fp4, int4) -multi-engine synchronization and many other operations. We even implemented memory-efficient attention similar to FlashAttention, reaching ~90% FLOP utilization. Full benchmarks and the software stack are available on our GitHub: We have basic compiler written in Python and it supports PyTorch tensors directly to easily test and transfer tensors between the accelerator and host using bf16, fp4 and int4 formats. Our FPGA prototype can already run LLM inference and outperform NVIDIA Jetson Orin Nano, even on a mid-tier FPGA setup (6.4x lower memory bandwidth, 18% slower clock speed at 4.5 Watts). Check the side-by-side comparison video below. Our GitHub includes low-level operator implementations, examples for tiled matrix multiplication, operation chaining, tensor parallelism, attention kernel and a full Gemma 3 1B model implementation. Many more models(Vision Transformers and VLA) are coming soon. Our accelerator IP is AXI-ready for deployment on any AMD(Xilinx) FPGA platform today. Even better, our two-engine prototype runs on an entry-level AMD(Xilinx) FPGA as a PCIe accelerator card. You can purchase it here for $50 to experiment our pre-silicon prototype on your desktop PC or Raspberry Pi 5. We will be releasing hardware bitstream updates as the architecture gets new features. More to come soon! We are expanding our team and looking for compiler engineers and floating-point hardware design engineers. If you're interested, please send me a DM.

After 8+ years on the Tesla Autopilot team and 3 years at Intel, I started Apex Compute to design a new architecture for efficient AI inference. For the past 9 months, we’ve been building our custom inference accelerator. Today we’re releasing Unified Engine v1. Last June we raised our seed round with Maxitech , DeepFin Research, Soma Capital and an incredible group of angel investors. In less than 9 months, we completed our RTL architecture and brought our first pre-silicon prototype to life on FPGA. Our architecture combines systolic array and vector processing in a single compute engine with multiple architectural optimizations, achieving very high FLOPs utilization. A single engine is super lean and it uses less than 90K LUTs and 1 MB Block RAM. It may also be one of the smallest logic-footprint compute engines developed so far. Our Unified Engine v1 supports: -matrix-matrix multiplication (~95% FLOPs utilization) -softmax (~90% FLOPs utilization) -broadcast and element-wise operations -RMSNorm / LayerNorm -block quantization/dequantization (fp4, int4) -multi-engine synchronization and many other operations. We even implemented memory-efficient attention similar to FlashAttention, reaching ~90% FLOP utilization. Full benchmarks and the software stack are available on our GitHub: We have basic compiler written in Python and it supports PyTorch tensors directly to easily test and transfer tensors between the accelerator and host using bf16, fp4 and int4 formats. Our FPGA prototype can already run LLM inference and outperform NVIDIA Jetson Orin Nano, even on a mid-tier FPGA setup (6.4x lower memory bandwidth, 18% slower clock speed at 4.5 Watts). Check the side-by-side comparison video below. Our GitHub includes low-level operator implementations, examples for tiled matrix multiplication, operation chaining, tensor parallelism, attention kernel and a full Gemma 3 1B model implementation. Many more models(Vision Transformers and VLA) are coming soon. Our accelerator IP is AXI-ready for deployment on any AMD(Xilinx) FPGA platform today. Even better, our two-engine prototype runs on an entry-level AMD(Xilinx) FPGA as a PCIe accelerator card. You can purchase it here for $50 to experiment our pre-silicon prototype on your desktop PC or Raspberry Pi 5. We will be releasing hardware bitstream updates as the architecture gets new features. More to come soon! We are expanding our team and looking for compiler engineers and floating-point hardware design engineers. If you're interested, please send me a DM.

Hasan

37,603 次观看 • 4 个月前