PowerInfer - a high-speed inference engine for deploying LLMs locally. Just came across this super interesting project on speeding up inference. It's not MoE, but it is a simple approach that exploits the high locality in LLM inference to design a GPU-CPU hybrid inference engine. Hot-activated neurons are preloaded onto...
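
The hot/cold split mentioned in the post can be illustrated with a toy sketch. It assumes that "hot" means the neurons activated most often during a profiling run, and that a fixed number of them fit on the GPU while the rest stay on the CPU. The function names, the `gpu_budget` parameter, and the synthetic activation counts are hypothetical and are not taken from PowerInfer's code.

```python
import random


def split_hot_cold(activation_counts, gpu_budget):
    """Split neuron indices into (hot, cold) by activation frequency.

    hot:  the `gpu_budget` most frequently activated neurons
          (assumed here to be the ones preloaded onto the GPU)
    cold: all remaining neurons (assumed here to stay on the CPU)
    """
    ranked = sorted(
        range(len(activation_counts)),
        key=lambda i: activation_counts[i],
        reverse=True,
    )
    hot = sorted(ranked[:gpu_budget])
    cold = sorted(ranked[gpu_budget:])
    return hot, cold


def activation_share(activation_counts, neuron_ids):
    """Fraction of all recorded activations that fall on `neuron_ids`."""
    total = sum(activation_counts)
    if total == 0:
        return 0.0
    return sum(activation_counts[i] for i in neuron_ids) / total


# Synthetic, skewed profile: a few neurons fire far more often than the rest.
random.seed(0)
counts = [1000 // (rank + 1) for rank in range(64)]
random.shuffle(counts)

hot, cold = split_hot_cold(counts, gpu_budget=8)
share = activation_share(counts, hot)
print(f"{len(hot)} hot neurons cover {share:.0%} of activations")
```

With a skewed profile like this one, a small hot set accounts for most of the activations, which is the locality the post refers to. Real activation statistics would have to come from profiling an actual model.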

261,583 views • 2 years ago • via X (Twitter)

10 Comments

elvis • 2 years ago

source: github:

Adam Pippert • 2 years ago

This is HUGE. I always suspected that there would be a way to break up the architecture of a model to alleviate the GPU cost and availability bottleneck. Will play with this on LLaMa2, but highly anticipating Mistral-7B.

Hamid R. Darabi • 2 years ago

It's very interesting! An 11.69x improvement over llama.cpp sounds like a lot, given that it's already super efficient. Are you sure it's not 11.7%? Even that could be a good result.

Loki (cute/acc) • 2 years ago

I almost tried this before I realized it's one more model exchange format 🥲 ".powerinfer.gguf" Can't all of you just use onnx and move on in life? 😭

catid (e/acc) • 2 years ago

Nice, I was working in this direction recently and noticed the same things. A 12x speedup is very respectable compared to 8x being the best so far achieved by any one approach (e.g. pruning/quantization)!

Filippo Pedrazzini • 2 years ago

15 GiB of weights? And this is supposed to run on consumer devices?! 🧐

Medium Boss - 70b_Float16.Q8.gguf • 2 years ago

The week I got a new PC with a beefy CPU. Fucking nice.

Sahar Mor • 2 years ago

Paper tl;dr

s3nh • 2 years ago

Fastest bookmark 🤫🤫🤫

Ruairi • 2 years ago

Noob question, is this just for the activation function or is it also cutting down the number of entries for GPU matrix multiplication as well?

Related Videos

I had to test it myself to believe this unreal inference speed. 3,000 tokens/s for 1 user on standard datacenter GPUs. They leveraged a hidden efficiency gap in how GPUs generate tokens. Kog just achieved 3,000 tokens/s on 8× AMD MI300X GPUs and 2,100 on 8× NVIDIA H200 (FP16, no speculative decoding). Their tech preview is on a 2B model, and they show how their techniques will scale to large frontier MoE models at similar speeds. That's a huge number because normal low-batch GPU decoding for 2B to 8B models is usually closer to 100 to 300 tokens/s per request, so Kog is claiming something like a 10X to 30X jump in the speed one user actually feels. Their trick: they are getting the speed by treating LLM decoding as a memory streaming problem, not mainly a math problem. For 1 user at batch size 1, the GPU is not doing big, efficient matrix-matrix work like in training or large-batch serving; it is repeatedly pulling the model’s active weights from high-bandwidth memory for each new token, so speed depends on how smoothly those weights keep flowing. Normal inference stacks keep breaking that flow. They run many separate GPU programs for different parts of the model, move intermediate results through memory, wait at synchronization points, talk back to the CPU for scheduling or sampling, and then repeat this token after token. Kog’s answer is to co-design 3 things that are usually tuned separately: the runtime, the low-level GPU code, and the model architecture. The biggest engineering move is the monokernel, where the whole decode pass runs as 1 persistent GPU-resident program, including sampling, so the system does not keep stopping for kernel launches, CPU scheduling, and intermediate memory round trips. They also rebuilt synchronization, because their own measurements say grid sync was eating around 35% of token-generation time; instead of making every compute unit wait at a broad barrier, each unit waits only for the exact data it needs. 
On AMD MI300X, they also map memory access around the chiplet layout, because memory latency changes depending on which die makes the request. Then their Laneformer model uses Delayed Tensor Parallelism, which lets cross-GPU communication happen in the background instead of blocking every layer.
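
The "memory streaming" framing above implies a simple back-of-envelope ceiling: at batch size 1, each new token requires reading the active weights once, so tokens per second cannot exceed memory bandwidth divided by the bytes read per token. The sketch below is an illustration under stated assumptions, not Kog's method. The function name is hypothetical, the 1 TB/s bandwidth is a round placeholder rather than a spec for any particular GPU, and the estimate ignores KV-cache reads, activations, and multi-GPU sharding.

```python
def decode_ceiling_tokens_per_s(n_params, bytes_per_param, bandwidth_bytes_per_s):
    """Upper bound on single-user decode speed when weight streaming dominates.

    Every generated token is assumed to read all active weights exactly once.
    """
    bytes_per_token = n_params * bytes_per_param
    return bandwidth_bytes_per_s / bytes_per_token


# 2B-parameter model in FP16 (2 bytes per parameter),
# with a hypothetical 1 TB/s of sustained memory bandwidth.
ceiling = decode_ceiling_tokens_per_s(
    n_params=2e9,
    bytes_per_param=2,
    bandwidth_bytes_per_s=1e12,
)
print(f"{ceiling:.0f} tokens/s")  # prints "250 tokens/s"
```

Under these placeholder numbers the ceiling lands inside the 100 to 300 tokens/s range the post cites for ordinary low-batch decoding. Raising that ceiling means either more effective bandwidth (for example, by spreading weights across several GPUs) or fewer stalls between reads, which is what the post attributes to the monokernel and synchronization changes.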

Rohan Paul

13,080 views • 26 days ago

Jensen Huang just identified the next $200 billion market (Save this). The shift starts with an observation about agentic AI that changes everything about infrastructure. In the era of training and inference, the GPU was everything while the CPU was a traffic cop: scheduling work, managing memory, and dispatching tasks while the GPU did the heavy lifting. Agentic AI breaks that model entirely. An AI agent does not just run a single inference pass; it plans, calls tools, executes code in sandboxes, retrieves data from multiple sources, and loops through complex multi-step reasoning sequences, often thousands of times per second at scale. Every one of those operations runs through the CPU, and the GPU sits idle waiting for the CPU to prepare the next task, supply the right context, and execute the retrieval and tool-calling logic fast enough to keep the accelerators fed. The CPU is now the conductor and the GPU is the orchestra, and the bottleneck is the conductor falling behind. This is showing up in production AI factory utilization right now, which is exactly why Jensen built Vera from scratch rather than licensing x86. Vera achieves 40% lower peak memory latency than x86, 50% faster core-to-core communication, and 1.8 times the agentic sandbox performance of current x86 processors on a purpose-built architecture designed around the agentic loop. Now here is where the investment thesis gets interesting. The obvious beneficiary is Nvidia itself, and that thesis is real. Nvidia's CFO has guided for nearly $20 billion in Vera CPU revenue this fiscal year alone, a market Nvidia had zero presence in just three years ago. Intel held 60% of server CPU market share as recently as Q4 2025, and that transition is now happening at a pace Intel structurally cannot respond to. But the deeper question is, what architecture is Vera actually built on?
Vera's Olympus cores are ARM-compatible, and every single Vera CPU deployed in every Vera Rubin rack in every data center in the world runs on ARM architecture. And ARM Holdings collects a royalty on every one of them. ARM does not make chips but rather licenses the instruction set architecture and CPU core designs that others build on top of. Every time Nvidia ships a Vera CPU, every time a hyperscaler deploys a Vera Rubin rack, every time an enterprise qualifies Vera for their AI factory, ARM earns a royalty. The secular tailwind here is almost perfectly constructed for ARM's business model. Amazon's Graviton, Microsoft's Cobalt, Google's Axion, Apple's silicon stack, and Qualcomm's data center push all run on ARM. And now Nvidia's Vera, which is projected to displace Intel as the largest server CPU supplier by revenue in a single fiscal year, is ARM. ARM's royalty rate on high-end server chips is estimated at roughly 1 to 2% of chip selling price. At $5,000 per Vera CPU and 4 million units projected for FY2027, that is a royalty line growing from near zero to potentially $200 million to $400 million annually from Nvidia's data center CPU business alone before counting Amazon, Microsoft, Google, Apple, and Qualcomm. The total ARM addressable royalty base across all the silicon it already licenses is compounding at a rate that the current $130 billion market cap does not fully reflect. Jensen's CPU thesis is the most underappreciated catalyst in ARM's fundamental story, and the royalty compounding has barely started. Come join Milk Road Pro and get our full ARM royalty model and our entire AI trade thesis. Link below!
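
The royalty estimate in the post is a product of three quoted inputs: units shipped, selling price per chip, and royalty rate. A minimal sketch of that arithmetic follows, taking the post's inputs at face value. The function name is illustrative, and the rate is expressed in basis points (100 bps = 1%) so the calculation stays in exact integers.

```python
def annual_royalty_usd(units, price_usd, royalty_bps):
    """Royalty = units x selling price x rate, with the rate in basis points."""
    chip_revenue = units * price_usd
    return chip_revenue * royalty_bps // 10_000


units = 4_000_000   # units projected for FY2027, as quoted in the post
price_usd = 5_000   # selling price per Vera CPU, as quoted in the post

low = annual_royalty_usd(units, price_usd, royalty_bps=100)   # 1% rate
high = annual_royalty_usd(units, price_usd, royalty_bps=200)  # 2% rate
print(f"chip revenue: ${units * price_usd:,}")
print(f"royalty range: ${low:,} to ${high:,}")
```

Because the result scales linearly with each input, the range moves one-for-one with any change to the assumed rate, price, or unit count.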

Milk Road AI

11,785 views • 20 days ago