PowerInfer - a high-speed inference engine for deploying LLMs locally. Just came across this super interesting project on speeding up inference. It's not MoE but it's a simple approach that exploits the high locality in LLM inference to design a GPU-CPU hybrid inference engine. Hot-activated neurons are preloaded onto...
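To make the hot/cold idea above concrete, here is a minimal sketch of splitting neurons by how often they fire. It is not PowerInfer's code: the function names, the `gpu_budget` parameter, and the sample counts are all hypothetical, and it assumes the "hot" set is the one kept on the faster device.

```python
def partition_neurons(activation_counts, gpu_budget):
    """Split neuron indices into a hot set and a cold set.

    activation_counts[i] is how often neuron i fired in some profiling run
    (hypothetical input; this sketch does not cover how a real engine profiles).
    gpu_budget is how many neurons we assume fit on the faster device.
    """
    if gpu_budget < 0:
        raise ValueError("gpu_budget must be non-negative")
    # Rank neuron indices from most to least frequently activated.
    ranked = sorted(
        range(len(activation_counts)),
        key=lambda i: activation_counts[i],
        reverse=True,
    )
    hot = sorted(ranked[:gpu_budget])   # assumed to be preloaded on the GPU
    cold = sorted(ranked[gpu_budget:])  # assumed to stay on the CPU
    return hot, cold


def hot_coverage(activation_counts, hot):
    """Fraction of all recorded activations that land on the hot set."""
    total = sum(activation_counts)
    if total == 0:
        return 0.0
    return sum(activation_counts[i] for i in hot) / total


# Made-up counts where a few neurons account for most activations.
counts = [90, 2, 75, 1, 3, 80, 0, 4]
hot, cold = partition_neurons(counts, gpu_budget=3)
coverage = hot_coverage(counts, hot)
```

With skewed counts like these, a small hot set covers most of the activations, which is the kind of locality the post is describing.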
261,583 views • 2 years ago • via X (Twitter)
10 Comments

source: github:

This is HUGE. I always suspected that there would be a way to break up the architecture of a model to alleviate the GPU cost and availability bottleneck. Will play with this on LLaMa2, but highly anticipating Mistral-7B.

It's very interesting! An 11.69x improvement over llama.cpp sounds like a lot, given that it's already super efficient. Are you sure it's not 11.7%? Even that could be a good result.

I almost tried this before I realized it's one more model exchange format 🥲 ".powerinfer.gguf" Can't all of you just use onnx and move on in life? 😭

Nice, I was working in this direction recently and noticed the same things. A 12x speedup is very respectable compared to 8x being the best so far achieved by any one approach (e.g. pruning/quantization)!

15 GiBs of weights? And this is supposed to run on Consumer Devices?! 🧐

The week I got a new PC with a beefy CPU. Fucking nice.

Paper tl;dr

Fastest bookmark 🤫🤫🤫

Noob question, is this just for the activation function or is it also cutting down the number of entries for GPU matrix multiplication as well?
