Speculative decoding speeds up generation from LLMs significantly by computing several potential tokens in parallel. Learn about this technique and how it has been utilized to achieve 2–3x speed-ups at inference:

38,566 views • 1 year ago • via X (Twitter)

5 comments

Cohorte • 1 year ago

Instead of generating tokens sequentially, the model computes several possible tokens in parallel. A smaller, faster "draft" model suggests likely next tokens, and a larger model validates and finalizes them.
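
A minimal sketch of the draft-then-validate loop described in this comment, under stated assumptions: both models are toy functions that map a token list to the next token, acceptance is greedy (a drafted token is kept only if it equals the large model's own choice), and the names `speculative_decode`, `target_model`, and `draft_model` are hypothetical, not taken from any library. A real system would score all drafted positions in one batched forward pass of the large model; the inner loop here only stands in for that pass.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # hypothetical greedy next-token interface


def greedy_decode(model: NextToken, prompt: List[Token], max_new: int) -> List[Token]:
    """Plain sequential decoding: one model call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(model(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[Token], max_new: int, k: int = 4) -> List[Token]:
    """Greedy speculative decoding; the output matches greedy_decode(target, ...)."""
    out = list(prompt)
    goal = len(prompt) + max_new
    while len(out) < goal:
        # 1) The small draft model proposes k tokens.
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2) The large model checks each drafted position (stand-in for a
        #    single batched verification pass).
        accepted: List[Token] = []
        for i in range(k):
            expected = target(out + proposal[:i])
            accepted.append(expected)  # always the large model's token
            if expected != proposal[i]:
                break  # first mismatch: drop the rest of the draft
        out.extend(accepted)
    return out[:goal]


def target_model(ctx: List[Token]) -> Token:
    """Toy 'large' model: a fixed arithmetic rule on the last token."""
    return (ctx[-1] * 3 + 1) % 10


def draft_model(ctx: List[Token]) -> Token:
    """Toy 'small' model: agrees with the target except after a 3."""
    return 5 if ctx[-1] == 3 else (ctx[-1] * 3 + 1) % 10


if __name__ == "__main__":
    fast = speculative_decode(target_model, draft_model, [1], max_new=10)
    slow = greedy_decode(target_model, [1], max_new=10)
    print(fast == slow)
```

Because every emitted token is the large model's own choice, the draft model affects only speed, never the output, under this greedy acceptance rule.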

The AI Veteran • 1 year ago

So, we shouldn't be impressed by longer thinking times? Nice work!

Rodrigo 🇨🇱 WnAI - he/him ه҈̿҈̿҈̿ • 1 year ago

Considering speculative decoding and the Johnson-Lindenstrauss lemma: reducing the dimensionality of the latent space (small model) could distort distances and lose crucial information from the large model. How can we mitigate this effect? Is it like expecting FLAC and getting an MP3?

Caduceus • 1 year ago

🚀 Speculative decoding is revolutionizing #LLM performance with 2–3x speed-ups at inference by computing multiple tokens in parallel! 🌐 At #Caduceus, we’re advancing #AI and #Web3 with cutting-edge edge rendering tech. Explore the future of innovation!

PrimeURL (for Startups 🏆) • 1 year ago

GenAiStudio

Related videos

Researchers found a way to make LLMs 8.5x faster! (without compromising accuracy)

Speculative decoding is quite an effective way to address the single-token bottleneck in traditional LLM inference. A small "draft" model first generates the next several tokens, then the large model verifies all of them at once in a single forward pass. If a token at any position is wrong, you keep everything before it and restart from there. This never does worse than normal decoding.

But current drafters in speculative decoding still guess one token at a time. That makes the drafting step itself a bottleneck, capping real-world speedups at 2-3x.

DFlash is a new technique that swaps the autoregressive drafter with a lightweight block diffusion model that guesses all tokens in one parallel shot. Drafting cost stays flat no matter how many tokens you speculate. On top of that, the drafter is conditioned on hidden features pulled from multiple layers of the target model and injected into every draft layer, so it makes significantly better guesses than a drafter working from scratch.

In the side-by-side demo below, vanilla decoding runs at 48.5 tokens/sec. DFlash hits 415 tokens/sec on the same model, with zero quality loss.

It's already integrated with vLLM, SGLang, and Transformers, with draft models on HuggingFace for several models like Qwen3, Qwen3.5, Llama 3.1, Kimi-K2.5, gpt-oss, and many more. I have shared the GitHub repo in the replies!

KV caching is another must-know technique to boost LLM inference. I recently wrote an article about it. Read it below.

👉 Over to you: What use case are you working on that can benefit from this new technique?
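
As a quick sanity check on the demo figures quoted above, the two throughput numbers can be turned into a speedup factor and a wall-clock estimate. This is only arithmetic on the numbers in the post; the 1,000-token response length is a hypothetical value chosen for illustration.

```python
vanilla_tps = 48.5  # tokens/sec for vanilla decoding (figure from the post)
dflash_tps = 415.0  # tokens/sec for DFlash (figure from the post)

# Speedup is the ratio of the two throughputs.
speedup = dflash_tps / vanilla_tps
print(f"speedup: {speedup:.2f}x")  # speedup: 8.56x

# Hypothetical response length, used only to make the gap concrete.
n_tokens = 1000
vanilla_seconds = n_tokens / vanilla_tps
dflash_seconds = n_tokens / dflash_tps
print(f"vanilla: {vanilla_seconds:.1f} s")  # vanilla: 20.6 s
print(f"DFlash:  {dflash_seconds:.1f} s")   # DFlash:  2.4 s
```

The ratio comes out at about 8.56, which is consistent with the "8.5x" in the headline.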

Avi Chawla

155,880 views • 24 days ago