Speculative decoding speeds up generation from LLMs significantly by computing several potential tokens in parallel. Learn about this technique and how it has been utilized to achieve 2–3x speed-ups at inference:

38,566 views • 1 year ago • via X (Twitter)

5 Comments

Cohorte • 1 year ago

Instead of generating tokens sequentially, the model computes several possible tokens in parallel. A smaller, faster "draft" model suggests likely next tokens, and a larger model validates and finalizes them.
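The draft-then-validate loop described in this comment can be sketched with toy models. This is a minimal illustration, not any library's implementation: `draft_model` and `target_model` are hypothetical stand-ins that map a token prefix to one greedy next token, and the check is done in a Python loop, whereas a real system would get all target predictions from a single batched forward pass. It covers only greedy decoding; sampling-based variants use a rejection-sampling acceptance rule instead of an exact match.

```python
from typing import Callable, List

# Assumption: a "model" here is just a function from a token prefix
# to its greedy next token. Real models return logits.
Model = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: Model, target: Model, k: int) -> List[int]:
    """Run one draft-then-verify round and return the tokens to append."""
    # 1. The draft model proposes k tokens, one after another.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The target model checks each proposed position in order.
    accepted: List[int] = []
    for tok in proposed:
        expected = target(prefix + accepted)
        if tok != expected:
            # First mismatch: keep everything before it, emit the
            # target's own token, and discard the rest of the draft.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every draft token matched, so the target's next prediction
    #    comes for free as a bonus token.
    accepted.append(target(prefix + accepted))
    return accepted


def generate(prefix: List[int], draft: Model, target: Model, k: int, n_new: int) -> List[int]:
    """Generate n_new tokens using repeated speculative rounds."""
    out = list(prefix)
    while len(out) - len(prefix) < n_new:
        out.extend(speculative_step(out, draft, target, k))
    return out[: len(prefix) + n_new]


def greedy(prefix: List[int], model: Model, n_new: int) -> List[int]:
    """Plain one-token-at-a-time decoding, for comparison."""
    out = list(prefix)
    for _ in range(n_new):
        out.append(model(out))
    return out


# Hypothetical toy models over a 50-token vocabulary.
def target_model(tokens: List[int]) -> int:
    return (sum(tokens) * 31 + 7) % 50


def draft_model(tokens: List[int]) -> int:
    # Agrees with the target except when the prefix length is a multiple of 4.
    guess = target_model(tokens)
    return guess if len(tokens) % 4 else (guess + 1) % 50


if __name__ == "__main__":
    spec = generate([1, 2, 3], draft_model, target_model, k=4, n_new=20)
    print(spec == greedy([1, 2, 3], target_model, n_new=20))
```

Because every emitted token is either confirmed or produced by the target model, the output matches plain greedy decoding from the target. Each round yields between 1 and `k + 1` tokens, which is where the speed-up comes from when the draft is cheap and usually right.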

The AI Veteran • 1 year ago

So, we shouldn't be impressed by longer thinking times? Nice work!

Rodrigo 🇨🇱 WnAI - he/him ه҈̿҈̿҈̿ • 1 year ago

Considering speculative decoding and the Johnson-Lindenstrauss lemma: reducing the dimensionality of the latent space (small model) could distort distances and lose crucial information from the large model. How can we mitigate this effect? Is it like expecting FLAC and getting an MP3?

Caduceus • 1 year ago

🚀 Speculative decoding is revolutionizing #LLM performance with 2–3x speed-ups at inference by computing multiple tokens in parallel! 🌐 At #Caduceus, we’re advancing #AI and #Web3 with cutting-edge edge rendering tech. Explore the future of innovation!

PrimeURL (for Startups 🏆) • 1 year ago

GenAiStudio

Similar Videos

Researchers found a way to make LLMs 8.5x faster! (without compromising accuracy)

Speculative decoding is quite an effective way to address the single-token bottleneck in traditional LLM inference. A small "draft" model first generates the next several tokens, then the large model verifies all of them at once in a single forward pass. If a token at any position is wrong, you keep everything before it and restart from there. This never does worse than normal decoding.

But current drafters in Speculative decoding still guess one token at a time. That makes the drafting step itself a bottleneck, capping real-world speedups at 2-3x.

DFlash is a new technique that swaps the autoregressive drafter with a lightweight block diffusion model that guesses all tokens in one parallel shot. Drafting cost stays flat no matter how many tokens you speculate. On top of that, the drafter is conditioned on hidden features pulled from multiple layers of the target model and injected into every draft layer, so it makes significantly better guesses than a drafter working from scratch.

In the side-by-side demo below, vanilla decoding runs at 48.5 tokens/sec. DFlash hits 415 tokens/sec on the same model, with zero quality loss.

It's already integrated with vLLM, SGLang, and Transformers, with draft models on HuggingFace for several models like Qwen3, Qwen3.5, Llama 3.1, Kimi-K2.5, gpt-oss, and many more. I have shared the GitHub repo in the replies!

KV caching is another must-know technique to boost LLM inference. I recently wrote an article about it. Read it below.

👉 Over to you: What use case are you working on that can benefit from this new technique?
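The drafting-bottleneck argument in this post can be made concrete with a rough cost model. The sketch below is an illustration under stated assumptions, not DFlash's own analysis: it assumes each draft token is accepted independently with the same probability `alpha`, measures drafter cost `c` as a fraction of one target forward pass, and treats a parallel drafter as costing a single `c` regardless of draft length. All function names are hypothetical, and the figures it produces are not the benchmark numbers quoted above.

```python
def expected_tokens_per_round(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target pass with draft length gamma.

    Assumes each draft token is accepted independently with probability
    alpha; the round always yields at least one token from the target.
    """
    if alpha >= 1.0:
        return gamma + 1.0
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def speedup_sequential_drafter(alpha: float, gamma: int, c: float) -> float:
    """Estimated speed-up when the drafter guesses one token at a time.

    Drafting gamma tokens costs gamma * c target-pass equivalents,
    plus 1 for the verification pass itself.
    """
    return expected_tokens_per_round(alpha, gamma) / (gamma * c + 1.0)


def speedup_parallel_drafter(alpha: float, gamma: int, c: float) -> float:
    """Estimated speed-up when one drafter pass emits all gamma tokens.

    Assumption: drafting cost is a flat c, independent of gamma.
    """
    return expected_tokens_per_round(alpha, gamma) / (c + 1.0)


if __name__ == "__main__":
    # Illustrative values only: alpha and c are made up for the comparison.
    for gamma in (2, 4, 8, 16):
        seq = speedup_sequential_drafter(alpha=0.8, gamma=gamma, c=0.1)
        par = speedup_parallel_drafter(alpha=0.8, gamma=gamma, c=0.1)
        print(f"gamma={gamma:2d}  sequential={seq:.2f}x  parallel={par:.2f}x")
```

Under these assumptions the sequential estimate peaks and then falls as `gamma` grows, because drafting cost rises linearly while the expected number of accepted tokens saturates. The flat-cost estimate keeps rising toward `1 / ((1 - alpha) * (1 + c))`, so further gains depend on raising the acceptance rate, which is what conditioning the drafter on the target's hidden features is meant to do.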

Avi Chawla

155,880 views • 24 days ago