Speculative decoding speeds up generation from LLMs significantly by computing several potential tokens in parallel. Learn about this technique and how it has been utilized to achieve 2–3x speed-ups at inference:

38,566 views • 1 year ago • via X (Twitter)

5 comments

Cohorte • 1 year ago

Instead of generating tokens sequentially, the model computes several possible tokens in parallel. A smaller, faster "draft" model suggests likely next tokens, and a larger model validates and finalizes them.

The AI Veteran • 1 year ago

So, we shouldn't be impressed by longer thinking times? Nice work!

Rodrigo 🇨🇱 WnAI - he/him ه҈̿҈̿҈̿ • 1 year ago

Considering speculative decoding and the Johnson-Lindenstrauss lemma: reducing the dimensionality of the latent space (small model) could distort distances and lose crucial information from the large model. How can we mitigate this effect? Is it like expecting FLAC and getting an MP3?

Caduceus • 1 year ago

🚀 Speculative decoding is revolutionizing #LLM performance with 2–3x speed-ups at inference by computing multiple tokens in parallel! 🌐 At #Caduceus, we’re advancing #AI and #Web3 with cutting-edge edge rendering tech. Explore the future of innovation!

PrimeURL (for Startups 🏆) • 1 year ago

GenAiStudio

Related videos

Researchers found a way to make LLMs 8.5x faster! (without compromising accuracy)

Speculative decoding is quite an effective way to address the single-token bottleneck in traditional LLM inference. A small "draft" model first generates the next several tokens, then the large model verifies all of them at once in a single forward pass. If a token at any position is wrong, you keep everything before it and restart from there. This never does worse than normal decoding.

But current drafters in speculative decoding still guess one token at a time. That makes the drafting step itself a bottleneck, capping real-world speedups at 2–3x.

DFlash is a new technique that swaps the autoregressive drafter with a lightweight block diffusion model that guesses all tokens in one parallel shot. Drafting cost stays flat no matter how many tokens you speculate. On top of that, the drafter is conditioned on hidden features pulled from multiple layers of the target model and injected into every draft layer, so it makes significantly better guesses than a drafter working from scratch.

In the side-by-side demo below, vanilla decoding runs at 48.5 tokens/sec. DFlash hits 415 tokens/sec on the same model, with zero quality loss.

It's already integrated with vLLM, SGLang, and Transformers, with draft models on HuggingFace for several models like Qwen3, Qwen3.5, Llama 3.1, Kimi-K2.5, gpt-oss, and many more. I have shared the GitHub repo in the replies!

KV caching is another must-know technique to boost LLM inference. I recently wrote an article about it. Read it below.

👉 Over to you: What use case are you working on that can benefit from this new technique?
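The draft-then-verify loop described above can be sketched in a few lines of Python. This is a toy illustration, not DFlash or any library's implementation: `toy_target` and `toy_draft` are hypothetical stand-ins for real models, decoding is assumed to be greedy (real systems that sample use a probabilistic accept/reject rule instead), and verification is written as a loop where a real system would score all drafted positions in one batched forward pass. When a drafted token is rejected, the sketch substitutes the target's own token at that position, which is what the verification pass already provides.

```python
from typing import Callable, List

# Hypothetical interface: a "model" maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]

def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Plain autoregressive decoding: one model call per new token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(model(seq))
    return seq

def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], n_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding with k drafted tokens per round."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    while len(seq) < goal:
        # 1) The small model drafts k tokens, one at a time.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(seq + proposal))
        # 2) The large model checks each drafted position.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(seq + accepted)
            if tok != expected:
                # Keep everything before the mismatch, take the target's token here.
                accepted.append(expected)
                break
            accepted.append(tok)
        else:
            # Every draft matched; the same pass also yields one extra target token.
            accepted.append(target(seq + accepted))
        seq.extend(accepted)
    return seq[:goal]

# Toy stand-ins (assumptions for illustration only).
def toy_target(seq: List[int]) -> int:
    return (sum(seq) * 31 + len(seq)) % 50

def toy_draft(seq: List[int]) -> int:
    # Agrees with the target except on every third position.
    guess = toy_target(seq)
    return (guess + 1) % 50 if len(seq) % 3 == 0 else guess

prompt = [1, 2, 3]
baseline = greedy_decode(toy_target, prompt, 20)
speculative = speculative_decode(toy_target, toy_draft, prompt, 20)
```

Under these assumptions `speculative` and `baseline` are the same sequence, because every token that survives verification is the target's own greedy choice. The speed-up in a real system comes from the target checking several positions per forward pass rather than one.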

Avi Chawla

156,592 views • 1 month ago