Speculative decoding speeds up generation from LLMs significantly by computing several potential tokens in parallel. Learn about this technique and how it has been utilized to achieve 2–3x speed-ups at inference:

Google AI

2,279,785 subscribers

38,566 views • 1 year ago • via X (Twitter)

Science & Technology

5 Comments

Cohorte • 1 year ago

Instead of generating tokens sequentially, the model computes several possible tokens in parallel. A smaller, faster "draft" model suggests likely next tokens, and a larger model validates and finalizes them.
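
The draft-and-verify loop described in this comment can be sketched roughly as follows. This is a minimal greedy-decoding illustration, not the implementation from the video: the two "models" are toy next-token functions invented for the example, and the target is called once per position, where a real system would score all drafted positions in a single batched forward pass.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function


def target_next(ctx: List[Token]) -> Token:
    """Toy 'large' model (hypothetical): counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def draft_next(ctx: List[Token]) -> Token:
    """Toy 'draft' model (hypothetical): agrees with the target except after a 4."""
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


def speculative_step(prefix: List[Token], draft: NextToken,
                     target: NextToken, k: int) -> List[Token]:
    """Return the tokens committed by one draft-and-verify round."""
    # 1) The draft model proposes k tokens, one after another.
    proposed: List[Token] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2) The target checks each proposal against its own greedy choice.
    accepted: List[Token] = []
    for tok in proposed:
        expected = target(prefix + accepted)
        if tok != expected:
            # First mismatch: keep everything before it, take the target's token.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3) Every proposal matched, so the target contributes one extra token.
    accepted.append(target(prefix + accepted))
    return accepted


def speculative_generate(prompt: List[Token], draft: NextToken,
                         target: NextToken, max_new: int, k: int = 4) -> List[Token]:
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(out, draft, target, k))
    return out[:len(prompt) + max_new]


def greedy_generate(prompt: List[Token], target: NextToken, max_new: int) -> List[Token]:
    """Plain one-token-at-a-time decoding, for comparison."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out
```

In this greedy setting the speculative output is identical to plain decoding with the target model. The draft model only affects how many tokens are committed per round.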

The AI Veteran • 1 year ago

So, we shouldn't be impressed by longer thinking times? Nice work!

Rodrigo 🇨🇱 WnAI - he/him ه҈̿҈̿҈̿ • 1 year ago

Considering speculative decoding and the Johnson-Lindenstrauss lemma: reducing the dimensionality of the latent space (small model) could distort distances and lose crucial information from the large model. How can we mitigate this effect? Is it like expecting FLAC and getting an MP3?

Caduceus • 1 year ago

🚀 Speculative decoding is revolutionizing #LLM performance with 2–3x speed-ups at inference by computing multiple tokens in parallel! 🌐 At #Caduceus, we’re advancing #AI and #Web3 with cutting-edge edge rendering tech. Explore the future of innovation!

PrimeURL (for Startups 🏆) • 1 year ago

GenAiStudio

Related Videos

Researchers found a way to make LLMs 8.5x faster! (without compromising accuracy)

Speculative decoding is quite an effective way to address the single-token bottleneck in traditional LLM inference. A small "draft" model first generates the next several tokens, then the large model verifies all of them at once in a single forward pass. If a token at any position is wrong, you keep everything before it and restart from there. This never does worse than normal decoding.

But current drafters in speculative decoding still guess one token at a time. That makes the drafting step itself a bottleneck, capping real-world speedups at 2-3x.

DFlash is a new technique that swaps the autoregressive drafter with a lightweight block diffusion model that guesses all tokens in one parallel shot. Drafting cost stays flat no matter how many tokens you speculate. On top of that, the drafter is conditioned on hidden features pulled from multiple layers of the target model and injected into every draft layer, so it makes significantly better guesses than a drafter working from scratch.

In the side-by-side demo below, vanilla decoding runs at 48.5 tokens/sec. DFlash hits 415 tokens/sec on the same model, with zero quality loss. It's already integrated with vLLM, SGLang, and Transformers, with draft models on HuggingFace for several models like Qwen3, Qwen3.5, Llama 3.1, Kimi-K2.5, gpt-oss, and many more. I have shared the GitHub repo in the replies!

KV caching is another must-know technique to boost LLM inference. I recently wrote an article about it. Read it below.

👉 Over to you: What use case are you working on that can benefit from this new technique?
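
To see why token-by-token drafting caps the speedup while a flat drafting cost does not, here is a back-of-the-envelope model. It is a sketch under simplifying assumptions that are mine, not the post's: each drafted token is accepted independently with probability `alpha`, verifying a whole draft costs the same as one target forward pass, and drafting cost is expressed as a fraction of one target pass. The numbers in the usage note are illustrative, not DFlash measurements.

```python
def expected_tokens_per_round(alpha: float, gamma: int) -> float:
    """Expected tokens committed per verify pass when drafting gamma tokens.

    Assumes independent per-token acceptance with probability alpha; the
    count includes the one token the target always contributes.
    """
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def speedup(alpha: float, gamma: int, total_draft_cost: float) -> float:
    """Speedup over plain decoding, where one target pass costs 1.0."""
    return expected_tokens_per_round(alpha, gamma) / (total_draft_cost + 1.0)


def autoregressive_drafter_speedup(alpha: float, gamma: int, cost_per_token: float) -> float:
    # Drafting cost grows linearly with the number of drafted tokens.
    return speedup(alpha, gamma, gamma * cost_per_token)


def flat_cost_drafter_speedup(alpha: float, gamma: int, cost_per_block: float) -> float:
    # Drafting cost is paid once per block, regardless of block length.
    return speedup(alpha, gamma, cost_per_block)


if __name__ == "__main__":
    # Hypothetical values: 80% acceptance, drafter cost 10% of a target pass.
    for gamma in (4, 8, 16):
        ar = autoregressive_drafter_speedup(0.8, gamma, 0.1)
        flat = flat_cost_drafter_speedup(0.8, gamma, 0.1)
        print(f"gamma={gamma:2d}  autoregressive={ar:.2f}x  flat={flat:.2f}x")
```

Under these assumptions, lengthening the draft eventually lowers the speedup of an autoregressive drafter, because drafting cost grows faster than the expected number of accepted tokens. With a flat drafting cost, longer drafts keep helping, although the gain is still bounded by the acceptance rate.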

Avi Chawla

155,880 views • 24 days ago

PowerInfer - a high-speed inference engine for deploying LLMs locally. Just came across this super interesting project on speeding up inference. It's not MoE but it's a simple approach that exploits the high locality in LLM inference to design a GPU-CPU hybrid inference engine. Hot-activated neurons are preloaded onto the GPU for fast access, while cold-activated neurons (the majority) are computed on the CPU. This approach significantly reduces GPU memory demands and CPU-GPU data transfer. It achieves an average token generation rate of 13.20 tokens/s, with a peak of 29.08 tokens/s, across various LLMs on a single NVIDIA RTX 4090 GPU. That is only 18% lower than the rate achieved by a top-tier server-grade A100 GPU. It also significantly outperforms llama.cpp by up to 11.69x while retaining model accuracy. There is a lot more innovation around inference that's coming fast. Really encouraged by the study on sparse computation to enhance the computational efficiency of LLMs. It's now possible to use PowerInfer with Llama 2 and Falcon 40B. Mistral-7B support is coming soon!

elvis

261,557 views • 2 years ago

🧵1/n LLMs significantly improve Evolutionary Algorithms for molecular discovery! For 18 different molecular optimization tasks, we demonstrate how to achieve SOTA performance by incorporating different LLMs! Learn more in our new paper!

Yuanqi Du

17,892 views • 1 year ago

Today's autoregressive models generate one token at a time. Mercury 2 generates tokens in parallel. Over 1,000 tok/sec on standard GPUs, at comparable quality to speed-optimized models. Since launch, the community has been showing what diffusion LLMs can unlock. Thanks to the team at Clyep for the breakdown.

Inception

21,021 views • 19 days ago

🚀 Excited to release LongLive 2.0! 🎬 An end-to-end infrastructure for long video generation, with FP4 and parallelism at the core of both training and inference. ⚡45.7 FPS generation speed on 5B model⚡ ✨ LongLive 2.0 supports real-video training, few-step distillation, multi-shot training/inference, sequence-parallel acceleration, NVFP4 KV cache, and async VAE decoding deployment. 🧩 To our knowledge, this is the first open-source 4-bit long video generation infra that covers both training and inference. 🙌 Welcome to check it out, try it, and share feedback! 🔗 Code: 📰 Paper: 🎥 Demo: #LongVideoGeneration #VideoGeneration #Realtime #AIInfra #EfficientAI #FP4 #Parallel #NVIDIA

Yukang Chen

56,424 views • 15 days ago

Energy efficiency in LLM inference has improved 100,000x in the past 10 years — demonstrating that accelerated computing is sustainable computing. Josh Parker, head of sustainability at NVIDIA, explains how. #ClimateWeekNYC Learn more:

NVIDIA

32,252 views • 8 months ago

◈@Exla_ai speeds up AI on edge devices (e.g. NVIDIA Jetsons) with advanced quantization. Reduce memory use by up to 80% and boost inference speed by 3x–20x— all with just a few lines of code. Congrats on the launch, Viraat & Pranav!

Y Combinator

19,937 views • 1 year ago

The example below uses prompt-based speculative decoding. Specifically, n-gram hashing is used to suggest drafts of up to 64 tokens. The hasher keeps track of n-grams in the observed contexts, so it is mostly effective for coding tasks. Here is another demo:
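
A rough sketch of the n-gram drafting idea: index every n-gram seen in the context, and when the most recent n tokens have appeared before, propose whatever followed them last time as the draft. This is an illustration only, not the implementation shown in the demo. The function names are hypothetical, a plain dictionary stands in for the hasher, and the index is rebuilt on every call for simplicity, where a real implementation would update it incrementally.

```python
from typing import Dict, List, Tuple


def build_ngram_index(tokens: List[int], n: int) -> Dict[Tuple[int, ...], int]:
    """Map each n-gram to the position just after its most recent occurrence.

    The final n-gram of the context is skipped on purpose: nothing follows it yet.
    """
    index: Dict[Tuple[int, ...], int] = {}
    for i in range(len(tokens) - n):
        index[tuple(tokens[i:i + n])] = i + n
    return index


def propose_draft(tokens: List[int], n: int = 3, max_draft: int = 64) -> List[int]:
    """Propose up to max_draft tokens by replaying an earlier continuation."""
    if len(tokens) <= n:
        return []
    index = build_ngram_index(tokens, n)
    start = index.get(tuple(tokens[-n:]))
    if start is None:
        return []  # No earlier occurrence: fall back to normal decoding.
    return tokens[start:start + max_draft]
```

The proposed tokens would then be verified by the model in the same way as any other draft. Repetitive contexts such as code edits tend to produce long matches, which fits the post's remark that the approach works best for coding tasks.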

Georgi Gerganov

29,592 views • 2 months ago

New course: Transformers in Practice. You'll get a practical view of how transformer-based LLMs work, so you can reason about their behavior, diagnose problems like slow inference, and make smarter decisions about deployment. This course is built in partnership with AMD and taught by Sharon Zhou. You'll see how transformers generate text one token at a time, how the model decides which earlier words matter most when predicting the next one, and how techniques like quantization speed up inference on GPUs. This is not a video-only course; interactive visualizations throughout let you play with these concepts and build intuition that sticks. Skills you'll gain:
- Understand why LLMs hallucinate, and how RAG and chain-of-thought shape what they generate
- Look inside the model to see how attention and layers combine to predict the next token
- Diagnose inference bottlenecks and learn the techniques that speed up transformers on GPUs

Join and understand what's really happening inside your LLMs:

Andrew Ng

110,864 views • 20 days ago

A great tool to estimate how much VRAM your LLMs actually need. Alter the hardware config, quantization, etc., and it tells you about:
- Generation speed (tokens/sec)
- Precise memory allocation
- System throughput, etc.

No more VRAM guessing!
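
For a sense of the arithmetic behind such an estimate, the two largest terms are usually the model weights and the KV cache. The sketch below is not how the tool in the post computes its numbers. It is a first-order approximation that ignores activations, framework overhead, and fragmentation, and the model configuration in the usage example is hypothetical.

```python
def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Memory for the weights at a given quantization level."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_elem: int = 2) -> int:
    """Memory for the KV cache: one K and one V tensor per layer.

    bytes_per_elem defaults to 2, i.e. an FP16/BF16 cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


if __name__ == "__main__":
    # Hypothetical 8B-parameter model, 4-bit weights, 8K context, batch size 1.
    weights = weight_bytes(8e9, bits_per_weight=4)
    cache = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                           seq_len=8192, batch_size=1)
    gib = 1024 ** 3
    print(f"weights:  {weights / gib:.2f} GiB")
    print(f"KV cache: {cache / gib:.2f} GiB")
    print(f"total:    {(weights + cache) / gib:.2f} GiB (before overhead)")
```

The KV cache term scales linearly with both context length and batch size, which is why long-context or high-concurrency serving can need far more memory than the weights alone suggest.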

Avi Chawla

22,147 views • 7 months ago

Listen to Samar Khanna explain why parallel generation, rather than sequential, raises the performance ceiling for language models. Learn more about diffusion LLMs. → We're hiring:

Inception

18,223 views • 3 months ago

LLM inference speed with vs. without KV caching: (learn how and why it works below)

Daily Dose of Data Science

59,218 views • 1 month ago

LLM inference speed with vs. without KV caching: (learn how and why it works below)

Avi Chawla

394,519 views • 2 months ago

There have been some questions about what LCR is and how it’s used in AI models. In this clip, Andrej breaks down how LLMs leverage LCR to pull real-time, up-to-date data from the web. To learn more, check out his tweet on how LCR will change AI forever.

Grass

268,880 views • 1 year ago

⚔️ Power up your gameplay Use equipped gear to boost speed, strength, health, and more 💪 Learn how to build Power-Ups in Game Maker 👉

The Sandbox

10,757 views • 1 year ago

Learn a fundamental technique for speeding up character retopology significantly

FlippedNormals

145,566 views • 1 year ago

Introducing DuoAttention: Our new framework slashes both memory and latency for long-context LLMs without sacrificing performance! By applying full KV cache only to critical heads, we achieve: ⚡ 2.55x memory reduction ⚡ 2.18x decoding speedup ⚡ 3.3M tokens on a single A100 GPU

Guangxuan Xiao

31,023 views • 1 year ago

As AI labs race to train and deploy new frontier models, existing models become more affordable with better tokenomics. ✨ "Everybody's trying to get to the next frontier. And every time they get to the next frontier, the last generation AI tokens, the cost starts to decline about a factor of 10x every year," said NVIDIA CEO Jensen Huang in a recent keynote. Model optimization techniques such as speculative decoding and multi-token prediction, combined with inference serving platforms like NVIDIA Dynamo on NVIDIA Blackwell NVL72 systems, enable AI factories to boost throughput by 10x with one-tenth of the cost per token. Learn more about AI factory tokenomics ➡️

NVIDIA AI

16,035 views • 4 months ago

[1/N] 🚀 New decoding paradigm drop! 🚀 Introducing Lookahead Reasoning (LR): step-level speculation that stacks with Speculative Decoding (SD). It has been accepted to #NeurIPS2025 🎉 📖 Blog: 💻 Code: 📄 Paper:

Hao AI Lab

42,989 views • 8 months ago

Learn how to build an optimized LLM inference system from the ground up in our new short course, Efficiently Serving LLMs, built in collaboration with Predibase by Rubrik and taught by Travis Addair. Whether you're serving your own LLM or using a model hosting service, this course will give you a deep understanding of the optimizations required to efficiently serve many users at once.
- Learn how LLMs generate text one token at a time, and how techniques like KV caching, continuous batching, and quantization speed things up and optimize memory usage for serving multiple users.
- Benchmark the performance of these LLM optimizations to explore the trade-offs between quickly responding to an individual user’s request vs. serving many users at once.
- Use techniques like low-rank adaptation (LoRA) to efficiently serve hundreds of unique, custom fine-tuned models on a single device, without sacrificing throughput.
- Use Predibase's LoRAX framework to see optimization techniques in action on a real LLM server.

Sign up here:

Andrew Ng

104,727 views • 2 years ago