The example below uses prompt-based speculative decoding. Specifically, n-gram hashing is used to suggest drafts of up to 64 tokens. The hasher keeps track of n-grams in the observed contexts, so it is mostly effective for coding tasks, where the output often repeats spans that already appear in the context. Here is another demo:

Georgi Gerganov

62,376 subscribers

29,815 views • 3 months ago •via X (Twitter)
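
The n-gram drafting idea from the post can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation used in the demo: the names `NgramDrafter`, `observe`, `draft`, and `accept_prefix` are hypothetical, the n-gram length of 3 is an arbitrary choice, and only the 64-token draft limit comes from the post. A Python dict keyed by token tuples stands in for the hasher.

```python
from typing import Dict, List, Tuple


class NgramDrafter:
    """Toy n-gram drafter (hypothetical): maps each n-gram seen in the
    context to the position where its most recent continuation starts."""

    def __init__(self, n: int = 3, max_draft: int = 64) -> None:
        self.n = n                  # n-gram length (assumed value)
        self.max_draft = max_draft  # draft limit, 64 as in the post
        self.tokens: List[int] = []
        self.index: Dict[Tuple[int, ...], int] = {}

    def observe(self, new_tokens: List[int]) -> None:
        """Add tokens to the observed context and update the n-gram index."""
        for tok in new_tokens:
            self.tokens.append(tok)
            pos = len(self.tokens) - 1  # position of the token just added
            if pos >= self.n:
                # The n tokens preceding `tok` are the key; `pos` is where
                # the continuation of that n-gram begins.
                key = tuple(self.tokens[pos - self.n:pos])
                self.index[key] = pos

    def draft(self) -> List[int]:
        """Propose up to max_draft tokens by replaying what followed the
        last occurrence of the current trailing n-gram."""
        if len(self.tokens) < self.n:
            return []
        key = tuple(self.tokens[-self.n:])
        start = self.index.get(key)
        if start is None:
            return []  # n-gram not seen before: no draft this step
        return self.tokens[start:start + self.max_draft]


def accept_prefix(draft: List[int], verified: List[int]) -> List[int]:
    """Keep draft tokens up to the first disagreement with the tokens the
    target model produced when verifying the draft."""
    accepted: List[int] = []
    for d, v in zip(draft, verified):
        if d != v:
            break
        accepted.append(d)
    return accepted


if __name__ == "__main__":
    drafter = NgramDrafter(n=3, max_draft=64)
    drafter.observe([1, 2, 3, 4, 5, 1, 2, 3])
    print(drafter.draft())
```

In a real decoder the target model would score the whole draft in one forward pass and `accept_prefix` would decide how many tokens to keep. Repetitive contexts such as code tend to produce longer accepted prefixes, which fits the post's remark that the approach is mostly effective for coding tasks.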

