A tricky LLM interview question: You're serving a reasoning model on vLLM, and it keeps running out of GPU memory on long traces. So you add KV cache compression and evict 90% of the cached tokens. VRAM usage stays the same and the GPU still runs out of memory. Why?...

223,682 views • 1 day ago • via X (Twitter)

Related Videos

I had to test it myself to believe this unreal inference speed. 3,000 tokens/s for 1 user on standard datacenter GPUs. They leveraged a hidden efficiency gap in how GPUs generate tokens. Kog just achieved 3,000 tokens/s on 8× AMD MI300X GPUs and 2,100 on 8× NVIDIA H200 (FP16, no speculative decoding). Their tech preview is on a 2B model, and they show how their techniques will scale to large frontier MoE models at similar speeds. That's a huge number because normal low-batch GPU decoding for 2B to 8B models is usually closer to 100 to 300 tokens/s per request, so Kog is claiming something like a 10X to 30X jump in the speed one user actually feels. Their trick: they are getting the speed by treating LLM decoding as a memory streaming problem, not mainly a math problem. For 1 user at batch size 1, the GPU is not doing big, efficient matrix-matrix work like in training or large-batch serving; it is repeatedly pulling the model’s active weights from high-bandwidth memory for each new token, so speed depends on how smoothly those weights keep flowing. Normal inference stacks keep breaking that flow. They run many separate GPU programs for different parts of the model, move intermediate results through memory, wait at synchronization points, talk back to the CPU for scheduling or sampling, and then repeat this token after token. Kog’s answer is to co-design 3 things that are usually tuned separately: the runtime, the low-level GPU code, and the model architecture. The biggest engineering move is the monokernel, where the whole decode pass runs as 1 persistent GPU-resident program, including sampling, so the system does not keep stopping for kernel launches, CPU scheduling, and intermediate memory round trips. They also rebuilt synchronization, because their own measurements say grid sync was eating around 35% of token-generation time; instead of making every compute unit wait at a broad barrier, each unit waits only for the exact data it needs. 
On AMD MI300X, they also map memory access around the chiplet layout, because memory latency changes depending on which die makes the request. Then their Laneformer model uses Delayed Tensor Parallelism, which lets cross-GPU communication happen in the background instead of blocking every layer.
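The "memory streaming" framing above can be turned into a back-of-envelope bound. The sketch below assumes that at batch size 1 every generated token requires reading all active weights from memory once, and it ignores KV cache reads, activations, and cross-GPU communication. The function name and the 4 TB/s bandwidth figure are illustrative assumptions, not numbers from the post or from any vendor datasheet.

```python
def max_decode_tokens_per_s(params_billion: float,
                            bytes_per_param: float,
                            bandwidth_tb_s: float) -> float:
    """Upper bound on batch-1 decode speed when each token streams all weights once."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    bandwidth_bytes_per_s = bandwidth_tb_s * 1e12
    return bandwidth_bytes_per_s / weight_bytes


# A 2B-parameter model in FP16 (2 bytes per parameter) is 4 GB of weights.
# ASSUMPTION: 4 TB/s of memory bandwidth per GPU, a round illustrative value.
single_gpu_bound = max_decode_tokens_per_s(2, 2, 4.0)
# With weights sharded over 8 GPUs, aggregate bandwidth is 8x (ideal case).
eight_gpu_bound = max_decode_tokens_per_s(2, 2, 8 * 4.0)

# The post's claimed jump: 3,000 tokens/s against a 100 to 300 tokens/s baseline.
speedup_low = 3000 / 300
speedup_high = 3000 / 100

print(f"single-GPU bound: {single_gpu_bound:.0f} tokens/s")
print(f"8-GPU bound:      {eight_gpu_bound:.0f} tokens/s")
print(f"claimed speedup:  {speedup_low:.0f}x to {speedup_high:.0f}x")
```

Under these assumptions the bound depends only on bytes moved per token, which is why kernel launches, barriers, and CPU round trips that stall the weight stream show up directly as lost tokens per second.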

Rohan Paul

13,148 views • 1 month ago

Researchers found a way to make LLMs 8.5x faster! (without compromising accuracy) Speculative decoding is quite an effective way to address the single-token bottleneck in traditional LLM inference. A small "draft" model first generates the next several tokens, then the large model verifies all of them at once in a single forward pass. If a token at any position is wrong, you keep everything before it and restart from there. The output is never worse than with normal decoding. But current drafters in speculative decoding still guess one token at a time. That makes the drafting step itself a bottleneck, capping real-world speedups at 2-3x. DFlash is a new technique that swaps the autoregressive drafter for a lightweight block diffusion model that guesses all tokens in one parallel shot. Drafting cost stays flat no matter how many tokens you speculate. On top of that, the drafter is conditioned on hidden features pulled from multiple layers of the target model and injected into every draft layer, so it makes significantly better guesses than a drafter working from scratch. In the side-by-side demo below, vanilla decoding runs at 48.5 tokens/sec. DFlash hits 415 tokens/sec on the same model, with zero quality loss. It's already integrated with vLLM, SGLang, and Transformers, with draft models on HuggingFace for several models like Qwen3, Qwen3.5, Llama 3.1, Kimi-K2.5, gpt-oss, and many more. I have shared the GitHub repo in the replies! KV caching is another must-know technique to boost LLM inference. I recently wrote an article about it. Read it below. 👉 Over to you: What use case are you working on that can benefit from this new technique?
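The draft-then-verify loop described above can be sketched with toy models. This is a minimal greedy-decoding illustration, not DFlash and not any library's API: `toy_target` and `toy_draft` are hypothetical stand-ins for the large and small models, and the verification loop simulates what a real system would do in one batched forward pass. Appending the target's own token at the first mismatch (and a bonus token when everything is accepted) is a common convention assumed here, not something the post specifies.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # greedy "model": sequence -> next token


def speculative_step(prefix: List[int], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[int]:
    """Draft k tokens, keep the longest prefix the target agrees with."""
    draft: List[int] = []
    for _ in range(k):
        draft.append(draft_next(prefix + draft))

    accepted: List[int] = []
    for tok in draft:
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the wrong guess, drop the rest
            return accepted
        accepted.append(tok)
    accepted.append(target_next(prefix + accepted))  # bonus token
    return accepted


def generate(prefix: List[int], draft_next: NextToken,
             target_next: NextToken, k: int, n_new: int) -> List[int]:
    out = list(prefix)
    while len(out) - len(prefix) < n_new:
        out.extend(speculative_step(out, draft_next, target_next, k))
    return out[:len(prefix) + n_new]


def greedy_decode(prefix: List[int], target_next: NextToken, n_new: int) -> List[int]:
    out = list(prefix)
    for _ in range(n_new):
        out.append(target_next(out))
    return out


# Hypothetical toy models: the target counts up modulo 10,
# the draft agrees except that it guesses wrong right after a 4.
def toy_target(seq: List[int]) -> int:
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[int]) -> int:
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10


first_step = speculative_step([1], toy_draft, toy_target, k=4)
speculative_out = generate([1], toy_draft, toy_target, k=4, n_new=12)
baseline_out = greedy_decode([1], toy_target, n_new=12)
print(first_step, speculative_out == baseline_out)
```

Every step emits at least one token chosen by the target, so the sequence matches plain greedy decoding even when the drafter guesses badly; the speedup comes only from how many drafted tokens survive verification.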

Avi Chawla

157,390 views • 1 month ago

This is one of the craziest AI launches of 2026 and it came out of basically nowhere (Save this). A company called Subquadratic just shipped SubQ, and the benchmarks are almost hard to believe. To understand why this is such a big deal, you have to understand the fundamental problem that has defined AI for the last decade. Every large language model in existence is built on the transformer architecture, and transformers use a mechanism called standard attention that checks every single word in a sequence against every other word. Double the context length and compute doesn't double, it quadruples; triple it and compute goes up nine times. This quadratic scaling is why frontier models have been stuck at roughly 1 million tokens, why running them at those lengths gets expensive fast, and why the AI labs have essentially been printing money charging you more the longer you need the model to think. The industry has known this problem existed since 2017 but they scaled it anyway. SubQ is built from the ground up to solve it. Instead of processing every possible token relationship, SubQ's sparse attention architecture identifies which relationships actually matter and ignores the rest, meaning compute is used where it counts and wasted nowhere else. The result is that compute scales linearly with context length instead of quadratically, and the implications of that one architectural shift are enormous. At 12 million tokens, SubQ reduces attention compute by nearly 1,000x compared to standard frontier models, and at 1 million tokens, it runs 52x faster than FlashAttention. And it does all of this while posting frontier-level accuracy, scoring 95% on the RULER 128K long-context benchmark versus Claude Opus 4.6's 94.8%, and an 81.8 on SWE-Bench Verified coding tasks, besting Opus 4.6 (80.8) and DeepSeek 4.0 Pro. The cost comparison is where it gets genuinely insane. SubQ runs at under $1.50 per million tokens, less than 5% of what Claude Opus charges.
On the RULER benchmark, running the test with SubQ cost $8, while running the same test with Claude Opus cost $2,600. That's roughly a 300x cost reduction at equivalent or better accuracy. Subquadratic launched with $29 million in funding. SubQ is available today for early access via API, and SubQ Code, a coding agent built on the architecture, ships alongside it. The transformer has been the unchallenged foundation of every major AI system since 2017. SubQ is the first serious evidence that something structurally better might have just arrived.
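The scaling arithmetic in the post can be checked in a few lines. This is a minimal sketch that models dense attention cost as proportional to the square of the context length and compares it with a linear cost; the function name is illustrative, and constant factors and everything outside attention are ignored.

```python
def cost_growth(length_multiplier: float, exponent: int) -> float:
    """How much compute grows when context length grows by length_multiplier."""
    return length_multiplier ** exponent


# Dense attention compares every token with every other token: cost ~ n^2.
quadratic_double = cost_growth(2, exponent=2)   # doubling the context
quadratic_triple = cost_growth(3, exponent=2)   # tripling the context
quadratic_12x = cost_growth(12, exponent=2)     # going from 1M to 12M tokens

# A linear-cost mechanism grows only as fast as the context itself.
linear_double = cost_growth(2, exponent=1)
linear_12x = cost_growth(12, exponent=1)

# The RULER cost figures quoted in the post: $2,600 versus $8.
ruler_cost_ratio = 2600 / 8

print(f"quadratic: 2x context -> {quadratic_double:.0f}x, 3x -> {quadratic_triple:.0f}x")
print(f"1M -> 12M tokens: quadratic {quadratic_12x:.0f}x vs linear {linear_12x:.0f}x")
print(f"RULER cost ratio: {ruler_cost_ratio:.0f}x")
```

Quadratic growth is polynomial, which is a much milder curve than exponential growth, yet it is still steep enough that a 12x longer context costs 144x more attention compute under this model.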

Milk Road AI

277,585 views • 1 month ago

🚨 BREAKING: THERE ARE RUMORS YOU CAN NOW CREATE "SAFE TOKENS" DIRECTLY ON ETHERVISTADEX What are "Safe Tokens"? "Safe Tokens" are tokens generated through our SafeTokenFactory smart contract. These tokens are designed to eliminate vulnerabilities such as mintable functions or scammy taxes and come with a standardized implementation. Before swapping, users can easily verify whether a token is "safe" or if additional caution is needed. This marks a significant step forward in enhancing the quality of projects launched on Ethervista. But does this compromise the customizability of ERC tokens? Not at all. The Ethervista Protocol smart contract allows for a limitless range of applications. Take the $VISTA contract, for example. It's a standard ERC20 token, but with the Ethervista Protocol smart contract, it automatically buys and burns tokens. Similar logic can be applied to any ERC20 token using EthervistaDEX’s unique Protocol feature. What other features would you like to see? Wen dashboards? Wen streaming? We're on it—we just hired a full-time full-stack engineer! Special shoutout to Bonzi - FIRST MEME and MASCOT @ Ethervista and Clippy - Microsoft Anti AI Helper @ Ethervista, the first whitelisted tokens. We will continue to strongly support tokens that burn part of their liquidity before the 5-day lock period and those with strong communities and utility. A final note to creators: We would like to emphasize that burning lp-tokens does not alter your share of rewards UNTIL you remove, add, or claim rewards, which automatically updates your pool share ratio based on your current balance and the total lp-supply, as outlined in our whitepaper. This DOES NOT affect protocol fees, which are used to support both the protocol and creators.

Ethervista

130,489 views • 1 year ago