A tricky LLM interview question: You're serving a reasoning model on vLLM, and it keeps running out of GPU memory on long traces. So you add KV cache compression and evict 90% of the cached tokens. VRAM usage stays where it was, and the GPU still runs out of memory. Why? (Answer below.)

Evicting 90% of the KV cache can free almost none of the memory it was using. This sounds counterintuitive, but it follows directly from how production servers store the cache today.

The KV cache grows with every token a model generates. Each token appends its key and value vectors at every layer, and nothing is freed while generation continues. This is the dominant memory cost for reasoning models. A 32K-token chain of thought caches roughly 32K tokens of KV vectors, and Qwen3-32B with 4-bit weights will run out of memory around 24K tokens on a 24GB GPU.

One obvious solution is to keep the important tokens and drop the rest, since attention is sparse enough to allow it. But this alone does not solve the memory problem.

The reason is paged attention, the memory manager behind vLLM and most production servers. Under the hood, it splits GPU memory into fixed-size physical blocks, each holding the KV for about 16 tokens. A block returns to the allocator only when every slot inside it is empty. The eviction logic selects tokens by importance, and the tokens it keeps are scattered across blocks, so after eviction most blocks are still left with at least one surviving token. For instance, if the logic evicts 14K of 16K tokens across 1,000 blocks, the large majority of blocks will still hold a survivor. The allocator therefore gets back far fewer blocks than the eviction ratio suggests.

Placing new tokens into the freed slots is not ideal either, because it breaks the cache's layout. Say token 16,001 arrives and is placed in the slot that token 40 used to hold. The cache now reads position 39, then 16,001, then 41, so it is no longer in token order.
Attention can still compute the right answer from an out-of-order cache, but only if every slot carries a separate record of which position it actually holds. That is a bookkeeping cost an in-order layout avoids. So the cache is logically 90% smaller and physically close to the same size. Many compression results miss this because they measure on pre-allocated contiguous tensors rather than a paged server.

There's another problem. Eviction methods pick which tokens to keep by looking at the attention scores themselves. But the fast attention kernels used in production, like FlashAttention, never save those scores. They compute attention in small pieces and discard the full score grid as they go, which is also why they're fast. So the exact signal eviction methods need isn't available in memory. The workaround is to fall back to eager attention and build the full matrix, which gives up the speed FlashAttention was there to provide.

NVIDIA published a method called TriAttention to solve both of these problems. It never needs attention scores. Instead, it scores tokens from the geometry of the model's key and query vectors before RoPE is applied, where those vectors sit in stable clusters. For the memory problem, it runs a compaction pass every 128 decoded tokens. The surviving tokens slide forward to close the holes eviction creates, so whole blocks empty out and return to the allocator while the cache stays in token order. On long reasoning traces, the approach matches full-attention accuracy while decoding 2.5x faster and using 10.7x less KV memory.

KV cache compression is a big infrastructure problem. The number that decides whether it works is the count of freed blocks, not the count of evicted tokens.

You can find the NVIDIA write-up here:

I wrote a first-principles breakdown of how the KV cache works.
It walks through why the model stores keys and values at all, why the cache grows with every token, and how LLM generation speed compares with and without KV caching. Read it below.
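The gap between evicted tokens and freed blocks, and what a compaction pass changes, can be checked with a small simulation. This is a toy sketch, not vLLM's allocator or TriAttention's implementation: it assumes survivors are chosen uniformly at random (real importance-based policies may cluster differently), and the helper names `evict_random`, `blocks_in_use`, and `compact` are made up for illustration. The block size and token counts come from the post's example.

```python
import math
import random

BLOCK_SIZE = 16      # tokens per physical block, as in the post
NUM_TOKENS = 16_000  # cached tokens before eviction
NUM_KEPT = 2_000     # survivors after evicting 14K tokens


def evict_random(num_tokens, num_kept, seed=0):
    """Return the slot list after eviction; None marks an evicted slot.
    Assumption: survivors are picked uniformly at random."""
    rng = random.Random(seed)
    kept = set(rng.sample(range(num_tokens), num_kept))
    return [pos if pos in kept else None for pos in range(num_tokens)]


def blocks_in_use(slots):
    """Count blocks that still hold at least one live token."""
    return sum(
        any(tok is not None for tok in slots[start:start + BLOCK_SIZE])
        for start in range(0, len(slots), BLOCK_SIZE)
    )


def compact(slots):
    """Slide survivors forward, preserving token order and closing holes."""
    return [tok for tok in slots if tok is not None]


def expected_empty_fraction(total, kept, block_size):
    """Chance that one block keeps no survivor under uniform selection."""
    evicted = total - kept
    return math.prod((evicted - i) / (total - i) for i in range(block_size))


total_blocks = NUM_TOKENS // BLOCK_SIZE
slots = evict_random(NUM_TOKENS, NUM_KEPT)
used_before = blocks_in_use(slots)
compacted = compact(slots)
used_after = blocks_in_use(compacted)
empty_fraction = expected_empty_fraction(NUM_TOKENS, NUM_KEPT, BLOCK_SIZE)

print(f"tokens evicted:           {1 - NUM_KEPT / NUM_TOKENS:.1%}")
print(f"blocks held, no compact:  {used_before}/{total_blocks}")
print(f"blocks held, compacted:   {used_after}/{total_blocks}")
print(f"expected empty fraction:  {empty_fraction:.1%}")
```

Under the uniform assumption, evicting 87.5% of the tokens empties only about 12% of the blocks, while compaction brings the footprint down to the 125 blocks that 2,000 tokens actually need.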

Avi Chawla

71,031 subscribers

223,682 views • 1 day ago •via X (Twitter)

Science & Technology

Related Videos

I added KV caching and INT8 KV quantization to our transformer inference, improving throughput by 35x. All of this was done from scratch in Rust + CUDA, on top of a homemade ML framework.

On a 4-token prompt with 252 generated tokens:
- Original: 0.76 tok/s
- KV cache fp32: 27.21 tok/s
- KV cache int8 (quantized): 27.29 tok/s

Try it out yourself here:

In practice:
- KV caching gave us about a 35x end-to-end speedup
- INT8 KV cache kept roughly the same speed as fp32 but cut KV cache memory by 3.78x

The FP32 cache used 4.5 MB in this run, while the INT8 cache used only 1.19 MB. This simple change to inference created a huge impact on performance. To learn more about the KV cache and other optimizations like this, check out the blog at
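For readers who want to see what INT8 KV quantization does to a single cached vector, here is a minimal sketch in Python rather than the post's Rust + CUDA. It assumes symmetric quantization with one scale per vector; the post does not say which scheme its framework uses, and the function names are illustrative.

```python
def quantize_int8(vec):
    """Symmetric INT8 quantization with one scale per vector (an assumed
    scheme). Returns the integer codes and the scale."""
    max_abs = max(abs(x) for x in vec)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(x / scale))) for x in vec]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the INT8 codes."""
    return [c * scale for c in codes]


vec = [0.5, -1.27, 0.0, 1.0]
codes, scale = quantize_int8(vec)
restored = dequantize(codes, scale)
max_error = max(abs(a - b) for a, b in zip(vec, restored))

# Ratios computed from the numbers reported in the post.
speedup = 27.21 / 0.76       # KV cache fp32 vs. no cache
memory_ratio = 4.5 / 1.19    # FP32 cache vs. INT8 cache

print(codes, max_error)
print(f"speedup {speedup:.1f}x, memory {memory_ratio:.2f}x")
```

Each value drops from 4 bytes to 1, and the stored scales are extra overhead, so a ratio somewhat below 4x is what a scheme like this one would be expected to give.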

Reese Chong

52,588 views • 2 months ago

Harrison Chase (LangChain CEO) just walked through four ways to give an agent memory. All four assume the model is still holding the right tokens. It isn't. At token 4,096 the cache ran a silent eviction nobody wrote. The user's name was in that batch. First founder to write the eviction policy ships a 100B agent that remembers a person.

Rohit

108,519 views • 2 months ago

I had to test it myself to believe this unreal inference speed. 3,000 tokens/s for 1 user on standard datacenter GPUs. They leveraged a hidden efficiency gap in how GPUs generate tokens. Kog just achieved 3,000 tokens/s on 8× AMD MI300X GPUs and 2,100 on 8× NVIDIA H200 (FP16, no speculative decoding). Their tech preview is on a 2B model, and they show how their techniques will scale to large frontier MoE models at similar speeds. That's a huge number because normal low-batch GPU decoding for 2B to 8B models is usually closer to 100 to 300 tokens/s per request, so Kog is claiming something like a 10X to 30X jump in the speed one user actually feels. Their trick: they are getting the speed by treating LLM decoding as a memory streaming problem, not mainly a math problem. For 1 user at batch size 1, the GPU is not doing big, efficient matrix-matrix work like in training or large-batch serving; it is repeatedly pulling the model’s active weights from high-bandwidth memory for each new token, so speed depends on how smoothly those weights keep flowing. Normal inference stacks keep breaking that flow. They run many separate GPU programs for different parts of the model, move intermediate results through memory, wait at synchronization points, talk back to the CPU for scheduling or sampling, and then repeat this token after token. Kog’s answer is to co-design 3 things that are usually tuned separately: the runtime, the low-level GPU code, and the model architecture. The biggest engineering move is the monokernel, where the whole decode pass runs as 1 persistent GPU-resident program, including sampling, so the system does not keep stopping for kernel launches, CPU scheduling, and intermediate memory round trips. They also rebuilt synchronization, because their own measurements say grid sync was eating around 35% of token-generation time; instead of making every compute unit wait at a broad barrier, each unit waits only for the exact data it needs. 
On AMD MI300X, they also map memory access around the chiplet layout, because memory latency changes depending on which die makes the request. Then their Laneformer model uses Delayed Tensor Parallelism, which lets cross-GPU communication happen in the background instead of blocking every layer.
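The "memory streaming" framing can be turned into a rough ceiling: if every decoded token requires reading all weights once, tokens per second cannot exceed aggregate memory bandwidth divided by weight bytes. The sketch below is a back-of-the-envelope estimate, not Kog's methodology. The per-GPU bandwidth figures (about 5.3 TB/s for MI300X, about 4.8 TB/s for H200) are approximate vendor peak numbers and should be treated as assumptions, as should perfect sharding across 8 GPUs and ignoring KV cache reads and communication.

```python
def decode_upper_bound_tok_s(params, bytes_per_param, num_gpus, hbm_bytes_per_s):
    """Bandwidth-only ceiling on single-stream decode speed.
    Assumes weights are sharded evenly and read once per token."""
    weight_bytes = params * bytes_per_param
    aggregate_bandwidth = num_gpus * hbm_bytes_per_s
    return aggregate_bandwidth / weight_bytes


PARAMS = 2e9          # 2B-parameter preview model, from the post
FP16_BYTES = 2        # FP16, from the post
MI300X_BW = 5.3e12    # assumed approximate peak HBM bandwidth, bytes/s
H200_BW = 4.8e12      # assumed approximate peak HBM bandwidth, bytes/s

bound_mi300x = decode_upper_bound_tok_s(PARAMS, FP16_BYTES, 8, MI300X_BW)
bound_h200 = decode_upper_bound_tok_s(PARAMS, FP16_BYTES, 8, H200_BW)

# Reported speeds from the post, as a share of the bandwidth ceiling.
share_mi300x = 3000 / bound_mi300x
share_h200 = 2100 / bound_h200

print(f"MI300X ceiling {bound_mi300x:,.0f} tok/s, reported share {share_mi300x:.0%}")
print(f"H200   ceiling {bound_h200:,.0f} tok/s, reported share {share_h200:.0%}")
```

Under these assumptions the reported speeds sit at roughly a quarter of the bandwidth ceiling, which suggests why removing launch, synchronization, and round-trip stalls matters more than raw math throughput at batch size 1.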

Rohan Paul

13,148 views • 1 month ago

after turboquant and qwen3.5-35b-a3b, i got curious: how realistic is it to use kv cache as a document store today? to have vectorless, RAG-less search. so i prefilled 258K out of 262K context window on L4 (a budget GPU popular in prod). ~99% of the slot is pre-computed and stored, users load it on the fly in ~1s. system prompt + query append to the end, generation takes ~3K tokens, enough for search. at 99% fill rate, decoding runs ~20 tps on L4. i prepared some ego datasets (jina papers, which i know best), plus popular novels in chinese and english. the results are actually pretty good. some hallucination, but most answers are solid and well-grounded. what's more interesting is the cost: ~$0.26/h on L4 spot. single LLM. no vector database, no embedding model, no workflow/pipeline engineering. using kv cache as document store is nothing new, like the old CAG paper. but with quantized kv cache and modern attention (hybrid SSM-attention, GQA, MQA, MLA), the economics are changing fast. if we solve cold-prefill speed and decoding speed, and budget GPU costs keep dropping, the future of search could be vectorless. radical, but possible.

Han Xiao

42,344 views • 3 months ago

New course on serving LLMs efficiently -- how do you serve models to many concurrent users at low latency and reasonable cost? This short course is built with Red Hat and taught by Cedric Clyburn.

Efficient LLM serving requires efficient memory management. A 70B-parameter model takes ~140 GB just to load the weights. On top of that, every active request needs its own chunk of GPU memory, the KV cache, to store the token context it has built up so far.

In this course, you'll learn to reduce a model's memory footprint with quantization and serve it using vLLM, which handles many concurrent requests efficiently through smart memory management.

Skills you'll gain:
- Quantize a model and measure the accuracy tradeoff
- Serve a model with vLLM and watch it handle concurrent requests efficiently
- Benchmark your deployment and make informed tradeoffs between speed, cost, and accuracy

Join and learn to serve LLMs efficiently:
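The ~140 GB figure follows from parameter count times bytes per parameter. A minimal sketch of that arithmetic, assuming 2 bytes per parameter for 16-bit weights and ignoring any per-group quantization metadata; the dtype table and function name are illustrative, not taken from the course.

```python
# Bytes per parameter for a few common weight formats (metadata ignored).
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params, dtype):
    """Memory needed just to hold the weights, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


footprints = {dtype: weight_memory_gb(70e9, dtype) for dtype in BYTES_PER_PARAM}
for dtype, gigabytes in footprints.items():
    print(f"70B @ {dtype}: {gigabytes:.0f} GB")
```

This covers weights only; the per-request KV cache the post mentions comes on top of it.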

Andrew Ng

116,252 views • 24 days ago

Introducing SubQ - a major breakthrough in LLM intelligence. It is the first model built on a fully sub-quadratic sparse-attention architecture (SSA), and the first frontier model with a 12 million token context window, which is:
- 52x faster than FlashAttention at 1MM tokens
- Less than 5% the cost of Opus

Transformer-based LLMs waste compute by processing every possible relationship between words (standard attention). Only a small fraction actually matter. Subquadratic finds and focuses only on the ones that do. That's nearly 1,000x less compute and a new way for LLMs to scale.

Alexander Whedon

13,120,941 views • 1 month ago

Researchers found a way to make LLMs 8.5x faster! (without compromising accuracy)

Speculative decoding is quite an effective way to address the single-token bottleneck in traditional LLM inference. A small "draft" model first generates the next several tokens, then the large model verifies all of them at once in a single forward pass. If a token at any position is wrong, you keep everything before it and restart from there. This never does worse than normal decoding.

But current drafters in speculative decoding still guess one token at a time. That makes the drafting step itself a bottleneck, capping real-world speedups at 2-3x.

DFlash is a new technique that swaps the autoregressive drafter for a lightweight block diffusion model that guesses all tokens in one parallel shot. Drafting cost stays flat no matter how many tokens you speculate. On top of that, the drafter is conditioned on hidden features pulled from multiple layers of the target model and injected into every draft layer, so it makes significantly better guesses than a drafter working from scratch.

In the side-by-side demo below, vanilla decoding runs at 48.5 tokens/sec. DFlash hits 415 tokens/sec on the same model, with zero quality loss. It's already integrated with vLLM, SGLang, and Transformers, with draft models on HuggingFace for several models like Qwen3, Qwen3.5, Llama 3.1, Kimi-K2.5, gpt-oss, and many more. I have shared the GitHub repo in the replies!

KV caching is another must-know technique to boost LLM inference. I recently wrote an article about it. Read it below.

👉 Over to you: What use case are you working on that can benefit from this new technique?
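The draft-then-verify loop described above can be sketched with toy "models" that map a context to the next token. This shows plain greedy speculative decoding with an autoregressive drafter, not DFlash's block diffusion drafter. The function names and the toy models are made up, and the per-position loop stands in for what a real server does in one batched forward pass.

```python
def speculative_step(prefix, draft_next, target_next, k):
    """One round of greedy speculative decoding; returns the new tokens."""
    # 1) The draft model proposes k tokens, one at a time.
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # 2) The target checks every proposed position (batched in practice).
    accepted, ctx = [], list(prefix)
    for tok in proposal:
        expected = target_next(ctx)
        if tok != expected:
            # Keep everything before the mismatch, plus the target's token.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3) All draft tokens matched, so the target's next token is a bonus.
    accepted.append(target_next(ctx))
    return accepted


def target_next(ctx):
    """Toy target model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def draft_next(ctx):
    """Toy draft model: agrees with the target except right after a 3."""
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


partial = speculative_step([0], draft_next, target_next, k=4)
full = speculative_step([5], draft_next, target_next, k=3)
print(partial, full)
```

Every round yields at least one token that the target itself would have produced, which is why the scheme does not fall behind ordinary decoding in tokens per target pass.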

Avi Chawla

157,390 views • 1 month ago

🎉 Congrats to MiniMax (official) on releasing MiniMax M3! Frontier coding and agentic capabilities, native image and video input, computer use, and a 1M-token context window, all in a single open model.

At the heart of M3 is MSA, a new sparse attention architecture: instead of attending densely over the full KV cache, each query scores 128-token KV blocks and runs attention only over the top blocks. That is what makes 1M-token context practical to serve.

M3 runs in vLLM with day-0 support, verified on NVIDIA and AMD hardware:
✨ MSA sparse attention with dedicated prefill and decode kernels
✨ 1M-token context serving with prefix caching and chunked prefill
✨ BF16 and MXFP8 checkpoints, with MoE backends for both Hopper and Blackwell
✨ Native multimodal input (image + video)
✨ Tool calling, reasoning parsing, and thinking-mode control for agent workloads

Day-0 support like this is a true team effort. Grateful to the teams at MiniMax (official), NVIDIA AI, AI at AMD, and Inferact, and to the vLLM community for making it happen. 🙏

Deep dive into the implementation, kernel work, and deployment recipes: 🔗
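The idea of scoring KV blocks and attending only over the top ones can be illustrated with a toy single-query version. This is a sketch of the general pattern, not MSA itself: the post does not say how blocks are scored, so scoring each block by the dot product between the query and the block's mean key is an assumption, and the tiny block size is chosen only to keep the example readable.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def block_sparse_attention(q, keys, values, block_size, top_blocks):
    """Attend over only the highest-scoring KV blocks for one query.
    Block score = q . mean(keys in block), an assumed heuristic."""
    n, dim = len(keys), len(q)
    blocks = [range(s, min(s + block_size, n)) for s in range(0, n, block_size)]

    def block_score(idx):
        mean_key = [sum(keys[i][d] for i in idx) / len(idx) for d in range(dim)]
        return dot(q, mean_key)

    ranked = sorted(range(len(blocks)),
                    key=lambda b: block_score(blocks[b]), reverse=True)
    chosen = sorted(ranked[:top_blocks])
    token_ids = [i for b in chosen for i in blocks[b]]

    # Standard softmax attention, restricted to the selected tokens.
    scores = [dot(q, keys[i]) / math.sqrt(dim) for i in token_ids]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    out = [sum(w * values[i][d] for w, i in zip(weights, token_ids)) / total
           for d in range(len(values[0]))]
    return out, chosen


# Two blocks of four tokens; the query lines up with the second block's keys.
keys = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4
values = [[1.0]] * 4 + [[2.0]] * 4
out, chosen = block_sparse_attention([0.0, 1.0], keys, values,
                                     block_size=4, top_blocks=1)
print(chosen, out)
```

The attention cost per query then depends on `top_blocks * block_size` rather than on the full cache length.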

vLLM

39,959 views • 16 days ago

I actually discussed this on the podcast with Sholto Douglas and Trenton Bricken a year ago (6 months before o1)! "I don't think it's quite only transmitting that one token... during a forward pass, you create these KV values and then future steps attend to the KV values... all of those keys and values are information that you could use in future." "the tokens that you actually see in the chain-of-thought do not necessarily at all need to correspond to the vector representation that the model gets to see when it's deciding to attend to those tokens."

Dwarkesh Patel

57,446 views • 1 year ago

We asked James Wang why Cerebras can run large models so quickly. His answer: Inference speed is mostly a memory problem. Instead of constantly pulling weights from external memory, Cerebras splits the model across multiple wafers and pipelines the layers together. “Inference is all about memory bandwidth.” “You don’t want to store weights external to the wafer, because the second it’s outside, it becomes much slower.” “We put the weights on the inside.” “The team built a new software stack that basically lets us split the models by layer and store them layer by layer across multiple wafers.” “That allows us to never have to read from external memory, and that’s what makes it so fast.” “For context, Claude is about 100 tokens per second.”

MTS

27,867 views • 1 month ago

Michaël van de Poppe says 99% of tokens have no purpose and no value: "NEAR is one of the prime tokens we hold because of the revenue it generates and the growth rate it has." "You can model it for the coming years and build a fuel where the current price is underpriced based on market circumstances, not the token itself." "Even Aave. We hold it, but ecosystem events can still produce a drawdown. That's why allocations stay small and most of the fund runs default trading."

The Rollup

20,913 views • 1 month ago

This is one of the craziest AI launches of 2026 and it came out of basically nowhere (Save this). A company called Subquadratic just shipped SubQ, and the benchmarks are almost hard to believe.

To understand why this is such a big deal, you have to understand the fundamental problem that has defined AI for the last decade. Every large language model in existence is built on transformer architecture, and transformers use a mechanism called standard attention that checks every single word in a sequence against every other word. Double the context length and compute doesn't double, it quadruples; triple it and compute goes up nine times. This quadratic scaling is why frontier models have been stuck at roughly 1 million tokens, why running them at those lengths gets expensive fast, and why the AI labs have essentially been printing money charging you more the longer you need the model to think. The industry has known this problem existed since 2017, but they scaled it anyway.

SubQ is built from the ground up to solve it. Instead of processing every possible token relationship, SubQ's sparse attention architecture identifies which relationships actually matter and ignores the rest, meaning compute is used where it counts and wasted nowhere else. The result is that compute scales linearly with context length instead of quadratically, and the implications of that one architectural shift are enormous. At 12 million tokens, SubQ reduces attention compute by nearly 1,000x compared to standard frontier models, and at 1 million tokens, it runs 52x faster than FlashAttention. And it does all of this while posting frontier-level accuracy, scoring 95% on the RULER 128K long-context benchmark versus Claude Opus 4.6's 94.8%, and 81.8 on SWE-Bench Verified coding tasks, besting Opus 4.6 (80.8) and DeepSeek 4.0 Pro.

The cost comparison is where it gets genuinely insane. SubQ runs at under $1.50 per million tokens, less than 5% of what Claude Opus charges. On the RULER benchmark, running the test with SubQ cost $8 and running the same test with Claude Opus cost $2,600, a roughly 300x cost reduction at equivalent or better accuracy.

Subquadratic launched with $29 million in funding. SubQ is available today for early access via API, and SubQ Code, a coding agent built on the architecture, ships alongside it. The transformer has been the unchallenged foundation of every major AI system since 2017. SubQ is the first serious evidence that something structurally better might have just arrived.
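The scaling claim is easy to check by counting query-key pairs. The sketch below compares dense attention with a scheme where each query looks at a fixed number of keys. The budget of 4,096 keys per query is a hypothetical value chosen for illustration and says nothing about SubQ's actual design.

```python
def dense_pairs(n):
    """Query-key pairs scored by standard attention over n tokens."""
    return n * n


def budgeted_pairs(n, budget):
    """Pairs scored when each query attends to at most `budget` keys."""
    return n * min(n, budget)


BUDGET = 4096  # hypothetical per-query key budget, not a SubQ parameter

double_dense = dense_pairs(2_000) / dense_pairs(1_000)
triple_dense = dense_pairs(3_000) / dense_pairs(1_000)
double_sparse = budgeted_pairs(2_000_000, BUDGET) / budgeted_pairs(1_000_000, BUDGET)

print(f"dense, 2x context:    {double_dense:.0f}x compute")
print(f"dense, 3x context:    {triple_dense:.0f}x compute")
print(f"budgeted, 2x context: {double_sparse:.0f}x compute")
```

With a fixed budget, the saving over dense attention is `n / budget`, so it keeps growing as the context gets longer.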

Milk Road AI

277,585 views • 1 month ago

TOKENFI LAUNCHPAD IS OFFICIALLY LIVE ON MAINNET TokenFi Launchpad is a decentralized launchpad for projects that want to raise funds for their crypto tokens. It is powered by $TOKEN as its main utility token on the BNB and ETH chains, with a 2% fee charged on funds raised by every project, 50% of which is used to buy and burn $TOKEN, making it perpetually deflationary. The very first project to go live on TokenFi Launchpad is the YakDAO token sale, and it is live now. You can find information on how to participate in the YakDAO token sale here:

TokenFi

113,700 views • 2 years ago

The next chapter about transformers is up on YouTube, digging into the attention mechanism: The model works with vectors representing tokens (think words), and this is the mechanism that allows those vectors to take in meaning from context.

Grant Sanderson

810,175 views • 2 years ago

Nvidia CEO Jensen Huang on why engineers will soon be paid in tokens, not just salary: Jensen lays out a future where compute access becomes part of an engineer's compensation package. "I could totally imagine in the future every single engineer in our company will need an annual token budget," he says. He explains how the math would work: "They're going to make a few hundred thousand a year their base pay. I'm going to give them probably half of that on top of it as tokens so that they could be amplified 10x. Of course, we would." According to Huang, this is already changing how companies compete for talent: "It is now one of the recruiting tools in Silicon Valley. How many tokens comes along with my job?" His reasoning is simple: tokens make engineers more productive. As he puts it, "every engineer that has access to tokens will be more productive and those tokens as you know will be produced by AI factories that all of you and us we partner to build." Huang then zooms out to describe how this reshapes the nature of companies themselves: "Every single enterprise company in today sit on top of file systems and data centers. Every single software company of the future will be agentic and they will be token manufacturers. They be token users for their engineers and they'll be token manufacturers for all of their customers."

Big Brain AI

89,922 views • 1 month ago

TWO BOXES THE SIZE OF A MAC MINI JUST RAN A 235 BILLION PARAMETER MODEL ON A DESK It is two NVIDIA DGX Spark units linked by a single cable. A year ago a model this size meant renting a GPU cluster by the hour. Now it sits next to your monitor for around $8,000. Here is the twist most people miss. Linking them does not create one shared 256GB memory pool. The model is split across both boxes, and that is the only reason a 235B model fits at all. It answers at roughly 10 tokens per second, and both chips sit at just 74 degrees while sipping around 50 watts. Every token stays on the desk. Nothing touches a cloud, and nothing leaves the room. The ceiling for what you can run at home just jumped from 70B to 235B. Bookmark this & Watch it run ↓

slash1s

100,849 views • 14 days ago

SITUATION EXPLAINED: Cerebras raised $5.55 billion in its IPO and closed its first day of trading valued at $66 billion, making it the biggest US tech IPO since Snowflake in 2020. Cerebras makes Wafer-Scale Engine chips built for AI inference. We asked Sarah Fong about the main difference between wafer-scale chips and traditional GPUs:
- GPUs are great at parallel work (graphics, training)
- AI inference is sequential, AKA one token at a time
This causes the "memory wall" problem:
- Every GPU core needs model weights, KV cache, and activations to do its math
- On a GPU, that data lives in off-chip memory (HBM)
- Cores constantly load and offload from off-chip memory, which is a huge bottleneck; hardware accounts for ~70% of inference latency
Cerebras' chips:
- Dinner-plate sized (vs. GPUs, which are palm-sized) with tens of thousands of cores
- Memory sits directly on top of the cores as distributed SRAM
- Weights and KV cache can be accessed at on-chip speeds in the PB/s range, compared with off-chip speeds in the TB/s range achieved by GPUs with HBM.

MTS

44,057 views • 1 month ago

Your local AI just got up to 5x more memory. Same model. Same device. Nearly zero accuracy loss. QVAC SDK 0.12.0 integrates TurboQuant - Google Research's latest memory optimisation algorithm. What is TurboQuant? The KV cache is the memory your model uses to track a conversation. As context grows, it fills up fast. 32K tokens. 64K. Game over. TurboQuant compresses it up to 5x with no accuracy loss. What does it unlock for you? Your app had a 16K token ceiling? It's now 96K. On the same device. Just update the QVAC SDK to get up to 5x more efficiency. No code changes. All from one SDK. The TurboQuant integration unlocks sovereign intelligence for more people, on more devices. Learn more →

QVAC

15,798,688 views • 27 days ago

“Add an image for each row and watch it die” Every row loads an image from the network. In-memory cache, no file cache. The URL is random every time (UUID), so the cache should always miss.

Donny Wals 👾

30,682 views • 4 months ago

🚨 BREAKING: THERE ARE RUMORS YOU CAN NOW CREATE "SAFE TOKENS" DIRECTLY ON ETHERVISTADEX What are "Safe Tokens"? "Safe Tokens" are tokens generated through our SafeTokenFactory smart contract. These tokens are designed to eliminate vulnerabilities such as mintable functions or scammy taxes and come with a standardized implementation. Before swapping, users can easily verify whether a token is "safe" or if additional caution is needed. This marks a significant step forward in enhancing the quality of projects launched on Ethervista. But does this compromise the customizability of ERC tokens? Not at all. The Ethervista Protocol smart contract allows for a limitless range of applications. Take the $VISTA contract, for example. It's a standard ERC20 token, but with the Ethervista Protocol smart contract, it automatically buys and burns tokens. Similar logic can be applied to any ERC20 token using EthervistaDEX’s unique Protocol feature. What other features would you like to see? Wen dashboards? Wen streaming? We're on it—we just hired a full-time full-stack engineer! Special shoutout to Bonzi - FIRST MEME and MASCOT @ Ethervista and Clippy - Microsoft Anti AI Helper @ Ethervista, the first whitelisted tokens. We will continue to strongly support tokens that burn part of their liquidity before the 5-day lock period and those with strong communities and utility. A final note to creators: We would like to emphasize that burning lp-tokens does not alter your share of rewards UNTIL you remove, add, or claim rewards, which automatically updates your pool share ratio based on your current balance and the total lp-supply, as outlined in our whitepaper. This DOES NOT affect protocol fees, which are used to support both the protocol and creators.

Ethervista

130,489 views • 1 year ago