The Cost of Intelligence is Heading to Zero | Hyperspace P2P Distributed Cache

We present to you our breakthrough cross-domain work across AI, distributed systems, cryptography, and game theory to solve the primary structural inefficiency at the heart of AI infrastructure: most inference is redundant.

Google has reported that only 15% of daily searches are truly novel. The rest are repeats or close variants. LLM inference inherits this same power-law distribution. Enterprise chatbots see 70-80% of queries fall into a handful of intent categories. System prompts are identical across 100% of requests within an application. The KV attention state for "You are a helpful assistant" has been computed billions of times, on millions of GPUs, identically.

And yet every AI lab, every startup, every self-hosted deployment computes and caches these results independently. There is no shared layer. No global memory. Every provider pays the full compute cost for every query, even when the answer already exists somewhere in the network.

This is the problem Hyperspace solves. The distributed cache operates at three levels, each catching a different class of redundancy:

1. Response cache
Same prompt, same model, same parameters - instant cached response from any node in the network. SHA-256 hash lookup via DHT, with cryptographic cache proofs linking every response to its original inference execution. No trust required. Fetchers re-announce as providers, so popular responses replicate naturally across more nodes.

2. KV prefix cache
Same system prompt tokens - skip the most expensive part of inference entirely. Prefill (computing Key-Value attention states) is deterministic: same model plus same tokens always produces identical KV state. The network caches these states using erasure coding and distributes them via the routing network. New questions that share a common prefix resume generation from cached state instead of recomputing from scratch.

3. Routing to cached nodes
Instead of transferring KV state across the network for every request, Hyperspace routes the request to the node that already has the state loaded in VRAM. The request goes to the cache, not the cache to the request.

Together, these three layers mean that 70-90% of inference requests at network scale never require full GPU computation.

This work doesn't exist in isolation. It builds on research from across the industry: SGLang's RadixAttention demonstrated that automatic prefix sharing can yield up to 5x speedup on structured LLM workloads. Moonshot AI's Mooncake built an entire KV-cache-centric disaggregated architecture for production serving at Kimi. Anthropic, OpenAI, and Google all launched prompt caching products in 2024 - priced at 50-90% discounts - because system prompt reuse is so pervasive that it changes the economics of inference.

What all of these systems share is a common limitation: they operate within a single organization's infrastructure. SGLang caches prefixes within one server. Mooncake disaggregates KV cache within one datacenter. Anthropic's prompt caching works within one API provider's fleet. None of them can share cached state across organizational boundaries.

Hyperspace removes this boundary. The cache is global. A response computed by a node in Tokyo is immediately available to a node in Berlin. A KV prefix state generated for Qwen-32B on one machine is verifiable and reusable by any other machine running the same model. The routing network provides the delivery guarantees, the erasure coding provides the redundancy, and the cache proofs provide the trust.

What this means for the cost of intelligence

Big AI labs scale linearly: twice the users means twice the GPU spend. Every query is a cost center. Their internal caching helps, but it's siloed - Lab A's cache can't serve Lab B's users, and neither can serve a self-hosted Llama deployment.

Hyperspace scales sub-linearly. Every new node that joins the network adds to the global cache. Every inference result enriches the cache for all future requests. The cache hit rate rises with network size because query distributions follow a power law - the most common questions are asked exponentially more often than rare ones.

The implication is simple: as the network grows, the effective cost per inference drops. Not linearly. Logarithmically. At 10 million nodes, we estimate 75-90% of all inference requests can be served from cache, eliminating 400,000+ MWh of energy consumption per year and avoiding over 200,000 tons of CO2 emissions.

The first person to ask a question pays the compute cost. Everyone after them gets the answer for free, with cryptographic proof that it's authentic.

Training is competitive. Inference is shared

Open-weight models are converging on quality with closed models. Labs will continue to differentiate on training - data curation, architecture innovation, RLHF tuning. That's where the real intellectual property lives.

But inference is a commodity. Two copies of Qwen-32B running the same prompt produce the same KV state and the same response, byte for byte, regardless of whose GPU runs the matrix multiplication. There is no moat in multiplying matrices. The moat is in training the weights.

A global distributed cache makes this separation explicit. It doesn't matter who trained the model. Once the weights are open, the inference cost approaches zero at scale - because the network remembers every answer and can prove it's correct.

No lab, no matter how well-funded, can match this. They cannot share caches across competitors. They scale linearly. The network scales logarithmically. The marginal cost of intelligence approaches zero.

That's the endgame.
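
The response cache described in the post (same prompt, same model, same parameters, found by SHA-256 lookup) can be sketched as follows. This is only an illustration under stated assumptions: the key layout, the `ResponseCache` class, and the `run_inference` callback are hypothetical names, a plain dict stands in for the DHT, and cache proofs and replication are not modelled.

```python
import hashlib
import json


def cache_key(model: str, prompt: str, params: dict) -> str:
    """Hash the full request so identical requests map to the same key."""
    # Canonical JSON (sorted keys, fixed separators) so that the order of
    # parameters in the dict does not change the hash.
    canonical = json.dumps(
        {"model": model, "prompt": prompt, "params": params},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ResponseCache:
    """In-memory stand-in for the network-wide lookup (a dict, not a real DHT)."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, model, prompt, params, run_inference):
        key = cache_key(model, prompt, params)
        if key in self._store:
            # Someone already paid for this exact request.
            self.hits += 1
            return self._store[key]
        # First request with this key: run the model and remember the result.
        self.misses += 1
        response = run_inference(prompt)
        self._store[key] = response
        return response
```

An exact-match key like this only helps when sampling is reproducible for the given parameters; any difference in model name, prompt text, or parameters produces a different hash and therefore a miss.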

Varun

35,587 subscribers

37,272 views • 3 months ago • via X (Twitter)

Related videos

New course: Efficient Inference with SGLang: Text and Image Generation, built in partnership with LMSYS Org and RadixArk, and taught by Richard Chen, a Member of Technical Staff at RadixArk.

Running LLMs in production is expensive, and much of that cost comes from redundant computation. This short course teaches you to eliminate that waste using SGLang, an open-source inference framework that caches computation already done and reuses it across future requests. When ten users share the same system prompt, SGLang processes it once, not ten times. The speedups compound quickly, especially when there's a lot of shared context across requests.

Skills you'll gain:
- Implement a KV cache from scratch to eliminate redundant computation within a single request
- Scale caching across users and requests with RadixAttention, so shared context is only processed once
- Accelerate image generation with diffusion models using SGLang's caching and multi-GPU parallelism

Join and learn to make LLM inference faster and more cost-efficient at scale!
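
The "ten users, one system prompt" claim can be illustrated with a toy prefix cache. This is a sketch under assumptions, not SGLang's implementation: `PrefixCache` is a hypothetical class, whitespace-split words stand in for tokens, and each trie node merely marks a prefix as already processed instead of holding real KV tensors.

```python
class PrefixCache:
    """Toy token trie: each node marks a prefix whose state is already 'cached'."""

    def __init__(self):
        self.root = {}

    def process(self, tokens):
        """Walk the trie and return how many tokens had to be newly computed."""
        node = self.root
        computed = 0
        for tok in tokens:
            if tok not in node:
                # Prefix not seen before: this token costs real computation.
                node[tok] = {}
                computed += 1
            node = node[tok]
        return computed


# Ten requests that share one six-token system prompt (hypothetical data).
system_prompt = "You are a helpful assistant .".split()
questions = [f"q{i} ?".split() for i in range(10)]

cache = PrefixCache()
costs = [cache.process(system_prompt + q) for q in questions]

total_with_cache = sum(costs)
total_without_cache = sum(len(system_prompt + q) for q in questions)
```

In this toy run the first request pays for all eight tokens and each later request pays only for its two question tokens, so the shared prompt is processed once. A production system also has to bound memory and evict old prefixes, which this sketch leaves out.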

Andrew Ng

97,822 views • 2 months ago

I added KV caching and INT8 KV quantization to our transformer inference, improving throughput by 35x. All of this was done from scratch in Rust + CUDA, on top of a homemade ML framework. On a 4-token prompt with 252 generated tokens: - Original: 0.76 tok/s - KV cache fp32: 27.21 tok/s - KV cache int8 (quantized): 27.29 tok/s Try it out yourself here: In practice: - KV caching gave us about a 35x end-to-end speedup - INT8 KV cache kept roughly the same speed as fp32 but cut KV cache memory by 3.78x FP32 cache used 4.5 MB in this run while the INT8 cache used only 1.19 MB This simple change to inference created a huge impact on performance. To learn more about the KV cache and other optimizations like this, check out the blog at
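
The figures in the post can be checked with a few lines of arithmetic, followed by a sketch of symmetric INT8 quantization. The quantization scheme shown (a single scale per vector, values mapped to [-127, 127]) is an assumption for illustration and not necessarily what the author's Rust + CUDA code does; `quantize_int8` and `dequantize_int8` are hypothetical helper names.

```python
# Numbers quoted in the post.
baseline_tok_s = 0.76
fp32_tok_s = 27.21
int8_tok_s = 27.29
fp32_cache_mb = 4.5
int8_cache_mb = 1.19

speedup_fp32 = fp32_tok_s / baseline_tok_s    # about 35.8x
speedup_int8 = int8_tok_s / baseline_tok_s    # about 35.9x
memory_ratio = fp32_cache_mb / int8_cache_mb  # about 3.78x


def quantize_int8(values):
    """Symmetric quantization: one float scale plus integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(v / scale) for v in values], scale


def dequantize_int8(quantized, scale):
    """Recover approximate floats from the integers and the shared scale."""
    return [q * scale for q in quantized]
```

Storing one byte per value instead of four would give a 4x saving at most; stored scales or other metadata would be one plausible reason the measured ratio is slightly lower at 3.78x, though the post does not say.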

Reese Chong

52,588 views • 2 months ago

SITUATION EXPLAINED: Cerebras raised $5.55 billion in its IPO and closed its first day of trading valued at $66 billion, making it the biggest US tech IPO since Snowflake in 2020.

Cerebras makes Wafer-Scale Engine chips built for AI inference. We asked Sarah Fong the main difference between wafer-scale chips and traditional GPUs:
- GPUs are great at parallel work (graphics, training)
- AI inference is sequential, AKA one token at a time

This causes the "memory wall" problem:
- Every GPU core needs model weights, KV cache, and activations to do its math
- On a GPU, that data lives in off-chip memory (HBM)
- Cores constantly load and offload from off-chip memory, which is a huge bottleneck; hardware accounts for ~70% of inference latency

Cerebras' chips:
- Dinner-plate sized (vs. GPUs which are palm-sized) with tens of thousands of cores
- Memory sits directly on top of the cores as distributed SRAM
- Weights and KV cache can be accessed at on-chip speeds in the PB/s range, compared with off-chip speeds in the TB/s range achieved by GPUs with HBM.
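
The "memory wall" argument can be made concrete with a rough upper bound: if every generated token requires reading all model weights once, decode speed cannot exceed memory bandwidth divided by the bytes read per token. The sketch below uses hypothetical numbers (a 70B-parameter model at 2 bytes per weight, 3 TB/s for HBM, 1 PB/s for on-chip SRAM), none of which come from the post. It ignores batching, KV cache reads, compute time, and whether the weights would fit on-chip at all.

```python
def max_decode_tokens_per_s(bytes_read_per_token: float,
                            bandwidth_bytes_per_s: float) -> float:
    """Bandwidth-bound ceiling on single-stream decode speed."""
    return bandwidth_bytes_per_s / bytes_read_per_token


# Hypothetical inputs, chosen only to illustrate TB/s versus PB/s.
weights_bytes = 70e9 * 2      # 70B parameters at 2 bytes each
hbm_bandwidth = 3e12          # 3 TB/s, off-chip
sram_bandwidth = 1e15         # 1 PB/s, on-chip

gpu_bound = max_decode_tokens_per_s(weights_bytes, hbm_bandwidth)
wafer_bound = max_decode_tokens_per_s(weights_bytes, sram_bandwidth)
ratio = wafer_bound / gpu_bound
```

With these inputs the ceiling is roughly 21 tokens per second at HBM speed and roughly 7,100 at on-chip speed, a gap equal to the bandwidth ratio itself.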

MTS

44,057 views • 1 month ago

Marc Brooker (AWS Distinguished Eng): "The downside of caches, especially in distributed systems, is they have this mode, where the cache is empty or contains the wrong data. The system is slow, often down, because now the backend isn't scaled to deal with all of this uncached traffic. Customers are very disappointed and often it is down in a stable way. Like it's still it's down, but it's not going to come back up under its own energy. Because, for example, all of this traffic is causing a huge amount of contention in my database or is saturating the network and so I can't even refill the cache. It's not even getting the right kind of data in In general, I prefer to see the teams around me avoiding caching where possible." Marc Brooker
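
The failure mode in the quote comes down to simple arithmetic: the backend only sees the traffic that misses the cache, so a backend sized for the warm-cache load can be far too small once the cache is empty. The numbers below are hypothetical and the function names are illustrative.

```python
def backend_load(request_rate: float, hit_rate: float) -> float:
    """Requests per second that fall through the cache to the backend."""
    return request_rate * (1.0 - hit_rate)


def is_overloaded(request_rate: float, hit_rate: float, capacity: float) -> bool:
    """True when the uncached traffic exceeds what the backend can serve."""
    return backend_load(request_rate, hit_rate) > capacity


# Hypothetical system: 10,000 req/s, a backend sized for 1,000 req/s.
request_rate = 10_000
backend_capacity = 1_000

warm = backend_load(request_rate, hit_rate=0.95)  # about 500 req/s
cold = backend_load(request_rate, hit_rate=0.0)   # the full 10,000 req/s
```

At a 95% hit rate the backend runs at about half its capacity; with an empty cache it faces ten times its capacity. If an overloaded backend cannot answer the requests that would refill the cache, the hit rate stays near zero, which is the stable "down" state the quote describes.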

Ryan Peterman

136,399 views • 2 months ago

The Way Forward & A New Chapter For AI powered by GamerHash AI & $GHX. Welcome to the future of bleeding edge connectivity and cloud infrastructure ⚡ GamerHash AI is the answer to the billion dollar question on how to offset the GPU scarcity, enable exponential growth of AI through utilizing consumer-level GPUs and establishing a powerful sharing economy by sourcing a global network of gamers. The synergy of our GPU #DePIN and blockchain for #AI #inference is the technology of the future that already works today. See you in the network!

GamerHash AI

1,866,164 views • 1 year ago

It is humbling to watch how people are connecting to the Hyperspace network from every corner of the world. Right now, there are over 30,000 live connected nodes ready to serve AI to the world. This is madness! Thank you to everyone running a node 🫡 PS: This is after recent stability fixes on both the network and the points system. We intend to do better.

Varun

28,129 views • 1 year ago

I pay Claude $20 a month. Most $TAO holders do too. There is a stack you can build in 15 minutes that fixes that completely. It runs on Bittensor. It costs $10. You do not write a single line of code. Here is how every AI chat product actually works under the hood. Three layers. Always three. The model. The brain. GPT, Claude, DeepSeek, Kimi, GLM. The inference layer. The GPU that runs the model when you hit send. The interface. The chat box you actually look at. ChatGPT and Claude bundle all three and hand you the result. You cannot change the model. You cannot change the inference. The interface is non-negotiable. Every prompt you type goes to a server run by a private company whose terms of service can quietly change next month. The anti-ChatGPT move is to pick each layer yourself. This is where $TAO comes in. Chutes is Subnet 64 on Bittensor. It is the inference layer. Open source models like DeepSeek, Kimi, GLM, and Llama get served by a global network of miner-operated GPUs. Validators score the output quality. The best inference wins the emissions. You hit send. A miner somewhere runs your prompt. You get the answer back. The TAO you hold is in part paying for the GPU you just used. The basic stack is one URL. chutes. ai/chat No account. No API key. No setup. Switch models mid-conversation. Web search built in. Image generation. File uploads. Free. The advanced stack is Chutes plus TypingMind. One-time license. No recurring fee. Plugins, agents, custom personas, a prompt library you build over months. Full model switching between Chutes, OpenAI, and Anthropic from the same window. Total cost: $10 a month to Chutes for inference. That $10 buys you $50 in actual usage. But here is the signal most people missed inside this story. Chutes ran a free tier until February. Then they killed it. Then they raised the minimum to $10 in May. Most people saw that as bad news. It is the opposite. Free things on the internet do not last. Real products do. 
Chutes is becoming a real product. A subnet that generates actual revenue from actual users paying actual money for actual AI inference. That is what $43 million in Q1 network revenue looks like at the individual subnet level. And there is one more thing ChatGPT and Claude cannot offer that Chutes already has. Trusted Execution Environments. Your prompt gets encrypted on your device, shipped to a confidential compute GPU, and the lock only breaks inside the chip. The miner running the model physically cannot read your prompt. ChatGPT cannot promise that. Claude cannot promise that. Bittensor already built it. You are holding a network where the subnets are generating real revenue, shipping real privacy infrastructure, and replacing $20 a month centralised subscriptions with $10 a month decentralised inference. The people who use the product always understand the investment better than the people who only watch the price.

2xnmore

26,871 views • 1 month ago

“Add an image for each row and watch it die” Every row loads an image from the network. In memory cache, no file cache. URL is random every time (uuid) so cache should always be a miss.
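
Why a random URL per row defeats a URL-keyed cache can be shown with a small in-memory example. The original context is presumably an iOS app; Python is used here only for illustration, and `ImageCache`, `fake_fetch`, and the example.com URLs are hypothetical.

```python
import uuid


class ImageCache:
    """In-memory cache keyed by URL, with hit and miss counters."""

    def __init__(self):
        self._images = {}
        self.hits = 0
        self.misses = 0

    def load(self, url, fetch):
        if url in self._images:
            self.hits += 1
            return self._images[url]
        # Unknown URL: go to the "network" and remember the result.
        self.misses += 1
        image = fetch(url)
        self._images[url] = image
        return image


def fake_fetch(url):
    """Stand-in for a network request; returns placeholder bytes."""
    return b"image-bytes-for-" + url.encode("utf-8")


cache = ImageCache()

# A fresh UUID in every URL means no key is ever requested twice.
for _ in range(100):
    cache.load(f"https://example.com/image/{uuid.uuid4()}", fake_fetch)
random_hits, random_misses = cache.hits, cache.misses

# Stable URLs requested a second time are served from memory.
stable_urls = [f"https://example.com/image/{i}" for i in range(100)]
for url in stable_urls + stable_urls:
    cache.load(url, fake_fetch)
stable_hits = cache.hits - random_hits
```

Every one of the random-URL requests is a miss and leaves an entry behind that will never be read again, so the cache only adds memory pressure in that test.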

Donny Wals 👾

30,682 views • 4 months ago

70,000 Phones, One AI Agent — The World's Largest Edge AI Fleet Runs on Hermes We turned 70,000 phones into a shared AI compute network. Any device owner contributes idle compute. Any developer taps distributed inference at a fraction of cloud cost. Not a concept. Not a whitepaper. 70K devices online today. The problem: orchestrating a shared network of heterogeneous edge devices — different chipsets, different memory, different thermal profiles, different owners — is a coordination nightmare no human team can handle manually. So we gave the network a brain: Nous Research Hermes Agent. Hermes connects to 16 MCP servers and runs 24/7: 🔬 Research Loop — Tracks every breakthrough in on-device inference: quantization (GPTQ/AWQ/GGUF), speculative decoding on mobile SoCs, federated learning protocols. Auto-imports papers into NotebookLM. 36 research topics, zero manual curation. 🌐 Network Intelligence — Monitors device availability, compute capacity, and workload distribution across the shared fleet. Surfaces bottlenecks before they cascade. 🧬 Tech Tree Optimizer — Maps the full optimization frontier: from KV-cache compression to on-device LoRA to peer-to-peer model sharding. Hermes autonomously identifies which research paths unlock the most network-wide throughput gains. The result: a self-improving shared compute network. Research compounds daily. The fleet gets smarter without human intervention. Cloud AI scales with money. We scale with people. #HermesHackathon Teknium 🪽 Delphi Digital Tommy

Oyster Republic 🦪📲🦞👓

20,703 views • 3 months ago

Gavin Baker says the disaggregation of inference can extend GPU useful lives from 3-4 years to 10-15. That may single-handedly save private credit and reduce the financing rates for GPUs, which will drive demand and help finance the build-out.

"The disaggregation of prefill and inference is going to be amazing for the useful lives of GPU and may single-handedly save private credit. Private credit is in pain from these SaaS loans. But there's a lot of private credit in GPUs too. They were underwriting that to 3-4. The disaggregation of inference means that these GPUs are going to have 10 or 15-year lives. The AI skeptics are like, 'Oh, these companies are all cooking their books. The useful life of a GPU is only a year or two. The useful life of a CPU is only four years because the rapid technological change.' No. What rapid technological change has done with the disaggregation of prefill and inference is you can put a Cerebras system or Groq LPUs effectively in front of a Hopper or even an Ampere, use that Hopper and Ampere for prefill, and extend the useful life of that GPU until it melts. This is going to be really good for the whole private credit industry. It's gonna help finance the AI build-out. Because if you can start to finance GPUs at 5% or 6% instead of – I think CoreWeave's lowest financing was low sevens – that actually mathematically changes the cost to finance this build-out."

Invest Like the Best

206,845 views • 1 month ago

Think of OptimAI as a living network - hundreds of thousands of nodes sensing the web, computing knowledge, verifying truth.

Lite Nodes → surface signals & activity
Edge Nodes → compute at the edge, background mining
Core Nodes → crawl, embed, index, inference, validate & serve workloads

Distributed across 179 countries, approaching 800,000 total installations and more than 6,000 Core Nodes online. Every node strengthens collective intelligence. Every contribution expands the data layer powering Agentic AI. This is AI not controlled, but co-created. Built for the People, by the People. Start your node and take part in the next era of intelligence:

OptimAI Network

19,365 views • 6 months ago

Chamath said AI is not like the internet. Every new user costs real money. And the infrastructure making it possible was built by everyone. His argument was the clearest case for government ownership of AI labs I have ever heard. And it had nothing to do with Bernie Sanders. Start with the internet comparison. Google and Facebook became the most profitable companies in human history because of one number. The marginal cost of adding a new user was effectively zero. One more search query cost Google nothing. One more Facebook profile cost Meta nothing. They could serve a billion people and the incremental cost of that billion person was rounding error. That is the money printer. Infinite scale at zero marginal cost. AI breaks that model completely. Every single user taxes a GPU. Every query costs electricity. Every response requires memory and compute. The marginal cost of AI is real, significant, and does not disappear at scale. You cannot print money the same way. Then Chamath made the point that landed hardest. The infrastructure these companies depend on, the power grid, the land, the data centers, the permitting, the national security apparatus that protects their chips from being stolen, none of that was built by Anthropic or OpenAI. It was built by the public. By taxpayers. By decades of government investment in the physical and legal foundation these companies are now running on. He compared it to the interstate highway system. If the federal government built the roads and two companies transported all the goods on them, a logical question at that point would be how much of that should I own? You are riding on my rails. His conclusion was direct. If he were running a sovereign wealth fund and had the negotiating leverage of the US government, he would own 75% of these companies when he was done. The internet had zero marginal cost. That is why the founders captured almost all of the value. AI has real marginal cost and runs on public infrastructure. 
That changes who has a claim on what gets built. WATCH THE FULL PODCAST ON The All-In Podcast

Ihtesham Ali

78,878 views • 12 days ago

Verifiable computing is the key. One node does the complex work. The entire network verifies it in a blink—for a fraction of the cost. This is the magic of ZK proofs, and the foundation of Brevis. Infinite scalability is here.

Brevis

101,539 views • 8 months ago

If intelligence is the log of compute… it starts with a lot of compute! And that’s why we’re scaling our GPU fleet faster than anyone else. Just last year, we added over 2 gigawatts of new capacity – roughly the output of 2 nuclear power plants. And today we’re going further, announcing the world's most powerful AI datacenter, located in southeastern Wisconsin. Fairwater is a seamless cluster of hundreds of thousands of NVIDIA GB200s, connected by enough fiber to circle the Earth 4.5 times. It will deliver 10x the performance of the world’s fastest supercomputer today, enabling AI training and inference workloads at a level never before seen. For AI training workloads, you need compute at exponential scale. That’s why we designed the datacenter, GPU fleet, and network together as one integrated system. This ensures a single job can run from day 1 at exponential scale across thousands of GPUs. Fairwater uses a liquid-cooled closed-loop system for cooling GPUs that requires zero water for operations after construction. And we’re matching all of the energy that is consumed with renewable sources. And of course, it is just one of several similar sites we’re lighting up across our 70+ regions. We have multiple identical Fairwater datacenters under construction in other locations across the US, in addition to our AI infrastructure already deployed in over 100 datacenters around the world, powering model training, test-time compute, RL tuning, and real-time inference at global scale. Too often during times like this, people go with the current and only later wonder, how did we get here? With Fairwater, we're charting a new path: doing the hard engineering work, bringing compute, network, and storage into one highly scaled cluster, and designing closed-loop energy systems to meet real-world computing needs. And partnering with local communities to ensure it's thoughtfully done in a way that is sustainable, creates new jobs, and expands opportunity. 
We are thrilled to see this take hold in Wisconsin, and we are just getting started.

Satya Nadella

2,019,532 views • 9 months ago

Every AI product running in production depends on an inference system someone had to engineer, optimize and more. Scheduling, batching, routing, cost per token. This is a craft. The Inference Frontier Program spotlights the builders behind that work. 💡 Watch the video and nominate a team:

Nebius

54,224 views • 3 months ago

Destra Edge — $DSYNC From Idea to Reality Yesterday, we introduced Destra Edge. Today, we show you what it means. For too long, AI has lived in data centers. Expensive. Centralized. Controlled by a few. Meanwhile, billions of smartphones sit idle. Powerful GPUs. Doing nothing. That doesn’t make sense. So we’re building something different. Destra Edge turns everyday smartphones into a global AI inference network. No servers. No cloud monopolies. Just real compute at the edge. This video shows the system taking shape — how inference is distributed, how it runs on-device, and how results come together without central control. This is not finished. And that’s the point. You’re watching it being built. Layer by layer. In public. The magic is in the scale. More phones means more power. More power means stronger AI. The network grows simply because people join. No passive mining. No fake activity. Only real GPU inference. Measured. Verified. Rewarded. AI shouldn’t live in warehouses. It should live in your pocket. This is edge-level decentralization. This is permissionless inference. This is Destra Network.

Destra Network

20,043 views • 5 months ago

Xahau is the blockchain of choice powering payments between Europe and Africa. The first stop is Ethiopia, where TerraPay is using the power of the network for instant settlement. It's only a matter of time before Xahau is used across the continent, to help deliver the same speed, scale and cost advantages.

Xahau Network

11,210 views • 1 month ago

after turboquant and qwen3.5-35b-a3b, i got curious: how realistic is it to use kv cache as a document store today? to have vectorless, RAG-less search. so i prefilled 258K out of 262K context window on L4 (a budget GPU popular in prod). ~99% of the slot is pre-computed and stored, users load it on the fly in ~1s. system prompt + query append to the end, generation takes ~3K tokens, enough for search. at 99% fill rate, decoding runs ~20 tps on L4. i prepared some ego datasets (jina papers, which i know best), plus popular novels in chinese and english. the results are actually pretty good. some hallucination, but most answers are solid and well-grounded. what's more interesting is the cost: ~$0.26/h on L4 spot. single LLM. no vector database, no embedding model, no workflow/pipeline engineering. using kv cache as document store is nothing new, like the old CAG paper. but with quantized kv cache and modern attention (hybrid SSM-attention, GQA, MQA, MLA), the economics are changing fast. if we solve cold-prefill speed and decoding speed, and budget GPU costs keep dropping, the future of search could be vectorless. radical, but possible.
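
The figures in the post can be checked with a little arithmetic. The following is a rough sketch, not the author's own calculation: it rounds "258K" and "262K" to 258,000 and 262,000 tokens, and it assumes a single query uses the full ~3K-token generation budget and has the GPU to itself for the whole decode. The function names are made up for illustration.

```python
# Back-of-envelope check of the numbers quoted in the post.
# All constants are approximate values read off the post; the rounding of
# "258K"/"262K" and the one-query-per-GPU assumption are ours, not the author's.

CONTEXT_WINDOW = 262_000     # "262K context window" (rounded)
PREFILLED = 258_000          # "prefilled 258K" (rounded)
GEN_TOKENS = 3_000           # "generation takes ~3K tokens"
DECODE_TPS = 20              # "decoding runs ~20 tps on L4"
GPU_COST_PER_HOUR = 0.26     # "~$0.26/h on L4 spot"


def fill_rate(prefilled: int, window: int) -> float:
    """Fraction of the context window occupied by the precomputed KV cache."""
    return prefilled / window


def decode_seconds(tokens: int, tokens_per_second: float) -> float:
    """Wall-clock time to decode `tokens` at a steady decoding rate."""
    return tokens / tokens_per_second


def cost_usd(seconds: float, usd_per_hour: float) -> float:
    """GPU rental cost for `seconds` of exclusive use."""
    return seconds / 3600 * usd_per_hour


headroom = CONTEXT_WINDOW - PREFILLED                # tokens left for prompt + answer
fill = fill_rate(PREFILLED, CONTEXT_WINDOW)          # about 0.985, i.e. the post's "~99%"
seconds = decode_seconds(GEN_TOKENS, DECODE_TPS)     # 150 seconds for a full 3K-token answer
per_query = cost_usd(seconds, GPU_COST_PER_HOUR)     # roughly one cent per worst-case query

print(f"headroom: {headroom} tokens, fill rate: {fill:.1%}")
print(f"decode time: {seconds:.0f} s, cost per query: ${per_query:.4f}")
```

Under these assumptions a worst-case answer takes about two and a half minutes to decode and costs on the order of one cent of GPU time, which is why the post singles out decoding speed as something still to be solved.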

Han Xiao

42,344 views • 3 months ago

Today we announced our new Fairwater datacenter in Atlanta, connected with our first Fairwater site in Wisconsin and our broader Azure footprint to create the world’s first AI superfactory. Fairwater exemplifies our vision for a fungible fleet: infra that can serve any workload, anywhere, on fit-for-purpose accelerators and network paths, with maximum performance and efficiency. AI workloads have evolved beyond large-scale pre-training. Today, they encompass fine-tuning, reinforcement learning (RL), synthetic data generation, evaluation pipelines, and more. Fairwater is built to support this full lifecycle: Max density: Fairwater’s two-story design and liquid cooling system lets us place racks in three dimensions and pack them with GPUs as densely as possible, minimizing cable runs and improving latency and effective bandwidth. Fleet: Each Fairwater DC can integrate hundreds of thousands of the latest NVIDIA GPUs into a single coherent cluster. This provides flexible infra that can support the full spectrum of workloads, and ensure no GPU is left unnecessarily idle. And that’s on top of the more than 100,000 GB300s coming online this quarter alone for inference across the rest of our fleet. For us, it’s all about turning every gigawatt into the maximum number of useful tokens. Not every GW is created equal! Planet-scale: Every Fairwater DC will connect through our continent-spanning AI WAN to prior generations of AI supercomputers, forming a truly fungible pool of compute. This enables developers to scale beyond the capacity of a single site and dynamically land workloads on the right infra for their needs. 
Together, these innovations let us bring together different generations of silicon and AI systems across DCs and geos into a single elastic system that scales seamlessly across training and inference workloads. And this elastic AI capacity is all available alongside all the other cloud services (compute, storage, databases, app services) that AI agents and workloads need. This is what we mean when we talk about building a fungible fleet – a single, unified platform that pushes the limits of performance per watt and per dollar. Read more:

Satya Nadella

907,214 views • 7 months ago

$NBIS cofounder Roman Chernin describes how their recent acquisitions of Eigen AI and Clarifai were all about speed, incredible talent, and acceleration: "The philosophy is very simple. We need to build so many things, and we need to move so fast, that we're always looking for people who can accelerate us. It should be exceptional talent, and/or something that has a great adoption." "Our two recent acquisitions [were] two teams that work on inference optimization. A big part of our business is how efficiently we convert GPUs into tokens. And these two teams — Eigen AI and Clarifai — one is focused on model optimization, the engine of inference. How you run specific models and all the techniques around spec decoding, quantization, and so on." "And the other is system optimization. All the routing, KV caching, and orchestration across the big cluster of compute and so on." "We have a very strong internal team working on inference. But we felt that we needed to move faster, bring more capabilities. Because the market is so fast."

TBPN

30,486 views • 1 month ago