Running a single deep coding model at max context on Cerebras requires 24 systems ($24M Capex) just to support 256 concurrent users. At that scale, $100M gets you way more memory bandwidth in standard GB300 racks.

SemiAnalysis

109,405 subscribers

93,448 views • 21 days ago • via X (Twitter)

Education Science & Technology
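The capex figures in the SemiAnalysis post above can be checked with simple arithmetic. The sketch below uses only the numbers quoted there (24 systems, $24M, 256 concurrent users). The function names are illustrative, and the linear extrapolation to a $100M budget is an assumption made here, not a claim from the post.

```python
# Figures quoted in the post.
SYSTEMS = 24
CAPEX_USD = 24_000_000
CONCURRENT_USERS = 256


def capex_per_system(capex_usd: float, systems: int) -> float:
    """Average capital cost of one system."""
    return capex_usd / systems


def capex_per_user(capex_usd: float, users: int) -> float:
    """Capital cost attributable to each concurrent user."""
    return capex_usd / users


def users_for_budget(budget_usd: float, capex_usd: float, users: int) -> float:
    """Concurrent users a budget would support.

    Assumption: cost scales linearly with users at the quoted ratio.
    """
    return budget_usd * users / capex_usd


if __name__ == "__main__":
    print(f"per system: ${capex_per_system(CAPEX_USD, SYSTEMS):,.0f}")
    print(f"per user:   ${capex_per_user(CAPEX_USD, CONCURRENT_USERS):,.0f}")
    print(f"users at $100M: {users_for_budget(100_000_000, CAPEX_USD, CONCURRENT_USERS):.0f}")
```

At the quoted ratio this works out to $1M per system and $93,750 of capex per concurrent user.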

Related Videos

Cerebras Code: 20x faster than Claude, 1x the price Today we are launching two monthly coding plans: ➡️Cerebras Code Pro: $50/m – for indie developers ➡️Cerebras Code Max: $200/m – for power users with 5x rate limits Both plans get: Qwen3-Coder at 2,000 tokens/s, 131K context, and no weekly limits. Sign up now:

Cerebras

461,093 views • 10 months ago

New course on serving LLMs efficiently -- how do you serve models to many concurrent users at low latency and reasonable cost? This short course is built with Red Hat and taught by Cedric Clyburn. Efficient LLM serving requires efficient memory management. A 70B-parameter model takes ~140 GB just to load the weights. On top of that, every active request needs its own chunk of GPU memory, the KV cache, to store the token context it has built up so far. In this course, you'll learn to reduce a model's memory footprint with quantization and serve it using vLLM, which handles many concurrent requests efficiently through smart memory management. Skills you'll gain: - Quantize a model and measure the accuracy tradeoff - Serve a model with vLLM and watch it handle concurrent requests efficiently - Benchmark your deployment and make informed tradeoffs between speed, cost, and accuracy Join and learn to serve LLMs efficiently:

Andrew Ng

109,881 views • 14 days ago
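The memory figures in the course description above follow from simple arithmetic. A minimal sketch: the ~140 GB figure corresponds to 70B parameters stored at 16 bits each, and the KV cache grows per token with the number of layers and key/value heads. The model shape used for the KV cache (80 layers, 8 KV heads, head dimension 128) is a hypothetical configuration chosen for illustration, not something stated in the post.

```python
def weight_memory_gb(num_params: int, bits_per_param: int) -> float:
    """Memory needed just to hold the weights, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    """Bytes of KV cache one token occupies: a key and a value per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


if __name__ == "__main__":
    params = 70_000_000_000
    for bits in (16, 8, 4):  # 16-bit baseline, then two quantized widths
        print(f"{bits:>2}-bit weights: {weight_memory_gb(params, bits):.0f} GB")

    # Hypothetical model shape, for illustration only.
    per_token = kv_cache_bytes_per_token(layers=80, kv_heads=8, head_dim=128)
    context = 8192
    print(f"KV cache per request at {context} tokens: "
          f"{per_token * context / 1e9:.2f} GB")
```

Quantizing to 8 or 4 bits halves or quarters the weight footprint, which frees memory for the KV cache of more concurrent requests.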

DS4 running on DGX Spark (GB10 / CUDA), private branch for now. 12 tokens/sec; memory bandwidth is the limit on this system, at 270GB/sec. But prefill is much more in line with the M3 Max, at ~200 t/s. I'll release when it's more mature, but it is almost certain to get merged.

antirez

83,448 views • 1 month ago
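The link between memory bandwidth and decode speed in the DGX Spark post above can be sketched as a back-of-envelope bound. It assumes decoding is purely bandwidth-bound, so each generated token costs one pass over some fixed number of bytes. That is a simplification, and the function names are illustrative.

```python
def implied_gb_per_token(bandwidth_gb_s: float, tokens_per_s: float) -> float:
    """Upper bound on GB read per token if decode were purely bandwidth-bound."""
    return bandwidth_gb_s / tokens_per_s


def max_tokens_per_s(bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Bandwidth-bound ceiling on decode speed for a given per-token read size."""
    return bandwidth_gb_s / gb_read_per_token


if __name__ == "__main__":
    # Figures from the post: 270 GB/s of bandwidth, 12 tokens/s observed.
    print(f"{implied_gb_per_token(270, 12)} GB read per token at most")
```

With the quoted figures, each token can touch at most 22.5 GB of memory, which is why decode speed tracks bandwidth rather than compute on this system.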

🎉 Day-0 support for in vLLM, available today in v0.23.0! Congrats to Z.ai on GLM-5.2, a flagship model built for long-horizon coding agents. ✨ 1M-token context, built to hold project-scale engineering work in a single run ✨ Tuned for long-horizon coding: large-scale implementation, automated research, and performance optimization ✨ One task can carry a full dev workflow, from requirements to a deployable product across platforms ✨ Client-side and mobile engineering, including an on-device debugging loop Try it out running it on vLLM today: 🔗

vLLM

33,036 views • 2 days ago

$200/month gets you the Ferrari of vibe coding. Dozens of tool calls, instantly handled with 1.5M TPM on Cerebras Code Max available now!

Daniel Kim

186,185 views • 8 months ago

Cerebras just IPO’d and the stock already ran up over 100% (Save this).

For the entire 70-year history of the semiconductor industry, every company on earth has followed the same process: take a dinner-plate-sized silicon wafer, put hundreds of tiny chips onto it, and dice it up like a pizza. Nvidia does it this way, AMD does it this way, Intel has done it this way for six decades, and everyone who tried to break that convention failed. Until Cerebras asked the most annoyingly obvious question in the industry’s history: what if you just didn’t cut it? The result is the Wafer Scale Engine, a single chip 56 times larger than Nvidia’s H100, and it fundamentally changes the physics of how AI inference works.

The reason this matters is not the size, it’s the bandwidth. Every time an AI model generates a single word, it has to reach into memory, pull weights, multiply them together, and produce a prediction. When you’re running millions of concurrent sessions at once, the bottleneck is not raw processing power but how fast data moves between memory and compute. Nvidia’s H100 moves data at roughly 3 terabytes per second, while Cerebras’ WSE-3 moves data at 21 petabytes per second, roughly 7,000 times faster, because memory and compute live on the same enormous piece of silicon and data barely has to travel at all. That gap is exactly why OpenAI went from 150 tokens per second on traditional GPUs to 2,000 tokens per second on Cerebras hardware, and why AWS integrated Cerebras into Bedrock to deliver roughly 5x more inference capacity in the same physical footprint.

The macro setup is making the trade even more urgent. South Korea DRAM export prices recently jumped 35%, flash memory surged 47%, and SSD pricing spiked nearly 140%. Every single one of those increases hits Nvidia-based infrastructure directly, because the H100 requires 80GB of the most expensive, most contested memory in the AI supply chain. Cerebras’ WSE-3 uses zero external HBM, baking 44GB of SRAM directly into the wafer itself, which means that as memory pricing goes parabolic, every CFO evaluating AI infrastructure is suddenly looking much more seriously at the architecture that sidesteps that cost entirely.

The demand is already showing up in the backlog. Cerebras ended 2025 with $24.6 billion in remaining performance obligations for a company doing just over $500 million in annual revenue, a number that implies years of contracted growth already sitting on the books. The IPO was 20x oversubscribed, the price range was raised twice before listing, and shares opened 89% above their listing price on a $5.55 billion raise that made it the largest semiconductor IPO in history.

The risks are real and worth naming. 86% of 2025 revenue came from two entities with UAE ties, U.S. revenue actually fell 34% to $187 million, and the $20 billion OpenAI contract is conditional: if Cerebras misses delivery milestones, OpenAI can terminate and trigger repayment demands on a $1 billion loan facility. And yet the market is valuing Cerebras at roughly 91x trailing revenue, richer than Nvidia, AMD, and Arm combined.

What investors are betting on is not that Cerebras beats Nvidia. It is that the inference supercycle is large enough to support an entirely different architecture optimized for a different workload, and that $24.6 billion in contracted backlog converts to diversified revenue before the market starts asking harder questions. CEO Andrew Feldman said this took a decade of late nights to get right and that everyone who tried to copy it failed. Given that the entire inference economy is now running through exactly the bottleneck Cerebras was built to eliminate, the market is starting to believe him.

Milk Road AI

30,441 views • 1 month ago

Pipe (Pipe Network) is a decentralized content delivery network (dCDN) that accelerates delivery by leveraging users’ excess storage & bandwidth. More nodes = faster delivery. More from David Rhodus 🇺🇸 at Breakpoint '24

Solana

78,975 views • 1 year ago

Max Thomas 🇺🇸 running a PB of 9.90s (1.2) in his 100m season opener at the Florida Relays!

Track & Field Gazette

29,538 views • 2 months ago

Perplexity Introduces Personalized Memory 🔥 TLDR : - Gives AI assistants memory to recall preferences, interests, and past chats. - It removes manual context engineering by auto loading relevant context. - Answers become more accurate and personalized using direct memory retrieval. - Users stay fully in control with on or off switches, incognito protection, and encryption. - Memory works across all models so your context follows you anywhere in Perplexity. - Comet Assistant also gets true memory, making it more consistent, personalized, and powerful.

AshutoshShrivastava

22,006 views • 6 months ago

With TRAE Max Mode you can now scale your context window from 200k → 1M for deep coding, longer reasoning, and complex projects. You can stay in control with full transparency on token usage, cost, and fast requests right after each conversation.

TRAE

14,402 views • 9 months ago

The first-ever hydrogen-powered NVIDIA GB300 NVL72 systems are online. Training & inference at scale — with zero emissions, zero water. Lambda × Supermicro × ECL →

Lambda

17,276 views • 8 months ago

🎁 We're giving away 5 Windsurf plans ($250 credit each)! Try SWE-1.6 — Cognition’s latest fast and intelligent agentic coding model, powered by Cerebras. In a side-by-side with Claude, the speed difference is clear. More iterations, faster fixes, better code. 💬Comment why you want access to enter. Five winners will be selected at random within 48 hours.

Cerebras

104,557 views • 1 month ago

🚨 Cerebras Inference is now 3x faster: Llama3.1-70B just broke 2,100 tokens/s - 16x faster than the fastest GPU solution - 8x faster than GPUs running Llama *3B* - It's like the perf of a new hardware generation in a single software release Available now at

Cerebras

236,030 views • 1 year ago

MiniMax M3 support added to mlx-vlm with MSA implementation! 🚀 Tested on M3 Ultra 512GB running at 24 tps with peak memory ~240GB. Now working on optimizing performance and adding a ton of tests 💪 Model is here: PR is here:

Ivan Fioravanti ᯅ

24,155 views • 6 days ago

We asked James Wang why Cerebras can run large models so quickly. His answer: Inference speed is mostly a memory problem. Instead of constantly pulling weights from external memory, Cerebras splits the model across multiple wafers and pipelines the layers together. “Inference is all about memory bandwidth.” “You don’t want to store weights external to the wafer, because the second it’s outside, it becomes much slower.” “We put the weights on the inside.” “The team built a new software stack that basically lets us split the models by layer and store them layer by layer across multiple wafers.” “That allows us to never have to read from external memory, and that’s what makes it so fast.” “For context, Claude is about 100 tokens per second.”

MTS

27,867 views • 1 month ago
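The idea of splitting a model "layer by layer across multiple wafers", as quoted above, can be illustrated with a toy partitioner. This is a sketch under stated assumptions, not Cerebras' software stack: layer sizes and per-device capacity are hypothetical inputs, and layers are packed greedily in order so that each device holds a contiguous pipeline stage that fits in its on-chip memory.

```python
from typing import List


def partition_layers(layer_sizes_gb: List[float],
                     capacity_gb: float) -> List[List[int]]:
    """Greedily assign consecutive layers to devices without exceeding capacity.

    Returns one list of layer indices per device (pipeline stage).
    """
    stages: List[List[int]] = []
    current: List[int] = []
    used = 0.0
    for index, size in enumerate(layer_sizes_gb):
        if size > capacity_gb:
            raise ValueError(f"layer {index} ({size} GB) exceeds device capacity")
        if used + size > capacity_gb:
            # Current device is full: close this stage and start a new one.
            stages.append(current)
            current, used = [], 0.0
        current.append(index)
        used += size
    if current:
        stages.append(current)
    return stages


if __name__ == "__main__":
    # Hypothetical: five 10 GB layers, 25 GB of on-chip memory per device.
    print(partition_layers([10, 10, 10, 10, 10], capacity_gb=25))
```

Because every stage fits entirely in on-chip memory, weights never need to be fetched from external memory during inference; only activations move between stages.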

Built FlickAI in 24 hours at the Cerebras x Cline Vibe Coder Hackathon ▨ A desktop AI assistant that: • Sees what’s on your screen • Wakes up instantly • Helps with coding, emails, notes, anything and everything. Built this with my teammate Samarth Saxena

Maaz (testnet arc)

27,129 views • 4 months ago

AI coding without systems thinking is just tech debt on speedrun. Delty (@delty_ai) is an AI Staff Engineer for your team. With deep expertise, it designs software systems, evaluates tradeoffs and makes AI coding agents smarter with your engineering context. Congrats on the launch, Lalit Kundu and Catherine!

Y Combinator

73,297 views • 1 year ago

Draft email replies faster. Suggested Replies offer one-tap responses that are based on the context of the conversation and match how you write. 🪄 Suggested Replies are available to US users at no cost! Proofread requires a Google AI Pro or Ultra subscription. →

Gmail

22,253 views • 4 months ago

OpenAI Codex-Spark powered by Cerebras You can now just build things faster—at 1,000 tokens/s.

Cerebras

287,320 views • 4 months ago

AI coding agents hit a wall when codebases get massive. Even with 2M token context windows, a 10M line codebase needs 100M tokens. The real bottleneck isn't just ingesting code - it's getting models to actually pay attention to all that context effectively.

Garry Tan

976,161 views • 1 year ago
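The scale gap in the post above can be made concrete. The sketch below uses only the quoted figures (10M lines, 100M tokens, a 2M-token window); the function names are illustrative, and it assumes the codebase is split into windows with no overlap.

```python
def tokens_per_line(total_tokens: int, total_lines: int) -> float:
    """Average tokens per line of code implied by the quoted totals."""
    return total_tokens / total_lines


def windows_needed(total_tokens: int, context_window: int) -> int:
    """Context windows needed to cover the codebase once (ceiling division)."""
    return -(-total_tokens // context_window)


if __name__ == "__main__":
    print(tokens_per_line(100_000_000, 10_000_000), "tokens per line")
    print(windows_needed(100_000_000, 2_000_000), "full context windows")
```

The quoted totals imply about 10 tokens per line, so even a 2M-token window sees only one fiftieth of such a codebase at a time.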