Running a single deep coding model at max context on Cerebras requires 24 systems ($24M Capex) just to support 256 concurrent users. At that scale, $100M gets you way more memory bandwidth in standard GB300 racks.
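
The figures in the post reduce to simple per-unit arithmetic. Below is a minimal sketch that uses only the numbers quoted above (24 systems, $24M capex, 256 concurrent users, a $100M budget). The variable names are illustrative, and the sketch says nothing about GB300 pricing or bandwidth, which the post does not give.

```python
# Figures quoted in the post; all names here are illustrative.
systems = 24
capex_usd = 24_000_000
concurrent_users = 256
budget_usd = 100_000_000

# Capex per system and per concurrently served user.
cost_per_system = capex_usd / systems
capex_per_user = capex_usd / concurrent_users

# Whole 24-system deployments that fit into the budget, and the users they serve.
deployments_in_budget = budget_usd // capex_usd
users_in_budget = deployments_in_budget * concurrent_users

print(f"cost per system:        ${cost_per_system:,.0f}")
print(f"capex per user:         ${capex_per_user:,.0f}")
print(f"deployments for $100M:  {deployments_in_budget}")
print(f"concurrent users:       {users_in_budget}")
```

At the quoted numbers, each concurrent user carries roughly $94K of capex, which is the comparison point the post sets against GPU racks.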

SemiAnalysis

109,405 subscribers

93,448 views • 1 month ago • via X (Twitter)

Education, Science & Technology

Related videos

New course on serving LLMs efficiently -- how do you serve models to many concurrent users at low latency and reasonable cost? This short course is built with Red Hat and taught by Cedric Clyburn. Efficient LLM serving requires efficient memory management. A 70B-parameter model takes ~140 GB just to load the weights. On top of that, every active request needs its own chunk of GPU memory, the KV cache, to store the token context it has built up so far. In this course, you'll learn to reduce a model's memory footprint with quantization and serve it using vLLM, which handles many concurrent requests efficiently through smart memory management. Skills you'll gain:
- Quantize a model and measure the accuracy tradeoff
- Serve a model with vLLM and watch it handle concurrent requests efficiently
- Benchmark your deployment and make informed tradeoffs between speed, cost, and accuracy
Join and learn to serve LLMs efficiently:
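
The "~140 GB for a 70B-parameter model" figure follows from storing each weight in 16 bits. Here is a rough sketch of that arithmetic, and of how lower-precision storage shrinks it. The helper name `weight_memory_gb` is made up for illustration, and real quantized formats carry some extra overhead (scales, zero points) that this ignores.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory needed just to hold the weights, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


PARAMS_70B = 70e9

# 16-bit weights reproduce the ~140 GB figure; 8-bit and 4-bit storage
# halve and quarter it (ignoring quantization metadata).
for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gb(PARAMS_70B, bits):.0f} GB")
```

This counts weights only; the per-request KV cache mentioned in the post comes on top.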

Andrew Ng

125,665 views • 1 month ago

DS4 running on DGX Spark (GB10 / CUDA), private branch for now. 12 tokens/sec; memory bandwidth is the limit on this system, at 270 GB/sec. But prefill is much more in line with the M3 Max, at ~200 t/s. I'll release it when it is more mature, but it is almost certain that it will get merged.
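
One way to read the two numbers in the post together: if decode is limited purely by memory bandwidth, tokens per second is roughly bandwidth divided by the bytes that must be read per token. The sketch below works backwards from the quoted 12 tokens/sec and 270 GB/sec. It assumes a perfectly bandwidth-bound decode with a fixed read per token, which is a simplification, and the function names are illustrative.

```python
def implied_gb_per_token(bandwidth_gb_s: float, tokens_per_s: float) -> float:
    """Data read per generated token if bandwidth is the only limit."""
    return bandwidth_gb_s / tokens_per_s


def bandwidth_bound_tokens_per_s(bandwidth_gb_s: float, gb_per_token: float) -> float:
    """Upper bound on decode speed for a given per-token read volume."""
    return bandwidth_gb_s / gb_per_token


# Numbers quoted in the post.
gb_per_token = implied_gb_per_token(bandwidth_gb_s=270, tokens_per_s=12)
print(f"implied read per token: {gb_per_token:.1f} GB")

# Hypothetical what-if: the same per-token read on a system with twice the bandwidth.
print(f"at 540 GB/s: {bandwidth_bound_tokens_per_s(540, gb_per_token):.0f} tokens/sec")
```

The model size is not stated in the post, so the implied per-token read is only what the two quoted figures suggest, not a measured value.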

antirez

84,069 views • 2 months ago

🎉 Day-0 support for in vLLM, available today in v0.23.0! Congrats to Z.ai on GLM-5.2, a flagship model built for long-horizon coding agents.
✨ 1M-token context, built to hold project-scale engineering work in a single run
✨ Tuned for long-horizon coding: large-scale implementation, automated research, and performance optimization
✨ One task can carry a full dev workflow, from requirements to a deployable product across platforms
✨ Client-side and mobile engineering, including an on-device debugging loop
Try it out running it on vLLM today: 🔗

vLLM

36,019 views • 1 month ago

Etched is deploying two new technologies in chip design: low-voltage inference and cluster-scale memory. CEO Gavin Uberti says they'll make their chips much more power-efficient and way, way faster than today's leading GPUs. He breaks it down: "We looked at a lot of early research directions, and we realized the key things that models need are way more compute and way faster memory." "If you think about inference, there are two key parts: prefill and decode. For prefill, it's a compute-bound problem. You need to have more FLOPS, more operations per second on each of your chips." "On our GPU, the bottleneck's actually thermals. You can't really run a GPU at more than around 50% of what it could theoretically do, or it'll melt." "So we're using a new technology today called low-voltage inference to try to solve this problem. You bring the voltage of the chip down dramatically, which allows us to have way, way better efficiency in terms of how much power is drawn per unit of math, and thus fit way way more flops onto the chip..." "For decode, it's all about bandwidth. Not just bandwidth on a chip, but bandwidth across your cluster. That's why we have this technology we call cluster-scale memory. It reduces the amount of time it takes to communicate from one chip to another dramatically." "As a result we can go use all of our HBM, HBM bandwidth, SRAM, SRAM bandwidth, and our scale-up domain as a single coherent pool. And that means if you're a user, you can go get much faster tokens per second, while still keeping your costs low."
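
The prefill/decode split described in the quotes can be illustrated with a simple roofline-style check: compare how many FLOPs a step performs per byte of weights read against the chip's ratio of peak compute to memory bandwidth. This is a simplified sketch, not Etched's method. The hardware numbers are hypothetical placeholders, and the model counts only weight reads (no KV cache or activations).

```python
def arithmetic_intensity(tokens_in_flight: int, bytes_per_param: float = 2.0) -> float:
    """FLOPs per byte of weights read.

    Each parameter costs 2 FLOPs (multiply + add) per token and is read once
    per step, so intensity grows with the number of tokens processed together.
    """
    return 2.0 * tokens_in_flight / bytes_per_param


def bound(tokens_in_flight: int, peak_flops: float, mem_bandwidth: float,
          bytes_per_param: float = 2.0) -> str:
    """Classify a step as compute-bound or bandwidth-bound."""
    machine_balance = peak_flops / mem_bandwidth  # FLOPs the chip can do per byte moved
    intensity = arithmetic_intensity(tokens_in_flight, bytes_per_param)
    return "compute-bound" if intensity >= machine_balance else "bandwidth-bound"


# Hypothetical accelerator: 1e15 FLOP/s peak, 3e12 bytes/s memory bandwidth.
PEAK_FLOPS = 1e15
MEM_BANDWIDTH = 3e12

print("prefill, 4096-token prompt:", bound(4096, PEAK_FLOPS, MEM_BANDWIDTH))
print("decode, 1 token per step:  ", bound(1, PEAK_FLOPS, MEM_BANDWIDTH))
```

Under these assumptions a long prompt keeps the compute units busy, while single-token decode waits on memory, which matches the distinction drawn in the quotes.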

TBPN

20,404 views • 23 days ago

Cerebras just IPO’d and the stock already ran up over 100% (Save this). For the entire 70 year history of the semiconductor industry, every company on earth has followed the same process. You take a dinner plate sized silicon wafer, put hundreds of tiny chips onto it, and dice it up like a pizza. Nvidia does it this way, AMD does it this way, Intel has done it this way for six decades and everyone who tried to break that convention failed. Until Cerebras asked the most annoyingly obvious question in the industry’s history, what if you just didn’t cut it? The result is the Wafer Scale Engine, a single chip 56 times larger than Nvidia’s H100 and it fundamentally changes the physics of how AI inference works. The reason this matters is not the size, it’s the bandwidth. Every time an AI model generates a single word, it has to reach into memory, pull weights, multiply them together, and produce a prediction and when you’re running millions of concurrent sessions at once, the bottleneck is not raw processing power but how fast data moves between memory and compute. Nvidia’s H100 moves data at roughly 3 terabytes per second, while Cerebras’ WSE-3 moves data at 21 petabytes per second, roughly 7,000 times faster because memory and compute live on the same enormous piece of silicon and data barely has to travel at all. That gap is exactly why OpenAI went from 150 tokens per second on traditional GPUs to 2,000 tokens per second on Cerebras hardware, and why AWS integrated Cerebras into Bedrock to deliver roughly 5x more inference capacity in the same physical footprint. The macro setup is making the trade even more urgent. South Korea DRAM export prices recently jumped 35%, flash memory surged 47%, and SSD pricing spiked nearly 140% and every single one of those increases hits Nvidia-based infrastructure directly, because the H100 requires 80GB of the most expensive, most contested memory in the AI supply chain. 
Cerebras’ WSE-3 uses zero external HBM memory, baking 44GB of SRAM directly into the wafer itself which means as memory pricing goes parabolic, every CFO evaluating AI infrastructure is suddenly looking much more seriously at the architecture that sidesteps that cost entirely. The demand is already showing up in the backlog. Cerebras ended 2025 with $24.6 billion in remaining performance obligations for a company doing just over $500 million in annual revenue, that is a number that implies years of contracted growth already sitting on the books. The IPO was 20x oversubscribed, the price range was raised twice before listing, and shares opened 89% above their listing price on a $5.55 billion raise that made it the largest semiconductor IPO in history. The risks are real and worth naming. 86% of 2025 revenue came from two entities with UAE ties, U.S. revenue actually fell 34% to $187 million, and the $20 billion OpenAI contract is conditional, if Cerebras misses delivery milestones, OpenAI can terminate and trigger repayment demands on a $1 billion loan facility. And yet the market is valuing Cerebras at roughly 91x trailing revenue, richer than Nvidia, AMD, and Arm combined. What investors are betting on is not that Cerebras beats Nvidia, it is that the inference supercycle is large enough to support an entirely different architecture optimized for a different workload, and that $24.6 billion in contracted backlog converts to diversified revenue before the market starts asking harder questions. CEO Andrew Feldman said this took a decade of late nights to get right, everyone who tried to copy it failed and given that the entire inference economy is now running through exactly the bottleneck Cerebras was built to eliminate, the market is starting to believe him.
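
The ratios quoted in this post can be checked directly from its own numbers. The sketch below does only that. It takes the post's figures at face value, and the revenue figure is rounded down to $500M because the post says "just over".

```python
# Bandwidth figures quoted in the post (bytes per second).
h100_bandwidth = 3e12     # ~3 TB/s, off-chip HBM
wse3_bandwidth = 21e15    # 21 PB/s, on-wafer SRAM
bandwidth_ratio = wse3_bandwidth / h100_bandwidth

# Token throughput figures quoted in the post.
gpu_tokens_per_s = 150
cerebras_tokens_per_s = 2000
throughput_ratio = cerebras_tokens_per_s / gpu_tokens_per_s

# Backlog versus revenue; revenue is "just over $500 million",
# so the true multiple is slightly lower than computed here.
backlog_usd = 24.6e9
annual_revenue_usd = 500e6
backlog_to_revenue = backlog_usd / annual_revenue_usd

print(f"bandwidth ratio:    {bandwidth_ratio:,.0f}x")
print(f"throughput ratio:   {throughput_ratio:.1f}x")
print(f"backlog / revenue:  {backlog_to_revenue:.1f}x")
```

Note that the quoted throughput gain (about 13x) is far smaller than the quoted bandwidth ratio (about 7,000x), so raw bandwidth does not translate one-to-one into tokens per second.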

Milk Road AI

30,441 views • 2 months ago

Micron is going to $4,000 and once you understand what inference actually is, the number stops sounding crazy (Save this). Dylan Patel just said that by 2030, OpenAI and Anthropic alone will need over 100 gigawatts of compute combined and by 2040, we may not even be measuring AI infrastructure in gigawatts anymore. We may be talking about terawatts. Every single one of those gigawatts needs memory to function. Without it, the compute is worthless. Most people heard that and thought about Nvidia but they should be thinking about Micron. Every AI model generating a response has two phases. The first is prefill, processing your prompt which is compute-heavy and the second is decode generating each word one token at a time and that phase is almost entirely memory-bound, not compute-bound. During decode, the GPU's processing units sit idle more than 95% of the time, waiting for data to arrive from memory. Google confirmed it in a research paper that decode-phase bottlenecks are dominated by memory bandwidth and capacity not raw compute. The GPU is not the bottleneck but the memory feeding the GPU is. This matters because inference is now where all the money lives. Training a model happens once, Inference happens billions of times a day every ChatGPT response, every Claude output, every agentic workflow running in the background and every one of those token streams is a billing event tied directly to memory performance. Adding more GPUs does not fix this because GPUs are already underutilized in inference because they are sitting idle waiting on memory. Adding more memory bandwidth and capacity is what directly reduces token cost, reduces latency, and allows the same cluster to serve dramatically more users simultaneously. Longer context windows compound the problem further, a model running a 1 million token context window requires dramatically more memory per session than a 10,000 token window, and every new model generation pushes context longer. 
The market treats memory as a downstream beneficiary of Nvidia orders. The correct framework is the opposite, Micron is the upstream constraint on how much value every Nvidia GPU can actually generate at inference scale. Micron guided Q4 to $50 billion in revenue, has HBM4 ramping at twice the pace of the prior generation, and CEO Sanjay Mehrotra has said supply will not catch demand before the end of 2027. At 8x forward earnings on $112 projected FY2027 EPS, Micron is the most undervalued infrastructure company in the entire AI stack. Inference is memory. Memory is Micron and the inference ramp has barely started. Milk Road Pro members are already up massively on this position and we're just getting started. If you want the full breakdown of what we're buying and why, come join us for just a dollar using the link below!
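
The claim that a 1-million-token context needs far more memory per session than a 10,000-token one can be made concrete with the usual KV-cache size formula for a standard transformer: two tensors (keys and values) per layer, per KV head, per token. The model shape below is a hypothetical example, not taken from the post, and architectures with compressed or hybrid attention would need less.

```python
def kv_cache_bytes(tokens: int, layers: int = 80, kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """KV-cache size for one session of a standard attention model.

    The default shape (80 layers, 8 KV heads, head_dim 128, 16-bit values)
    is a hypothetical example chosen for illustration.
    """
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # keys + values
    return per_token * tokens


short_ctx = kv_cache_bytes(10_000)
long_ctx = kv_cache_bytes(1_000_000)

print(f"10K-token session: {short_ctx / 1e9:.2f} GB")
print(f"1M-token session:  {long_ctx / 1e9:.2f} GB")
print(f"ratio:             {long_ctx // short_ctx}x")
```

For this kind of model the cache grows linearly with context, so the 1M-token session needs 100 times the memory of the 10K-token one regardless of the exact shape chosen.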

Milk Road AI

128,522 views • 23 days ago

🎁 We're giving away 5 Windsurf plans ($250 credit each)! Try SWE-1.6 — Cognition’s latest fast and intelligent agentic coding model, powered by Cerebras. In a side-by-side with Claude, the speed difference is clear. More iterations, faster fixes, better code. 💬Comment why you want access to enter. Five winners will be selected at random within 48 hours.

Cerebras

105,033 views • 2 months ago

We asked James Wang why Cerebras can run large models so quickly. His answer: Inference speed is mostly a memory problem. Instead of constantly pulling weights from external memory, Cerebras splits the model across multiple wafers and pipelines the layers together. “Inference is all about memory bandwidth.” “You don’t want to store weights external to the wafer, because the second it’s outside, it becomes much slower.” “We put the weights on the inside.” “The team built a new software stack that basically lets us split the models by layer and store them layer by layer across multiple wafers.” “That allows us to never have to read from external memory, and that’s what makes it so fast.” “For context, Claude is about 100 tokens per second.”
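
The "split the models by layer and store them layer by layer across multiple wafers" idea can be sketched as a greedy partition: fill one wafer's on-chip memory with consecutive layers, then move to the next. This illustrates the general pipelining idea only and is not Cerebras's actual software stack. The function name, the layer sizes, and the per-wafer capacity are all assumptions.

```python
def split_layers_across_wafers(layer_sizes_gb: list[float],
                               wafer_capacity_gb: float) -> list[list[int]]:
    """Assign consecutive layers to pipeline stages (one stage per wafer).

    Greedy: keep adding layers to the current wafer until the next one
    would exceed its on-chip capacity, then start a new wafer.
    """
    stages: list[list[int]] = [[]]
    used = 0.0
    for idx, size in enumerate(layer_sizes_gb):
        if size > wafer_capacity_gb:
            raise ValueError(f"layer {idx} ({size} GB) does not fit on one wafer")
        if stages[-1] and used + size > wafer_capacity_gb:
            stages.append([])
            used = 0.0
        stages[-1].append(idx)
        used += size
    return stages


# Hypothetical model: 80 equal layers of 1.75 GB (140 GB total),
# with an assumed 44 GB of on-chip memory per wafer.
stages = split_layers_across_wafers([1.75] * 80, wafer_capacity_gb=44.0)
for wafer, layers in enumerate(stages):
    print(f"wafer {wafer}: layers {layers[0]}-{layers[-1]} ({len(layers)} layers)")
```

A real deployment would also have to reserve on-chip memory for the KV cache and activations, so the usable capacity per wafer would be lower than in this sketch.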

MTS

27,935 views • 2 months ago

Draft email replies faster. Suggested Replies offer one-tap responses that are based on the context of the conversation and match how you write. 🪄 Suggested Replies are available to US users at no cost! Proofread requires a Google AI Pro or Ultra subscription. →

Gmail

22,253 views • 5 months ago

Elon Musk: In order to run like really deep intelligence, you need a lot of compute. It’s not like you can just fire up a PC in your basement and be running AGI. At least not yet. Grok was trained on eight thousand A100s running at peak efficiency. And Grok’s going to get a lot better, by the way.

Ian Miles Cheong

85,675 views • 9 months ago

Qwen3.5 runs quite well in mlx-lm. Awesome that we have a frontier-level hybrid model. The context gets longer but the inference speed and memory use barely change. Here's the Q4 generating a space invaders game on an M3 Ultra. Generated 4,120 tokens at 37.6 tok/s.

Awni Hannun

48,673 views • 5 months ago

Everyone's sleeping on MiniMax. Again. They just shipped M3. The first open-weights model to combine frontier coding, 1M context, and native multimodality in one drop. I plugged it into Claude Code this morning. Pasted a design from Dribbble. Watched M3 write production-ready React code in one session. At the agency, I just replaced Opus 4.8 with M3 for 80% of our coding tasks. The output is the same and we are running everything at a fraction of the cost. Open infrastructure is the future.

Prajwal Tomar

12,904 views • 1 month ago

You asked, we delivered! 🔐 UniSat now supports inscribing BRC-20 and Runes in a single transaction! This means you can inscribe Runes alongside BRC-20 more cost-efficiently than inscribing them separately. ✨
✅ Simultaneously inscribe BRC-20 and Runes in one UTXO
✅ Save on inscribing fees
✅ Faster & more efficient inscribing
✅ Available on both Bitcoin and Fractal mainnets
🙌 UniSat Inscribe Service on Bitcoin: In a single order, the first 20 items are FREE, with a max fee capped at 4,999 sats. The more you inscribe, the cheaper each inscription gets! 🔗 Learn more about double inscribing:

UniSat - wallet, explorer & extension for bitcoin.

22,010 views • 1 year ago

Qwen 3.6 35B running at over 100 tokens per second on my $5,399 MacBook Pro M5 Max. This is the best local AI model I have ever run. 128GB of unified memory. No cloud. No API costs. No rate limits. Just raw local inference at speeds I didn't think were possible on a laptop. This model is more intelligent than GPT 5 on benchmarks. Running locally. On a MacBook. For free after the hardware cost. I said local AI would never compete with frontier. I'm starting to rethink that. The gap is closing faster than anyone expected.

BridgeMind

62,800 views • 1 month ago

SITUATION EXPLAINED: Cerebras raised $5.55 billion in its IPO and closed its first day of trading valued at $66 billion, making it the biggest US tech IPO since Snowflake in 2020. Cerebras makes Wafer-Scale Engine chips built for AI inference. We asked Sarah Fong the main difference between wafer-scale chips and traditional GPUs:
- GPUs are great at parallel work (graphics, training)
- AI inference is sequential, AKA one token at a time
This causes the "memory wall" problem:
- Every GPU core needs model weights, KV cache, and activations to do its math
- On a GPU, that data lives in off-chip memory (HBM)
- Cores constantly load and offload from off-chip memory, which is a huge bottleneck; hardware accounts for ~70% of inference latency
Cerebras' chips:
- Dinner-plate sized (vs. GPUs, which are palm-sized) with tens of thousands of cores
- Memory sits directly on top of the cores as distributed SRAM
- Weights and KV cache can be accessed at on-chip speeds in the PB/s range, compared with off-chip speeds in the TB/s range achieved by GPUs with HBM

MTS

44,057 views • 2 months ago

MEET THE NVIDIA KILLER: OpenAI bet $10 BILLION on this company that makes chips 20x faster than Nvidia's. If this plays out as expected, it's over for Nvidia.
Cerebras Systems just locked in 750 megawatts of computing power to OpenAI through 2028. For reference: that's equivalent to the annual power consumption of 600,000 US homes. The deal? Over $10 billion.
Here's what nobody understands: Cerebras doesn't make normal chips. Nvidia sells you thousands of tiny chips that you connect together. Cerebras makes ONE chip. A single wafer-scale processor the size of a dinner plate. 900,000 AI cores. 4 trillion transistors. All on one piece of silicon.
The result? When OpenAI tested it, Cerebras ran inference 20X FASTER than Nvidia GPUs. That's not incremental improvement. That's a different category of performance.
But here's where the story gets wild: Four months ago, Cerebras was a struggling company. Their IPO filing revealed that 87% of their revenue came from ONE customer: G42, a UAE-based AI firm. The US government launched a national security review. G42 had ties to Huawei. Ties to China. The IPO collapsed. Investors panicked. Cerebras withdrew their filing in October 2025.
Most startups would've been dead. Instead, Cerebras did the opposite. They raised $1.1 billion at an $8.1 billion valuation. Kicked G42 out of the cap table entirely. Got CFIUS clearance. Then landed the OpenAI deal. Now they're raising ANOTHER $1 billion at a $22 billion valuation. They more than DOUBLED their valuation in 4 months. From near-death to $22 billion. While getting rid of their biggest customer.
Why OpenAI chose them: ChatGPT has 900 million weekly users. Sam Altman keeps saying they have a "severe shortage" of compute. They need SPEED, not just power. When you ask ChatGPT a question, there's a loop happening:
You send request → model thinks → sends response back
Nvidia chips are fast at training models. Cerebras chips are built specifically for inference. For real-time responses. For the exact bottleneck OpenAI is trying to solve.
Sachin Katti from OpenAI said it best: "Cerebras adds a dedicated low-latency inference solution to our platform. That means faster responses, more natural interactions, and a stronger foundation to scale real-time AI to many more people." In other words: "We need this to scale ChatGPT."
The competitive landscape just shifted: Nvidia announced a $100 billion deal with OpenAI in September. But it's still not finalized. Meanwhile, Cerebras closed their deal before Thanksgiving. And it's ALREADY being deployed.
Here's the part that should terrify Nvidia: In December, Nvidia bought Groq for $20 billion. Groq makes fast inference chips. Just like Cerebras. So why would Nvidia spend $20 billion buying a competitor to something they supposedly already dominate? Because they know what's coming. Inference is the new battleground. And Cerebras is winning it.
The IPO is coming Q2 2026. After this OpenAI deal, Cerebras now has:
✓ IBM contracts
✓ Department of Energy contracts
✓ OpenAI locked in for 3 years
✓ $22 billion valuation
✓ CFIUS clearance
✓ Zero customer concentration risk
They went from 87% revenue dependency on one customer to the most diversified chip company outside Nvidia. In four months.
The lesson? Smart money doesn't follow headlines. It follows where the AI leaders are actually spending. OpenAI didn't announce this deal for publicity. They need Cerebras hardware to scale ChatGPT. That's a $10 billion vote of confidence.
While everyone's watching Nvidia stock, the real war is happening in inference. And the company with ONE giant chip just beat the company with thousands of tiny ones.
What do you think happens when Cerebras IPOs?

Ricardo

28,088 views • 6 months ago

Our research team just released Flex-Forcing: a video generation method that lets a single model switch between generation methods at inference time. Right now there are two main approaches to video generation. Bidirectional diffusion models attend to every frame at once, holding structure well at the cost of speed. Autoregressive models generate frame by frame, so they stream fast and scale to long clips, but accumulate error and drift over time. Flex-Forcing trains a single model to do both, letting you choose from the range at inference based on your compute budget.

NVIDIA AI

31,412 views • 14 days ago

ByteDance just dropped UNO on Hugging Face. "Less-to-More Generalization: Unlocking More Controllability by In-Context Generation": a universal framework that evolves from single-subject to multi-subject customization. UNO demonstrates strong generalization capabilities and is capable of unifying diverse tasks under one model.

AK

82,709 views • 1 year ago

🚨| Taylor Swift meeting a family at the children's hospital today: “Gorgeous family! OMG you guys are beautiful every single one of you!” The family on Taylor: changed it from something terrifying to a memorable memory that she gets to talk about forever.

Taylor Swift Updates

387,843 views • 1 year ago

Thanks to @StellaDudzic, I've just discovered that the new model Casio calculators don't deal with the standard form button in the same way at all. I really like the new models, but this is not a good change...

Peter Williams

27,393 views • 2 years ago