
introducing simple-llm: a ~950 line, powerful & extensible inference engine that performs on par with vllm. enjoy :)

performance (gpt-oss-120b, on an h100):
- batch=1: 135 tok/s (vllm: 138)
- batch=64: 4,041 tok/s (vllm: 3,846)

naklecha

16,895 subscribers

59,730 views • 6 months ago • via X (Twitter)
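To put "on par with vllm" in numbers, here is a small sketch that computes the relative gap at each batch size from the figures quoted in the post. The helper names are made up for illustration and are not part of simple-llm or vLLM.

```python
# Throughput figures quoted in the post (gpt-oss-120b on one H100), in tok/s.
RESULTS = {
    1: {"simple_llm": 135.0, "vllm": 138.0},
    64: {"simple_llm": 4041.0, "vllm": 3846.0},
}


def relative_gap_pct(ours: float, baseline: float) -> float:
    """Percentage difference of `ours` relative to `baseline`."""
    return (ours - baseline) / baseline * 100.0


def per_sequence_rate(total_tok_s: float, batch: int) -> float:
    """Average tok/s seen by each sequence in the batch."""
    return total_tok_s / batch


gaps = {b: relative_gap_pct(r["simple_llm"], r["vllm"]) for b, r in RESULTS.items()}

if __name__ == "__main__":
    for batch, gap in gaps.items():
        rate = per_sequence_rate(RESULTS[batch]["simple_llm"], batch)
        print(f"batch={batch}: {gap:+.1f}% vs vllm, {rate:.1f} tok/s per sequence")
```

By these figures simple-llm is roughly 2% behind at batch=1 and roughly 5% ahead at batch=64.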


Related videos

Demo for the sharded inference engine on: gpt-oss-120b on 4090's, max input context 95k, max output context infinite (tok/s degrades with context) queue of max 12 prompts concurrently, use it wisely


leyten

90,845 views • 1 month ago

Gemma 4 Diffusion landed in vLLM last week. Day 0. First diffusion LLM natively supported in vLLM. Instead of one token at a time, it predicts 256 tokens at once and iteratively denoises them in parallel. Result: 1,000+ tokens per second at batch size 1 on a single H100. Built on Model Runner V2. Google Gemma


Red Hat AI

17,637 views • 1 month ago

16 people chatting with Qwen3.6-35B at once on ONE DGX Spark. This is a real capture, not a mockup: every token you see replays at its actual measured arrival time. Peaks ~440 tok/s total. A single user gets 105 tok/s. NVFP4 + MTP-3 on vLLM.


Wësche

26,094 views • 20 days ago

Dozens of teams have asked my advice on running LLMs. How fast is DeepSeek V3 with vLLM on 8 GPUs? What's the max throughput of Qwen 2.5 Coder with SGLang on one H100? Running & sharing benchmarks ad hoc was too slow, so we built a tiny app, the LLM Engine Advisor


Charles 🎉 Frye

86,411 views • 1 year ago

Ran Google’s cookbook with 10 agents on my tiny GB10 GPU. 436 tok/s / 43.6 per agent Qwen3.6-35B + Dflash + DDTree on vLLM GB10 @ 74W The future isn't 10,000 GPUs in a nuclear-powered data center. It’s 10 agents on your desk solving your problems while you make your coffee.


Mitko Vasilev

146,363 views • 3 months ago

Qwen3.6 35B-A3B dropped yesterday, so I ran it on 4 GPUs to see how it performs:

🟣 RTX 3090: 49.78 tok/s, TTFT 852ms
🟡 RTX 4090: 118.93 tok/s, TTFT 686ms
🟢 RTX 5090: 160.37 tok/s, TTFT 409ms
🔵 DGX Spark: 59.98 tok/s, TTFT 228ms

I went with ollama as the backend because honestly, it's the easiest way for most people to get started. One command, model pulled, done.

I used Q4_K_M (24GB) across all four cards. The reason is the 3090 and 4090 don't support NVFP4 (only the 5090 and DGX Spark could use it). Keeping the same quant everywhere felt like the fairest way to compare.

And yes, you can absolutely squeeze more performance out of every card with vLLM, SGLang, or TensorRT-LLM. But that's not what this test is about. This is just the out-of-the-box experience for folks who own a GPU and want to try the new model tonight.


stevibe

401,036 views • 3 months ago

got gemma 4 31B with MTP running on my DGX Spark. Hermes Agent did most of the legwork.

baseline vs MTP on GB10:
• c=1: 3.65 → 6.37 tok/s (1.74x)
• c=4: 14.34 → 23.59 tok/s (1.65x)
• c=8: 14.37 → 24.18 tok/s (1.68x)

google says "up to 2x"; we're not quite there but it's real, not vapor.

stack: DGX Spark / GB10 + gemma-4-31b-it + gemma-4-31b-it-assistant (MTP drafter) + vLLM built from PR 41745

MTP is basically a lightweight draft model that predicts multiple tokens while the big model verifies them all at once. smaller model does the busywork, bigger model just says yes/no. simple idea, weird to implement.

next: tune the draft block size and see if we can push past 2x. also want to try it with Hermes Agent feeding prompts end to end.

p.s: this was all done from telegram. Google DeepMind NVIDIA AI Developer


Joey

23,088 views • 2 months ago
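The speedups in the post above can be recomputed directly from the quoted tok/s pairs. This is a quick sketch in plain Python; the variable names are illustrative only.

```python
# (baseline tok/s, MTP tok/s) per concurrency level, as quoted in the post.
RUNS = {1: (3.65, 6.37), 4: (14.34, 23.59), 8: (14.37, 24.18)}

# Speedup is MTP throughput divided by baseline throughput.
speedups = {c: mtp / base for c, (base, mtp) in RUNS.items()}

# How much the baseline itself gains when going from c=4 to c=8.
baseline_scaling_4_to_8 = RUNS[8][0] / RUNS[4][0]

if __name__ == "__main__":
    for c, s in speedups.items():
        print(f"c={c}: {s:.3f}x")
    print(f"baseline c=4 -> c=8: {baseline_scaling_4_to_8:.3f}x")
```

The baseline barely moves between c=4 and c=8, which may mean the box is already saturated at four concurrent requests; the post itself does not say.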

I added KV caching and INT8 KV quantization to our transformer inference, improving throughput by 35x. All of this was done from scratch in Rust + CUDA, on top of a homemade ML framework.

On a 4-token prompt with 252 generated tokens:
- Original: 0.76 tok/s
- KV cache fp32: 27.21 tok/s
- KV cache int8 (quantized): 27.29 tok/s

Try it out yourself here:

In practice:
- KV caching gave us about a 35x end-to-end speedup
- INT8 KV cache kept roughly the same speed as fp32 but cut KV cache memory by 3.78x

FP32 cache used 4.5 MB in this run while the INT8 cache used only 1.19 MB

This simple change to inference created a huge impact on performance. To learn more about the KV cache and other optimizations like this, check out the blog at


Reese Chong

52,588 views • 3 months ago
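For readers who want to see the idea behind an INT8 KV cache, here is a minimal pure-Python sketch. It is not the author's Rust + CUDA code. It assumes symmetric absmax quantization with one fp32 scale per cached vector, which is one common scheme, and the class and function names are hypothetical.

```python
from typing import List, Tuple

Quantized = Tuple[List[int], float]  # (int8 values, fp32 scale)


def quantize_int8(vec: List[float]) -> Quantized:
    """Symmetric absmax quantization of one key/value vector to int8."""
    absmax = max(abs(x) for x in vec)
    scale = absmax / 127.0 if absmax > 0 else 1.0
    q = [max(-127, min(127, round(x / scale))) for x in vec]
    return q, scale


def dequantize_int8(q: List[int], scale: float) -> List[float]:
    """Map int8 values back to approximate floats."""
    return [v * scale for v in q]


class Int8KVCache:
    """Append-only cache: one quantized key and one value vector per token."""

    def __init__(self) -> None:
        self.keys: List[Quantized] = []
        self.values: List[Quantized] = []

    def append(self, key: List[float], value: List[float]) -> None:
        # Called once per generated token, so past keys/values are never recomputed.
        self.keys.append(quantize_int8(key))
        self.values.append(quantize_int8(value))

    def read(self) -> Tuple[List[List[float]], List[List[float]]]:
        keys = [dequantize_int8(q, s) for q, s in self.keys]
        values = [dequantize_int8(q, s) for q, s in self.values]
        return keys, values

    def nbytes(self) -> int:
        # 1 byte per int8 element plus 4 bytes for the fp32 scale of each vector.
        return sum(len(q) + 4 for q, _ in self.keys + self.values)

    def fp32_nbytes(self) -> int:
        # Size the same entries would take if stored as fp32.
        return sum(4 * len(q) for q, _ in self.keys + self.values)
```

Under this scheme the saving is always a little below 4x because each vector carries its own scale. That is at least consistent with the 3.78x quoted in the post, though the post does not say which scheme was used.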

PowerInfer - a high-speed inference engine for deploying LLMs locally. Just came across this super interesting project on speeding up inference. It's not MoE but it's a simple approach that exploits the high locality in LLM inference to design a GPU-CPU hybrid inference engine. Hot-activated neurons are preloaded onto the GPU for fast access, while cold-activated neurons (the majority) are computed on the CPU. This approach significantly reduces GPU memory demands and CPU-GPU data transfer. It achieves an average token generation rate of 13.20 tokens/s, with a peak of 29.08 tokens/s, across various LLMs on a single NVIDIA RTX 4090 GPU. That is only 18% lower than the rate achieved by a top-tier server-grade A100 GPU. It also significantly outperforms llama.cpp by up to 11.69x while retaining model accuracy. There is a lot more innovation around inference that's coming fast. Really encouraged by the study on sparse computation to enhance the computational efficiency of LLMs. It's now possible to use PowerInfer with Llama 2 and Falcon 40B. Mistral-7B support is coming soon!


elvis

261,622 views • 2 years ago

Announcing Cloudy, a platform that seamlessly handles your training infrastructure. You can rent a single H100 or a cluster of 1000 H100s, manage petabyte-scale storage volumes & seamlessly go from running experiments to managing large scale training runs on a single interface.


naklecha

33,138 views • 9 months ago

Timelapse #156 (36 hrs) - Worked with the tiny corp on getting GLM 5.2 running on 8xMI300X (sglang won here) - Launched KernelBench-Mega and updated Kernelbench-Hard with h100 and b200 sweeps - Took care of boring business stuff - Did some training sweeps for specialized technical vocab audio model - Some bugs with putting kernelbench-mega and hard on cloud instances so had to do some reruns. Learned a lot though - Setting up my own local rl infra and profiling concurrency 128 rollouts with vllm. Became clear to me that I need to serve in nvfp4, use MoE only for throughput, reap so training doesn’t OOM, dig into the vllm kernel graph itself to not underutilize my hardware from poor flashinfer/cutlass selections for my rtx pro 6000 sm120 architecture - Might do online distillation from glm 5.2 but for now taking it one step at a time - Slept for a bit then woke up and showered - Fixed an issue with SGLang tensor parallel deadlock on GLM 5.2 architecture with MTP enabled - GLM 5.2 inference is 2-3x faster than coding plans and running on amd boxes - Spent time with family - Hung out with some friends - Recorded some yoctogpt lectures with the revamped notebook (high taste btw) - Setting up dflash training for GLM 5.2


Elliot Arledge

45,447 views • 1 month ago

"Mac Minis for example are a very good fit" - Andrej Karpathy

Andrej Karpathy shouted out my work on EXO Labs in his keynote at Y Combinator AI SUS! Here's the breakdown:

Right now most AI workloads run in the cloud where requests from different users are continuously batched together. These workloads are FLOPS-bound and favor hardware with the best unit economics of $ per FLOP, i.e. enterprise GPUs.

The personal computing revolution will shift these workloads to personal devices with lower batch sizes (mostly batch_size=1). batch_size=1 inference is memory-bound, because all of the model parameters need to be loaded into the GPU every time a token is generated.

Apple Silicon with its Unified Memory architecture has a lot of memory and memory bandwidth per $ compared to other hardware:
- M4 Pro Mac Mini, 24GB @ 273GB/s, $58.33/GB, $5.13/GB/s
- H100, 80GB @ 3350GB/s, $625/GB, $14.93/GB/s

The unit economics of Apple Silicon are becoming more compelling with every release of the Mac. The future of AI inference looks more like open weights models (OpenAI open weights model soonTM) run at low batch_size on personal devices.

Alex Cheema

103,566 views • 1 year ago
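The memory-bound argument in the post above can be sketched as a bandwidth roofline: at batch_size=1 every weight is read once per token, so tokens/s cannot exceed bandwidth divided by model size. The device prices below are assumptions, back-computed from the post's $/GB figures ($58.33 × 24 GB ≈ $1,400 and $625 × 80 GB = $50,000). The 12 GB model size in the usage example is purely hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    memory_gb: float
    bandwidth_gb_s: float
    price_usd: float  # assumption: inferred from the post's $/GB numbers

    def usd_per_gb(self) -> float:
        return self.price_usd / self.memory_gb

    def usd_per_gb_s(self) -> float:
        return self.price_usd / self.bandwidth_gb_s

    def max_tokens_per_s(self, model_gb: float) -> float:
        """Upper bound for a dense model at batch size 1.

        Ignores KV-cache reads, compute time, and kernel overheads, so real
        throughput will be lower.
        """
        return self.bandwidth_gb_s / model_gb


mac_mini = Device("M4 Pro Mac Mini", 24, 273, 1_400)
h100 = Device("H100", 80, 3_350, 50_000)

if __name__ == "__main__":
    model_gb = 12.0  # hypothetical model footprint, for illustration only
    for dev in (mac_mini, h100):
        print(
            f"{dev.name}: ${dev.usd_per_gb():.2f}/GB, "
            f"${dev.usd_per_gb_s():.2f}/GB/s, "
            f"<= {dev.max_tokens_per_s(model_gb):.1f} tok/s"
        )
```

The H100 still has the higher absolute ceiling. The post's point is that the Mac Mini pays less per GB and per GB/s.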

Researchers found a way to make LLMs 8.5x faster! (without compromising accuracy) Speculative decoding is quite an effective way to address the single-token bottleneck in traditional LLM inference. A small "draft" model first generates the next several tokens, then the large model verifies all of them at once in a single forward pass. If a token at any position is wrong, you keep everything before it and restart from there. This never does worse than normal decoding. But current drafters in Speculative decoding still guess one token at a time. That makes the drafting step itself a bottleneck, capping real-world speedups at 2-3x. DFlash is a new technique that swaps the autoregressive drafter with a lightweight block diffusion model that guesses all tokens in one parallel shot. Drafting cost stays flat no matter how many tokens you speculate. On top of that, the drafter is conditioned on hidden features pulled from multiple layers of the target model and injected into every draft layer, so it makes significantly better guesses than a drafter working from scratch. In the side-by-side demo below, vanilla decoding runs at 48.5 tokens/sec. DFlash hits 415 tokens/sec on the same model, with zero quality loss. It's already integrated with vLLM, SGLang, and Transformers, with draft models on HuggingFace for several models like Qwen3, Qwen3.5, Llama 3.1, Kimi-K2.5, gpt-oss, and many more. I have shared the GitHub repo in the replies! KV caching is another must-know technique to boost LLM inference. I recently wrote an article about it. Read it below. 👉 Over to you: What use case are you working on that can benefit from this new technique?


Avi Chawla

157,390 views • 2 months ago
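The draft-then-verify loop described in the post above can be sketched with toy models. This is plain greedy speculative decoding with an autoregressive drafter, not DFlash. The "models" are hypothetical deterministic functions, and verification calls the target once per position, whereas a real engine scores all positions in one batched forward pass. On a mismatch this sketch also keeps the target's own token at that position, a common refinement of "restart from there".

```python
from typing import Callable, List

Model = Callable[[List[int]], int]  # maps a token prefix to the greedy next token


def speculative_step(target: Model, draft: Model, prefix: List[int], k: int) -> List[int]:
    """Draft k tokens, verify against the target, return the tokens to append."""
    # 1. The draft model proposes k tokens one after another.
    proposed: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The target checks each proposed position in order.
    accepted: List[int] = []
    ctx = list(prefix)
    for tok in proposed:
        expected = target(ctx)
        if tok != expected:
            # First mismatch: keep everything before it plus the target's token.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. Every draft token matched, so the target's next token comes for free.
    accepted.append(target(ctx))
    return accepted


def generate(target: Model, draft: Model, prompt: List[int], n_tokens: int, k: int = 4) -> List[int]:
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(target, draft, out, k))
    return out[: len(prompt) + n_tokens]


def greedy(model: Model, prompt: List[int], n_tokens: int) -> List[int]:
    """Plain one-token-at-a-time decoding, used as the reference."""
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(model(out))
    return out


def toy_target(prefix: List[int]) -> int:
    return (sum(prefix) * 31 + len(prefix)) % 50


def toy_draft(prefix: List[int]) -> int:
    # Agrees with the target except when the prefix length is a multiple of 5.
    tok = toy_target(prefix)
    return (tok + 1) % 50 if len(prefix) % 5 == 0 else tok
```

Every appended token is one the target would have picked itself, so the output matches plain greedy decoding exactly. Each step yields between 1 and k+1 tokens.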

🎉 Congrats to Fish Audio on launching Fish Audio S2, a frontier TTS model with fine-grained prosody & emotion control via natural-language inline tags. SGLang Day-0 support is now live! 🏆 Best WER on Seed-TTS Eval; 81.88% win rate on EmergentTTS-Eval 🎙️ Voice cloning with 86.4% prefix-cache hit rate via RadixAttention ⚡️ RTF 0.34, 63.3 tok/s on single H200 (single batch) 🌍 Trained on 10M+ hours of audio across ~100 languages, GRPO-aligned 🔧 Dual-AR (Slow + Fast AR) is LLM-isomorphic: continuous batching, paged KV cache & CUDA graphs inherited natively 🗣️ Native multi-speaker: turn-taking, interruptions & cross-speaker emotion in a single pass 👉Cookbook: 👉Blog: 🎬 Curious how to run with SGLang? Check out this voice cloning demo from Chayenne Zhao with Fishaudio-S2-Pro:


LMSYS Org

39,939 views • 4 months ago

SGLang now supports DSpark, enabling confidence-driven, variable-length verification for speculative decoding 🎉 DSpark addresses a key bottleneck under load: instead of verifying every draft token, it verifies only where the draft model is confident, so the gains hold even as batch size scales. We heavily optimized variable-length verification in SGLang. Across batch sizes 1 to 256, DSpark gives the best throughput/latency tradeoff on DeepSeek-V4-Flash, ahead of both MTP and non-spec. At high concurrency, dynamic scheduling provides up to ~20% higher throughput compared to a fixed budget, while maintaining high verification quality across workloads. With fused kernels and zero-overhead scheduling, DeepSeek-V4-Pro reaches 383.7 tok/s at B=1 on B300. DSpark is now available in SGLang with support for Qwen3 and DeepSeek-V4. Thanks DeepSeek for open-sourcing! Blog with full technical details and commands to run below 👇


LMSYS Org

168,185 views • 23 days ago

GPU tradeoff series: A100 is not much more powerful than 4090 🫠

GPU Perf and Price:
- 4090: 330 fp16 TFLOPs, $1,749
- A100 (80GB): 312 fp16 TFLOPs, $20,000
> A100 is 11.4X more pricy

Training speed for GPT-2(124M) with llm.c:
- 4090: 153K tokens/s
- A100 (80GB): 195K tokens/s
> A100 is only 1.3X faster
(both trained using a single card, A100 llm.c training is shown in the video, 4090 video is in the quoted tweet)

Conclusion: 4090 has a much better cost vs performance ratio

Why: As in the H100 vs. 4090 comparison, the biggest difference between A100 and 4090 is their GPU memory size/bandwidth and cross-GPU communication bandwidth, which does not matter too much if your model can fit into a single 4090.

Specs:
4090:
- GPU memory size: 24GB
- memory bandwidth: 1 TB/s
- communication bandwidth: 64 GB/s
A100:
- GPU memory size: 80GB
- memory bandwidth: 2 TB/s
- communication bandwidth: 900 GB/s

Nvidia killed off NVLink (a high-speed communication link that connects GPUs) on 4090. (Jensen Huang smiling face)

If multiple 4090s could be interconnected via NVLink, their performance would be closer to datacenter-grade A100 GPUs, even for training larger models. Additionally, 4090 isn't allowed in datacenters, that's how Nvidia makes 💰💰💰


Yuchen Jin

234,951 views • 1 year ago
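The cost vs performance claim in the post above is simple arithmetic on the quoted prices and llm.c training speeds. A sketch, with illustrative variable names:

```python
# Price and GPT-2 (124M) llm.c training throughput, as quoted in the post.
GPUS = {
    "4090": {"price_usd": 1_749, "train_tok_s": 153_000},
    "A100 80GB": {"price_usd": 20_000, "train_tok_s": 195_000},
}

price_ratio = GPUS["A100 80GB"]["price_usd"] / GPUS["4090"]["price_usd"]
speed_ratio = GPUS["A100 80GB"]["train_tok_s"] / GPUS["4090"]["train_tok_s"]

# Training tokens per second bought by each dollar of hardware.
tok_s_per_usd = {name: g["train_tok_s"] / g["price_usd"] for name, g in GPUS.items()}
value_ratio = tok_s_per_usd["4090"] / tok_s_per_usd["A100 80GB"]

if __name__ == "__main__":
    print(f"A100 costs {price_ratio:.1f}x more and trains {speed_ratio:.2f}x faster")
    print(f"4090 delivers {value_ratio:.1f}x more tok/s per dollar")
```

On these numbers the 4090 gives roughly nine times more training throughput per dollar of purchase price, which only holds while the model fits on a single card, as the post notes.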

$AMD $NVDA & the AMD Bear SemiAnalysis 🧵 Here are some facts: $META allocated 42% AI GPUs to $AMD OpenAI allocated 6GW(38%) to $AMD 1. Model-Specific Bias: Llama 3.3 70B graph favored NVIDIA due to TRT-LLM optimizations, highlighting throughput and latency where Blackwell excels. In contrast, the GPT-OSS 120B chart shifts focus to cost and interactivity, where MI355X shines. This selective model choice clearly suggests SemiAnalysis tailors benchmarks to reinforce narratives—NVIDIA’s dominance in speed (Llama 3.3) and AMD’s niche in cost (GPT-OSS). GPT-OSS 120B, with its sparse attention mechanisms (similar to DeepSeek-V3.2-Exp), shows AMD’s CDNA 4 architecture, while Llama 3.3’s dense attention favors NVIDIA’s Tensor Cores. SemiAnalysis’ decision to emphasize Llama 3.3 initially could reflect its AMD bear stance. 2. The way Data is presented The Llama 3.3 graph focused on raw performance metrics (throughput vs. latency), downplaying cost, where AMD holds an edge. This new chart, buried in follow-up posts, reveals AMD’s strength but receives less prominence, suggesting a curated narrative. Labeling variability (e.g., B200 with/without TRT) and the lack of uniform scaling across graphs indicate potential cherry-picking of configurations to favor NVIDIA’s optimized setups. 3. Historical Context: SemiAnalysis’ past critiques of AMD’s R&D and ROCm (web results from May 2025) align with a bearish outlook. Their own hype/brand around NVIDIA’s 15x ROI contrasts with muted coverage of AMD’s cost advantages, reinforcing bias. Despite AMD’s participation in InferenceMAX, the benchmark’s framing (e.g., prioritizing Blackwell’s ROI) reflect SemiAnalysis’ market predictions rather than balanced analysis. Lastly, AMD’s Instinct MI355X proves superior in inference and cost per million tokens for the GPT-OSS 120B model, offering a 25% cost advantage over NVIDIA’s H200 at moderate-to-high interactivity levels. 
This efficiency, driven by AMD’s memory bandwidth and FP4 support, makes it a better choice for cost-sensitive, multi-user deployments over a three-year horizon. However, SemiAnalysis’ sole focus (presentation graph) on Llama 3.3, where NVIDIA excels, demonstrates a pattern of cherry-picking models and data to favor NVIDIA, consistent with its historically bearish stance on AMD. This selective presentation risks misleading stakeholders by overshadowing AMD's economic strengths. My personal take: I would trust Dr. Lisa Su's, Greg Brockman's, and Sam Altman's take on AMD, and how they viewed and allocated 6GW for AMD, over SemiAnalysis. At the end of the day, large customers pay when it works. $META allocated 42% of its AI GPUs to $AMD for a reason. And the "secret weapon" will improve energy consumption by 20-50%, meaning at 6GW, OpenAI would be able to deploy 25-50% more MI450 at a much better cost advantage, with higher memory bandwidth, and the queen of Inference! Oh and ROCm 8 is expected to be on par with CUDA in 2026.

Mike

104,194 views • 9 months ago

six months ago this wasn't happening on 8gb vram. running unsloth's Q4_K_XL quant of gemma 4 26b-a4b-it-qat, a sparse MoE model with only 4b active params on a single rtx 4060 laptop gpu, 8gb vram, 20+ tok/s decode. no cloud, no api, no offload hacks. just a gaming laptop on battery. what makes it fit: google's QAT (quantization aware training), plus MTP (multi token prediction) support in the latest llama.cpp builds. that combo is the single biggest unlock for local inference on low vram. rtx 3060, rtx 3070, gtx 1070, gtx 1080, rtx 4050, rtx 4060, rtx 5050, rtx 5060 — any 6-8gb consumer gpu, old or new — this model runs on it. world cup season, so i told it to build a soccer themed flappy bird clone. one shot, zero iteration, fully playable. six months ago an 8gb model could barely clone vanilla flappy bird. now it's shipping a themed game from a sparse MoE model running locally on a laptop battery. inference benchmarks: - decode throughput: 30 tok/s - context: 64k. this is the real unlock. 64k ctx is what makes a hermes agent loop viable locally on this model, not just single-turn chat. llama.cpp flags: -m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf -c 64000 -cmoe --port 8080 game's deployed on my own site, built and shipped end to end with open source llm, zero closed source api dependency in the pipeline. link in the description. gguf weights on huggingface, link in the comments. pull it down, run it on whatever 8gb card is sitting in your rig. try the game and tell me your score and what you want in v2. local llms on consumer gpus stopped being a meme.


Alok

60,866 views • 1 month ago

day 2 of building a self-driving power wheel

today i officially trained a self driving model from scratch and deployed it on the car

by just simply brute forcing everything, I:
> made a remote tele-op and remote data collection app built on LiveKit infra
> feat: 60ms e2e latency between the car and inference compute (car and compute in vietnam with singapore sfu)
> feat: data is collected on operator side, baking latency into observation space itself (I expect this made the model more robust against latency)
> recorded 30 min of data at 30fps and converted the dataset to lerobot (you can check a sample here)
> trained a simple ACT model (3 epoch, batch size 8) to drive the car around my house
> deployed the model on the car with remote inference

the video explains everything shortly

reflection:
> the model is ofc bad, idt behavior cloning would work at all for such complex task on such small sample size
> it did work in some cases where the observation is well within distribution, even generalizes to back the car when it gets stuck

up next:
> will hack alpamayo (NVIDIA) or comma's e2e to somehow fit this
> or train with a llm backbone or a locomotion prior to see if it generalizes

Binh Pham

19,404 views • 4 months ago

$AMD $5 Trillion MC Is Inevitable Long Term👑 This thread will focus more on Inference!

2026 EPYC "Venice" on $TSM 2nm is set to cut large GW-scale inference cost by a further 40% vs. the prior Turin gen.

Context: EPYC Turin achieves ~$0.001 per million tokens for batch inference vs $0.02-$0.12/million tokens, as I wrote in the thread below. Venice is going to lower cost down to $0.0005-$0.0006/million tokens. OpenAI spent roughly $20B on Inference and Training, where 80-90% of that was for Inference per Analysts. AKA Renting Compute is Expensive AF!

In this thread, I want to focus on why most analysts and investors are underestimating the role of EPYC "Venice" and future gens in overall data center revenue. And $TSM ramping up 2nm supply early is a confirmation that AMD will be a major buyer long term. I will also link the Gap between AMD Analysts & Reality thread and the 2nm Ramp thread so you have a more comprehensive view of what I'm writing here.

Before I go into detail, this is my 2026 Projection:
- AI GPUs: $35-$50B
- EPYC Data Center: $15B-$17B
- Client Segment: $12-$13B
- Gaming: $6B
- Embedded: $4B-$5B
- Total Revenue: $70-$100B
- Non-GAAP net income: $18B-$25B
- Non-GAAP EPS: $10.97-$15.40
- Forward P/E 55x-70x = $603-$1,078

AMD's Analysts are projecting $0 Revenue for MI450 and sluggish EPYC Growth. Meaning, all analysts are either full of 💩 or Sexist, you decide! Analysts are also projecting 0% growth on AMD's "Secret Weapon" Chip even as $MSFT said we are at a significant Windows refresh and upgrade cycle. Do you think TSMC would allocate more 2nm supply to $AMD at $0 MI450 revenue and sluggish EPYC?

1. EPYC is going to be the leader in lowest inference cost!

Current Turin cost saving is 95% vs $NVDA, or 98-99% on inference cost when you factor in renting inference compute from Amazon Web Services, Microsoft Azure, or $NVDA Neocloud pets. TSMC claimed for 2nm: 10-15% higher performance at iso-power, 25-30% lower power at iso-speed, and ~15% higher transistor density compared to 3nm. This reduces operational expenses (energy, cooling) while increasing throughput per chip.

EPYC Turin achieves ~$0.001 per million tokens for batch inference (via vLLM on models like Llama 3 70B), driven by high core counts and low hardware costs. EPYC Venice offers ~1.7x overall performance and up to 70% more compute capability per core, with up to 256 cores (512 threads). Enhanced vector/AI instructions and open-source firmware (openSIL) optimize for inference workloads. AMD incorporates AI Engines (now part of AMD's XDNA) for on-chip acceleration, improving efficiency for low-latency and edge inference. This reduces reliance on discrete GPUs, lowering system complexity and TCO.

Venice SKUs are projected at $3,000-$15,000 ($5,000 for 256-core flagship), far below NVIDIA Rubin ($50,000-$90,000) or AMD's own MI450 GPUs ($40,000-$50,000). High memory bandwidth (up to 1.6 TB/s) supports efficient batch inference. Venice is designed exactly for large customers that want to lower inference cost, and MI450 Helios is for customers that want training at the lowest TCO and TDP, as well as lower upfront cost at 1GW scale (full build $35-$40B vs $NVDA $55B-$80B).

2. Real World Example:

OpenAI's 2025 inference spend reached ~$20B, escalating to even higher total compute rental (mostly inference) amid token volume growth (from video generation). By 2026, with usage doubling (consistent with industry trends: token demand grows 2-5x YoY), assume OpenAI processes ~1,800 billion million-tokens annually:
- $NVDA Blackwell at $0.02-$0.12/million tokens is $36B (most optimized)
- Rubin is projected to be at $0.01/million tokens, or $18B annual inference cost
- $AMD Venice at $0.0005/million tokens, or $0.9B annual inference cost

=> Massive saving for OpenAI or anyone paying 80-90% of their annual bill for inference compute. In short, it is unsustainable for all current AI players to pay this much rent vs owning over the medium to long term.

Rubin excels in low-latency decode (if Groq integration from $20B deal in 2027-2028), but Venice dominates batch (80% of inference by 2030). Actual savings depend on deployment scale (OpenAI's 6GW AMD plans), electricity rates, and software maturity. If Rubin only hits $0.03, savings swell to $53.1B vs. $17.1B.

3. Will running Inference on Venice and future Gen slow down response generation in 2026 and beyond?

Human perception of "fast enough" for chat, agents, search augmentation, summarization, coding assistance is roughly Meaning, EPYC may generate $100B a year on data center revenue. Hence $MSFT $AMZN $META $GOOGL OpenAI xAI and 42+ Countries are leaning AMD for Inference, because the cost saving is MASSIVE!

4. Regular users (you, me, people using ChatGPT, Claude, Gemini, Grok, Perplexity...) are extremely unlikely to notice any slowdown and in many cases might even experience slightly faster or more consistent response times if the industry heavily shifts toward AMD EPYC for inference.

What actually happens when companies save massively on inference? When OpenAI, Anthropic, Gemini, Grok, Meta ... save billions on the batch/enterprise/RAG layer using EPYC Venice, they typically do one or more of these things with the savings, none of which make your chat slower, while enhancing their bottom line (profit):
~Keep prices the same → make more profit
~Lower subscription prices / increase free tier limits
~Train bigger & better models more frequently
~Offer longer context windows
~Add more reasoning steps / tool calls / agents per query
~Improve multimodal capabilities
~Build more data centers / reduce throttling during peaks

In practice the consumer experience usually gets better, not worse, when inference becomes dramatically cheaper. A prime example is $META leaning on AMD heavily (currently AMD's largest customer), or Grok 2 to Grok 3, which heavily used AMD for inference savings. And most Grok users reported Grok responses were snappier, not slower.

5. What does this mean for potential Revenue?

Note that TSMC is massively ramping 2nm supply for $AMD, both MI450 and EPYC.

EPYC Conservative projection:
- FY2025: $10.5B (best Est)
- FY2026: $16B
- FY2027: $29B
- FY2028: $49B
- FY2029: $75B
- FY2030: $100B

Large customers: $META OpenAI $MSFT $AMZN $GOOGL xAI (Apple?)
Smaller customers: $DELL $HPE $SMCI and 42+ other countries.

The roadmap to $5 Trillion is very much inevitable as inference cost from renting or owning $NVDA is too high. $NVDA will still dominate training market share, where MI families are likely to take 15-20% market share, but the TAM is also expanding rapidly. Most Institutions are projecting $2-$3 Trillion TAM by 2030. $NVDA said $4 Trillion. Dr. Lisa Su said $1 Trillion+ by 2030. So you decide on how much TAM.

If you enjoy this kind of analysis, Slap the Like/Repost and Bookmark to please the X Algo as it is Free.99! If you want to support my work further, consider subscribing to see more in-depth analysis! Alright, that is it. Not Financial Advice!
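The thread's headline figures follow from simple multiplication, which the short sketch below reproduces. It takes the thread's own inputs at face value (1,800 billion million-token units per year, the quoted per-million-token prices, and the EPS and P/E ranges) and does not validate any of them. The names `annual_cost_billions` and `implied_price` are hypothetical.

```python
# Volume assumed in the thread: ~1,800 billion million-token units per year.
ANNUAL_UNITS = 1_800e9

# Per-million-token prices quoted in the thread (USD); these are the author's projections.
PRICES = {
    "Blackwell (low end)": 0.02,
    "Rubin (projected)": 0.01,
    "Rubin (pessimistic)": 0.03,
    "Venice (projected)": 0.0005,
}


def annual_cost_billions(price_per_unit: float, units: float = ANNUAL_UNITS) -> float:
    """Annual inference bill in billions of USD for a given price per million tokens."""
    return price_per_unit * units / 1e9


def implied_price(eps: float, pe: float) -> float:
    """Share price implied by earnings per share times a P/E multiple."""
    return eps * pe


costs = {name: annual_cost_billions(p) for name, p in PRICES.items()}
for name, cost in costs.items():
    print(f"{name}: ${cost:.1f}B per year")

venice = costs["Venice (projected)"]
print(f"savings vs Rubin at $0.01: ${costs['Rubin (projected)'] - venice:.1f}B")
print(f"savings vs Rubin at $0.03: ${costs['Rubin (pessimistic)'] - venice:.1f}B")
print(f"implied share price: ${implied_price(10.97, 55):.0f} to ${implied_price(15.40, 70):.0f}")
```

Run as written, this gives the $36B, $18B, and $0.9B annual bills, the $17.1B and $53.1B savings, and the $603 to $1,078 price range stated in the thread. The conclusions are therefore only as sound as the assumed token volume and per-token prices.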

Mike

102,223 views • 6 months ago