Wide Expert Parallelism increases the total memory bandwidth available per MoE deployment. This means the model distributes the MoE expert weights across multiple GPUs, so each GPU only needs to load a tiny fraction of the weights. This translates to higher throughput per GPU, increasing perf per dollar and...

SemiAnalysis

111,566 subscribers

30,002 views • 7 days ago • via X (Twitter)
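
The claim in the SemiAnalysis post above can be made concrete with a little arithmetic. The sketch below is only an illustration under stated assumptions: the expert count, expert size, GPU counts, and per-GPU bandwidth are hypothetical numbers that do not come from the post, and the model ignores all-to-all communication and load imbalance between experts.

```python
import math
from dataclasses import dataclass


@dataclass
class MoEDeployment:
    """Toy model of expert parallelism; every number fed in is hypothetical."""
    num_experts: int         # experts per MoE layer
    bytes_per_expert: float  # weight bytes of one expert
    num_gpus: int            # expert-parallel degree
    gpu_bandwidth: float     # memory bandwidth of one GPU, bytes/s

    def experts_per_gpu(self) -> int:
        # Experts are split as evenly as possible across GPUs.
        return math.ceil(self.num_experts / self.num_gpus)

    def expert_bytes_per_gpu(self) -> float:
        # Expert weights each GPU holds and, at worst, has to stream.
        return self.experts_per_gpu() * self.bytes_per_expert

    def aggregate_bandwidth(self) -> float:
        # Total memory bandwidth available to the whole deployment.
        return self.num_gpus * self.gpu_bandwidth


# Hypothetical configurations: same model, narrow vs wide expert parallelism.
narrow = MoEDeployment(num_experts=256, bytes_per_expert=50e6, num_gpus=8, gpu_bandwidth=3e12)
wide = MoEDeployment(num_experts=256, bytes_per_expert=50e6, num_gpus=64, gpu_bandwidth=3e12)

for name, d in (("narrow", narrow), ("wide", wide)):
    print(name, d.experts_per_gpu(), d.expert_bytes_per_gpu(), d.aggregate_bandwidth())
```

Going from 8 to 64 GPUs in this toy setup cuts the expert weights per GPU by 8x and raises the aggregate bandwidth by the same factor, which is the effect the post describes.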

Related videos

Sparsely activated models like MoEs and Apple silicon + MLX are a great match.
- Lots of RAM to hold the entire model in memory (not just the active parameters). For an MoE, each token accesses basically a random subset of the model. Swapping large parts of the model to "disk" from token to token is too slow.
- Comparatively, you don't need as much memory bandwidth. Only a small fraction of the weights is used per token. In the case of DeepSeek v3, 37B of 671B parameters are active, so only ~5% of the weights are moved to GPU cache / registers for each token.
(SVG animation made with the help of DeepSeek V2 1210 + MLX on an M2 Ultra)

Awni Hannun

27,452 views • 1 year ago
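
The "~5%" figure in the post above follows directly from the 37B / 671B parameter counts it quotes. The sketch below reproduces that ratio and adds a rough bandwidth-bound estimate of decode speed. The 4-bit weight format and the 800 GB/s bandwidth are hypothetical inputs chosen for illustration, not values taken from the post, and the estimate ignores KV-cache traffic and compute time.

```python
def active_fraction(active_params: float, total_params: float) -> float:
    """Share of the weights touched per token."""
    return active_params / total_params


def bytes_per_token(active_params: float, bits_per_weight: float) -> float:
    """Weight bytes that must be read to produce one token."""
    return active_params * bits_per_weight / 8


def decode_tps_upper_bound(bandwidth_bytes_per_s: float,
                           active_params: float,
                           bits_per_weight: float) -> float:
    """Bandwidth-bound ceiling on tokens/s (weights only)."""
    return bandwidth_bytes_per_s / bytes_per_token(active_params, bits_per_weight)


# Parameter counts from the post; the remaining inputs are assumptions.
ACTIVE, TOTAL = 37e9, 671e9
ASSUMED_BITS = 4            # hypothetical quantization
ASSUMED_BANDWIDTH = 800e9   # hypothetical memory bandwidth in bytes/s

print(f"active fraction: {active_fraction(ACTIVE, TOTAL):.1%}")
print(f"bytes per token: {bytes_per_token(ACTIVE, ASSUMED_BITS):.3g}")
print(f"tps ceiling:     {decode_tps_upper_bound(ASSUMED_BANDWIDTH, ACTIVE, ASSUMED_BITS):.1f}")
```

The same functions show why the whole model still has to fit in RAM: the per-token read is small, but which 37B parameters are read changes from token to token.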

MTP speeds up Qwen by 2.5x in Atomic Chat
Dense vs MoE models on 2x RTX 5090:
Qwen3.6 27B: 51 → 117 tps (+137%)
Qwen3.6 35B-A3B: 218 → 267 tps (+25%)
MTP drafts several tokens ahead and verifies them in one pass. The speedup depends on memory moved per pass. Dense 27B reads all 27B params per token, while MoE 35B-A3B reads only the 3B active ones. Dense had way more to save by batching. The baseline tps also differ (218 vs 51) for the same reason, seen from the other side: token generation is memory-bandwidth bound, and MoE moves ~8x less memory per token, so its baseline is already 4x ahead.
~80% draft acceptance. Zero accuracy loss. ~1 GB extra VRAM.
Open-source code and local AI app – in the comments 👇

atomic.chat

169,430 views • 1 month ago
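
The ratios in the atomic.chat post above can be recomputed from the tokens-per-second figures it quotes. The sketch below does only that. Recomputed from these rounded figures, the gains come out somewhat below the quoted percentages; the post may have used unrounded measurements, but that is an assumption, not something it states.

```python
# Tokens per second quoted in the post: (baseline, with MTP).
RESULTS = {
    "Qwen3.6 27B (dense)": (51, 117),
    "Qwen3.6 35B-A3B (MoE)": (218, 267),
}


def speedup(before_tps: float, after_tps: float) -> float:
    """Throughput ratio after/before."""
    return after_tps / before_tps


def percent_gain(before_tps: float, after_tps: float) -> float:
    """Relative gain in percent."""
    return (after_tps / before_tps - 1) * 100


def weights_read_ratio(dense_params_b: float, moe_active_params_b: float) -> float:
    """How many times more weights the dense model reads per token."""
    return dense_params_b / moe_active_params_b


for name, (before, after) in RESULTS.items():
    print(f"{name}: x{speedup(before, after):.2f} ({percent_gain(before, after):+.1f}%)")

# Dense 27B vs 3B active parameters, and the ratio of the two baselines.
print(f"weights read per token: x{weights_read_ratio(27, 3):.1f}")
print(f"baseline tps ratio:     x{speedup(51, 218):.2f}")
```

The baseline ratio is smaller than the weights-read ratio, which fits the post's framing: decode speed tracks memory traffic, but weights are not the only thing moved per token.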

Native time-tracking in Notion This setup lets you track multiple sessions per task, and shows the total time you've worked across all of them.

Thomas Frank

12,475 views • 1 year ago

AN AWS ENGINEER QUIETLY BUILT A 2 PETABYTE HOME SERVER FOR $9/MONTH THAT KILLS A $3,400/MONTH CLOUD STORAGE BILL
the lenovo thinkstation pgx ships nvidia's gb10 grace blackwell superchip and 128gb of unified memory in a box the size of a mac mini at 1.2kg
it runs an 80b qwen3 coder model at 25 to 40 tokens per second and a 196b step-3.5-flash moe model at 20 tokens per second locally
the gb10 packs 6,144 cuda cores, 192 fifth-generation tensor cores and rates at 1 petaflop of fp4 with sparsity from a single 240 watt usb-c power supply
fine tuning qwen 2.5 7b with lora took 18 minutes and 41gb of unified memory while the gpu pulled 65 watts and peaked at 77 degrees
the box pulls a docker container from nvidia's registry and serves a frontier model on your local network with tool calling and zero data leaving your desk
bookmark this and read the article below

starmex

190,123 views • 10 days ago

MoE-GS: Mixture of Experts for Dynamic Gaussian Splatting
Contributions:
• MoE-GS: the first dynamic Gaussian splatting framework employing a Mixture-of-Experts architecture, enabling robust and adaptive reconstruction across diverse dynamic scenes.
• A novel Volume-aware Pixel Router integrates expert outputs through differentiable weight splatting, achieving spatially and temporally coherent adaptive blending.
• Efficiency of MoE-GS is improved through single-pass multi-expert rendering and gate-aware Gaussian pruning. A separate knowledge distillation strategy trains individual experts with pseudo-labels from the MoE model, enhancing quality without modifying the architecture.

MrNeRF

10,346 views • 8 months ago

leeknow acting cute in front of their juniors as per han's request (again) 😭 🐰 enjoy your drink, moe moe kkyu~

❀ tyne.

52,287 views • 1 month ago

this is the worst local AI will ever be. tomorrow it gets faster. next month the models get smarter. next year your GPU runs what a data center runs today. Qwen3.5-35B-A3B on a single 3090. told it to visualize its own expert routing. 256 experts, 8 active per token, rendered in 3D on the same GPU running inference. no API key. no subscription. no permission needed. closed AI isn't losing ground. it's losing the argument.

Sudo su

106,710 views • 3 months ago

Newborns are so tiny. Most full term infants weigh only 6-8 pounds at birth. But cherish (and document) these special moments while they last, because you won’t have a little baby for long. In fact, children’s growth and development in the early years is positively exponential. Over the course of a single year, infants transform from tiny and dependent to increasingly verbal, mobile toddlers, typically tripling their birth weights. So quickly is your baby growing, in fact, that between their second and sixth months many increase their height by up to a quarter inch PER WEEK (1 inch per month). You’ll only have a tiny baby for the blink of an eye. So make each moment count. This happy little guy (1 week) was shared to TT by tai.vieira.21.

Dan Wuori

42,919 views • 9 months ago

Jenna Sudds, alongside Justin Trudeau, announced a $9.1 million investment over 3 years for Newfoundland and Labrador. This is to pay for the Canadian Student Food Program. Some numbers. $9.1 million over 3 years. $3.03 million per year. 65,000 students in the province. $48 per student. 190 teaching / school days. Grand total is $0.25 per student per day. Wonder what kind of healthy meal each child is going to get for that?

sonofabench

168,193 views • 1 year ago

Per Nvidia CEO the next multi-trillion dollar industry is Ai Agents. $SEN | Sentio is the biggest disruptor this cycle powered by $GPU 🧠 The ecosystem is running on every major chain. NFA, bet more on the most advanced #AI #Agent deployer ecosystem ✍🏻

CryptoBit 🔔

46,094 views • 1 year ago

"Moe, Moe, Moe..." "When Bart's done, can we Moe to the Moevies for the Moetinee?" "Of course. All work and Moe play makes Moe a Moe Moe." "Moe Moe Moe Moe Moe?" "Moe Moe Moe." "Moe Moe Moe Moe Moe." "Moe Moe Moe Moe." "Moe."

ShinyMcShine: Simpsons Quotes

48,229 views • 1 year ago

Run Gemma 4 26B MoE on 8GB VRAM with 250k context at 20+ tokens/sec
If you own any 8GB VRAM graphics card, stop what you are doing. Local AI just had its absolute "Holy Shit" moment for budget hardware.
Yesterday, I benchmarked Unsloth Gemma 4 12B Q4_K_XL on an 8GB card. The community went wild but immediately demanded more: "Can we run a 25B+ model on budget GPUs?" Today, I’m delivering exactly that. I am running a massive 26B parameter Mixture of Experts (MoE) model locally on a standard 8GB VRAM setup with 250k full native context! If you own an RTX 3060, 3070, 4060, or any budget GPU with 8GB of VRAM, the local AI paradigm has completely changed.
The performance metrics are astonishing:
- 20 tokens/sec flat decode throughput.
- Stable, flat decode speed even with massive prompts.
- I threw a 60k token prompt at it, and it still clocked in at 20 TPS without dropping a single frame.
# What about prefill?
Yes, Time To First Token (TTFT) is slightly high when swallowing massive contexts. But with a solid 200 tokens/sec prefill speed, the wait is barely noticeable and highly usable. And this is running completely without Multi Token Prediction (MTP) active.
How is this possible? It’s the magic of Google's new QAT (Quantization Aware Training) quants for Gemma 4. The model weight file (unsloth gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf) is only 13.2 GB, making it the ultimate local powerhouse.
# The Test Setup:
CPU: Intel Core i7
RAM: 16GB System RAM
GPU: NVIDIA GeForce RTX 4060 Laptop GPU (8GB VRAM)
# The Secret Sauce (The -cmoe Flag)
To make this work properly on any 8GB card, you must use the -cmoe (CPU MoE) flag in llama.cpp. This flag keeps the heavy MoE expert weights in system memory (CPU/RAM) while letting your GPU focus strictly on the Attention layers and the KV Cache. It prevents VRAM spillage and holds the throughput rock solid.
# The flags:
-m "gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf" -cmoe -c 248000 -v
Once running, just open the UI on localhost and toggle the new reasoning lightbulb icon in the text input box to watch the model perform multi-step thinking.
Are you still running smaller models, or are you ready to scale up your budget local setups? Let's discuss in the replies

Alok

290,161 views • 18 days ago
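
The flag list in the post above can be assembled into a full command line. The sketch below is a minimal illustration, not the author's script: the post names llama.cpp and a UI on localhost but does not name the executable, so the `llama-server` binary name is an assumption, and `build_command` is a hypothetical helper. The flags themselves are copied from the post.

```python
import shlex


def build_command(model_path: str, context_tokens: int,
                  binary: str = "llama-server") -> list[str]:
    """Assemble the argument list from the flags quoted in the post.

    The default binary name is an assumption; the post only says llama.cpp.
    """
    return [
        binary,
        "-m", model_path,            # model weight file
        "-cmoe",                     # keep MoE expert weights in system RAM
        "-c", str(context_tokens),   # context size in tokens
        "-v",                        # verbose logging
    ]


cmd = build_command("gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf", 248000)
# Print a shell-safe form for copy and paste; nothing is executed here.
print(shlex.join(cmd))
```

Building the arguments as a list rather than a single string avoids quoting problems if the model path contains spaces.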

THE LOOKBOOK IN MOTION SUPER SUN ‘CORE’ Collection Available NOW at Note: Shipping starts 20 Jan 2026 Limited to 2 pieces per color, per size, per product, per order #SUPERSUN #SSCORE

SUPER SUN

46,346 views • 5 months ago

I just looked this up and it’s 100% true “Canned mushrooms contain 20 maggots per can. The FDA allows this. They allow 20 per can per 100 grams” The FDA in America is completely paid off and useless ‘The FDA’s Compliance Policy Guide allows canned mushrooms to contain an average of up to 20 maggots of any size per 100 grams of drained mushrooms and proportionate liquid, or an average of 5 or more maggots that are 2 mm or longer per 100 grams’

Wall Street Apes

1,338,181 views • 1 year ago

If I had to bet:
3 hours per week in the gym = 60-70% of max gains
4 hours per week in the gym = 80% of max gains
5 hours per week in the gym = 90% of max gains
6 hours per week in the gym = 100% of max gains
NOTE: This counts all rest/warm up time (in the door/out the door)

Dean Turner

257,454 views • 5 months ago

The #Dolphins and RB De’Von Achane have agreed to a 4-year, $68M extension, per multiple reports.

FOX Sports: NFL

48,745 views • 1 month ago

"Pakistan is the only country that has good relations with US, Russia & China. Muslim world is giving great importance to PAK" So as per Arfa & her expert, Beggar nation Pakistan is an emerging power!! Imagine saying this without laughing 😂

BALA

301,300 views • 2 months ago