Loading video...

Video Failed to Load

There was a problem loading this video. This could be due to a temporary network issue or the video might be unavailable.

GOOGLE JUST MADE EVERY CHATBOT FEEL SLOW Diffusion Gemma 26b doesn’t predict word by word, it generates 256 tokens in parallel using bi-directional attention, like stable diffusion but for language it’s MoE so only 3.8B params activate during inference, fits on a single RTX 4090 with 18GB VRAM and... show more

leopardracer

10,035 subscribers

27,080 views • 19 days ago •via X (Twitter)

Anya Rossi• Live Now

Private livecam show

0 Comments

No comments available

Comments from the original post will appear here

Related Videos

Gemma 4 Diffusion landed in vLLM last week. Day 0. First diffusion LLM natively supported in vLLM. Instead of one token at a time, it predicts 256 tokens at once and iteratively denoises them in parallel. Result: 1,000+ tokens per second at batch size 1 on a single H100. Built on Model Runner V2. Google Gemma

Gemma 4 Diffusion landed in vLLM last week. Day 0. First diffusion LLM natively supported in vLLM. Instead of one token at a time, it predicts 256 tokens at once and iteratively denoises them in parallel. Result: 1,000+ tokens per second at batch size 1 on a single H100. Built on Model Runner V2. Google Gemma

Red Hat AI

17,524 views • 15 days ago

New Google Gemma 4 12B claims near-26B performance - we tested both! We ran both models locally on one RTX 4090 and gave each the same task: write a self-contained HTML5 canvas animation with real physics in one file without libraries. Three scenes - a Galton board, two blocks colliding off a wall, and a chaotic triple pendulum Outputs: Gemma 4 26B-A4B: 15 GB VRAM usage, 6.9k tokens, 138 tok/s Gemma 4 12B: 9 GB VRAM usage, 8.9k tokens, 80 tok/s Same Gemma 4 family, but the 26B-A4B won every scene and ran ~1.7x faster - on just 4B active params. The 12B stayed very close though, on almost half the VRAM - which makes it the ideal model for a 16 GB laptop

New Google Gemma 4 12B claims near-26B performance - we tested both! We ran both models locally on one RTX 4090 and gave each the same task: write a self-contained HTML5 canvas animation with real physics in one file without libraries. Three scenes - a Galton board, two blocks colliding off a wall, and a chaotic triple pendulum Outputs: Gemma 4 26B-A4B: 15 GB VRAM usage, 6.9k tokens, 138 tok/s Gemma 4 12B: 9 GB VRAM usage, 8.9k tokens, 80 tok/s Same Gemma 4 family, but the 26B-A4B won every scene and ran ~1.7x faster - on just 4B active params. The 12B stayed very close though, on almost half the VRAM - which makes it the ideal model for a 16 GB laptop

atomic.chat

151,448 views • 27 days ago

six months ago this wasn't happening on 8gb vram. running unsloth's Q4_K_XL quant of gemma 4 26b-a4b-it-qat, a sparse MoE model with only 4b active params on a single rtx 4060 laptop gpu, 8gb vram, 20+ tok/s decode. no cloud, no api, no offload hacks. just a gaming laptop on battery. what makes it fit: google's QAT (quantization aware training), plus MTP (multi token prediction) support in the latest llama.cpp builds. that combo is the single biggest unlock for local inference on low vram. rtx 3060, rtx 3070, gtx 1070, gtx 1080, rtx 4050, rtx 4060, rtx 5050, rtx 5060 — any 6-8gb consumer gpu, old or new — this model runs on it. world cup season, so i told it to build a soccer themed flappy bird clone. one shot, zero iteration, fully playable. six months ago an 8gb model could barely clone vanilla flappy bird. now it's shipping a themed game from a sparse MoE model running locally on a laptop battery. inference benchmarks: - decode throughput: 30 tok/s - context: 64k. this is the real unlock. 64k ctx is what makes a hermes agent loop viable locally on this model, not just single-turn chat. llama.cpp flags: -m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf -c 64000 -cmoe --port 8080 game's deployed on my own site, built and shipped end to end with open source llm, zero closed source api dependency in the pipeline. link in the description. gguf weights on huggingface, link in the comments. pull it down, run it on whatever 8gb card is sitting in your rig. try the game and tell me your score and what you want in v2. local llms on consumer gpus stopped being a meme.

six months ago this wasn't happening on 8gb vram. running unsloth's Q4_K_XL quant of gemma 4 26b-a4b-it-qat, a sparse MoE model with only 4b active params on a single rtx 4060 laptop gpu, 8gb vram, 20+ tok/s decode. no cloud, no api, no offload hacks. just a gaming laptop on battery. what makes it fit: google's QAT (quantization aware training), plus MTP (multi token prediction) support in the latest llama.cpp builds. that combo is the single biggest unlock for local inference on low vram. rtx 3060, rtx 3070, gtx 1070, gtx 1080, rtx 4050, rtx 4060, rtx 5050, rtx 5060 — any 6-8gb consumer gpu, old or new — this model runs on it. world cup season, so i told it to build a soccer themed flappy bird clone. one shot, zero iteration, fully playable. six months ago an 8gb model could barely clone vanilla flappy bird. now it's shipping a themed game from a sparse MoE model running locally on a laptop battery. inference benchmarks: - decode throughput: 30 tok/s - context: 64k. this is the real unlock. 64k ctx is what makes a hermes agent loop viable locally on this model, not just single-turn chat. llama.cpp flags: -m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf -c 64000 -cmoe --port 8080 game's deployed on my own site, built and shipped end to end with open source llm, zero closed source api dependency in the pipeline. link in the description. gguf weights on huggingface, link in the comments. pull it down, run it on whatever 8gb card is sitting in your rig. try the game and tell me your score and what you want in v2. local llms on consumer gpus stopped being a meme.

Alok

59,908 views • 9 days ago

Google isn’t betting on a single AI architecture. Sundar Pichai, CEO of Google: “We’re going to push the diffusion paradigm as hard as possible.” “All of today’s mainline Gemini models are autoregressive. Diffusion is a different paradigm.” “For the same capability, diffusion can be much faster.” “It’s behind the mainline models today, but there will be areas where it’s the right tool.” “We’re pushing multiple directions in parallel, and bringing them together where it makes sense.”

Google isn’t betting on a single AI architecture. Sundar Pichai, CEO of Google: “We’re going to push the diffusion paradigm as hard as possible.” “All of today’s mainline Gemini models are autoregressive. Diffusion is a different paradigm.” “For the same capability, diffusion can be much faster.” “It’s behind the mainline models today, but there will be areas where it’s the right tool.” “We’re pushing multiple directions in parallel, and bringing them together where it makes sense.”

Forward Future

152,887 views • 6 months ago

DiffusionGemma can now run at 2000+ tokens/sec! ⚡ We made local DiffusionGemma inference 1.8× faster. Run it on 18GB RAM via Unsloth Studio. GitHub: Guide:

DiffusionGemma can now run at 2000+ tokens/sec! ⚡ We made local DiffusionGemma inference 1.8× faster. Run it on 18GB RAM via Unsloth Studio. GitHub: Guide:

Unsloth AI

176,258 views • 18 days ago

you can like the man or not but taylor was right with every. single. word. in this song

you can like the man or not but taylor was right with every. single. word. in this song

ver.

122,829 views • 2 years ago

I typed every word in this video using only my thoughts. It was made for the Neuralink team.

I typed every word in this video using only my thoughts. It was made for the Neuralink team.

Jake Schneider

21,512 views • 6 months ago

We've partnered to bring more Gemma 3 quantized models to you! 🚀 We worked with Georgi Gerganov llama.cpp, LM Studio, MLX, ollama to make sure you can run it using your favorite tool! Gemma models optimized with QAT, reduce memory requirements while keeping quality! All models checkpoints are available on Hugging Face and Kaggle. 🤗 What does this mean? - Gemma 3 27B (int4): Fits on NVIDIA RTX 3090 (24GB VRAM) or similar. - Gemma 3 12B (int4): Only needs a NVIDIA RTX 4060 (8GB VRAM) or similar. - Gemma 3 4B, 1B (int4): Run anything with more than 2.5GB Memory. Want to see it in action? Video below shows how easy it is to get started using LMStudio:

We've partnered to bring more Gemma 3 quantized models to you! 🚀 We worked with Georgi Gerganov llama.cpp, LM Studio, MLX, ollama to make sure you can run it using your favorite tool! Gemma models optimized with QAT, reduce memory requirements while keeping quality! All models checkpoints are available on Hugging Face and Kaggle. 🤗 What does this mean? - Gemma 3 27B (int4): Fits on NVIDIA RTX 3090 (24GB VRAM) or similar. - Gemma 3 12B (int4): Only needs a NVIDIA RTX 4060 (8GB VRAM) or similar. - Gemma 3 4B, 1B (int4): Run anything with more than 2.5GB Memory. Want to see it in action? Video below shows how easy it is to get started using LMStudio:

Philipp Schmid

15,996 views • 1 year ago

Diffusion Gemma is 4x faster, but makes 6x more mistakes! We benchmarked the new diffusion LLM against its autoregressive twin on a single H100 (FP8). We gave each the same three tasks: write a Steve Jobs biography, the history of Tetris, and the story of BeOS - every next topic less popular than the previous one. Then we fact-checked every claim in every answer. Gemma4 got 45 facts right, 5 wrong. DiffusionGemma got 33 right, 28 wrong. The less popular the topic, the worse it got: 4 mistakes on Jobs, 12 on Tetris, 12 on BeOS. It named Clara Clley as Steve Jobs' mother, invented a colleague for Pajitnov named Geri Gulovik and priced the BeBox at $9,999. The real one cost $1,600. Outputs: Gemma4 26B A4B: 218 tok/s · 15.1s total · 45 facts · 5 mistakes DiffusionGemma 26B A4B: 763 tok/s · 3.7s total · 33 facts · 28 mistakes The reason is simple. DiffusionGemma throws 256 tokens on the screen at once and polishes them pass after pass until the text sounds smooth. Smooth is all it cares about: a fake name, date or number sounds just as smooth as a real one, so it stays. Regular Gemma4 meanwhile writes one word at a time and checks every new word against everything before it. Google says it themselves in the launch post: quality is lower, use regular Gemma 4 when facts matter.

Diffusion Gemma is 4x faster, but makes 6x more mistakes! We benchmarked the new diffusion LLM against its autoregressive twin on a single H100 (FP8). We gave each the same three tasks: write a Steve Jobs biography, the history of Tetris, and the story of BeOS - every next topic less popular than the previous one. Then we fact-checked every claim in every answer. Gemma4 got 45 facts right, 5 wrong. DiffusionGemma got 33 right, 28 wrong. The less popular the topic, the worse it got: 4 mistakes on Jobs, 12 on Tetris, 12 on BeOS. It named Clara Clley as Steve Jobs' mother, invented a colleague for Pajitnov named Geri Gulovik and priced the BeBox at $9,999. The real one cost $1,600. Outputs: Gemma4 26B A4B: 218 tok/s · 15.1s total · 45 facts · 5 mistakes DiffusionGemma 26B A4B: 763 tok/s · 3.7s total · 33 facts · 28 mistakes The reason is simple. DiffusionGemma throws 256 tokens on the screen at once and polishes them pass after pass until the text sounds smooth. Smooth is all it cares about: a fake name, date or number sounds just as smooth as a real one, so it stays. Regular Gemma4 meanwhile writes one word at a time and checks every new word against everything before it. Google says it themselves in the launch post: quality is lower, use regular Gemma 4 when facts matter.

atomic.chat

75,520 views • 18 days ago

We're moving beyond autoregressive LLMs! Autoregressive LLMs generate text word-by-word, which can be slow and affect quality, while diffusion models refine noise step-by-step, allowing for faster iterations and error correction. Here's Gemini Diffusion running at 857 tokens/s:

We're moving beyond autoregressive LLMs! Autoregressive LLMs generate text word-by-word, which can be slow and affect quality, while diffusion models refine noise step-by-step, allowing for faster iterations and error correction. Here's Gemini Diffusion running at 857 tokens/s:

Akshay 🚀

34,524 views • 1 year ago

PowerInfer - a high-speed inference engine for deploying LLMs locally. Just came across this super interesting project on speeding up inference. It's not MoE but it's a simple approach that exploits the high locality in LLM inference to design a GPU-CPU hybrid inference engine. Hot-activated neurons are preloaded onto the GPU for fast access, while cold-activated neurons (the majority) are computed on the CPU. This approach significantly reduces GPU memory demands and CPU-GPU data transfer. It achieves an average token generation rate of 13.20 tokens/s, with a peak of 29.08 tokens/s, across various LLMs on a single NVIDIA RTX 4090 GPU. It's on only 18% lower than that achieved by a top-tier server-grade A100 GPU. It also significantly outperforms llama.cpp by up to 11.69x while retaining model accuracy. There is a lot more innovation around inference that's coming fast. Really encouraged by the study on sparse computation to enhance the computational efficiency of LLMs. It's now possible to use PowerInfer with Llama 2 and Faclon 40B. Mistral-7B support is coming soon!

PowerInfer - a high-speed inference engine for deploying LLMs locally. Just came across this super interesting project on speeding up inference. It's not MoE but it's a simple approach that exploits the high locality in LLM inference to design a GPU-CPU hybrid inference engine. Hot-activated neurons are preloaded onto the GPU for fast access, while cold-activated neurons (the majority) are computed on the CPU. This approach significantly reduces GPU memory demands and CPU-GPU data transfer. It achieves an average token generation rate of 13.20 tokens/s, with a peak of 29.08 tokens/s, across various LLMs on a single NVIDIA RTX 4090 GPU. It's on only 18% lower than that achieved by a top-tier server-grade A100 GPU. It also significantly outperforms llama.cpp by up to 11.69x while retaining model accuracy. There is a lot more innovation around inference that's coming fast. Really encouraged by the study on sparse computation to enhance the computational efficiency of LLMs. It's now possible to use PowerInfer with Llama 2 and Faclon 40B. Mistral-7B support is coming soon!

elvis

261,583 views • 2 years ago

Google's new Gemma 4 is excellent, and the 26B MoE version is likely the best model to run on a 32GB Framework Desktop. It's fast, smart, and also great for tool calling if you use it with OpenClaw🦞 or other local agent platforms.

Google's new Gemma 4 is excellent, and the 26B MoE version is likely the best model to run on a 32GB Framework Desktop. It's fast, smart, and also great for tool calling if you use it with OpenClaw🦞 or other local agent platforms.

Framework

58,593 views • 2 months ago

It's like ChatGPT, but for proteins 😍 See RFdiffusion in action. Instead of writing essays, this AI tool builds new protein structures from scratch, or based on your prompts (like a desired shape or motif). It's powered by a deep learning technique called diffusion models, which are also behind tools like DALL·E and Stable Diffusion. I just tried it in Google Colab — and it’s mind-blowing to watch a protein form step-by-step. If you're into protein design, synthetic biology — this is worth exploring. 🧪 Try it here:

It's like ChatGPT, but for proteins 😍 See RFdiffusion in action. Instead of writing essays, this AI tool builds new protein structures from scratch, or based on your prompts (like a desired shape or motif). It's powered by a deep learning technique called diffusion models, which are also behind tools like DALL·E and Stable Diffusion. I just tried it in Google Colab — and it’s mind-blowing to watch a protein form step-by-step. If you're into protein design, synthetic biology — this is worth exploring. 🧪 Try it here:

Rafeeque

35,853 views • 1 year ago

HERMES AGENT NOW RUNS ON AN 8GB LAPTOP GPU JUST AS EASILY AS IT RUNS ON A 128GB MINI PC Nous Research shipped the official Hermes Agent Desktop App this week. Someone pointed it at a local llama server running on an RTX 4060 with 16GB system RAM. The integration took two minutes The model behind it: Gemma 4 26B MoE, QAT quantized, running on 8GB of VRAM. A 60k token prompt held a stable 20 tokens a second, flat, no slowdown as context grew. The flags were nothing exotic, just -cmoe -c 248000 on llama.cpp What that 8GB setup does out of the box: reads and patches its own code, runs it in a terminal, debugs errors, manages GitHub repos, spawns sub-agents for parallel work. Browses the web with vision to debug a UI. Schedules cron jobs in plain language. Connects to Notion, Google Workspace, Linear, and Obsidian to manage tasks on its own That's the same agent layer running on a Minisforum MS-S1 MAX with 128GB of unified memory, 96GB of it to the GPU, holding a 120B model at 56 tokens a second instead of a 26B model at 20. Same software, same tool execution, same zero API key. The only thing that changes between an $800 laptop and a $2,000 mini PC is how big a model you can afford to run underneath it The barrier to running a real autonomous agent locally didn't just drop. It dropped all the way down to hardware most people already own

HERMES AGENT NOW RUNS ON AN 8GB LAPTOP GPU JUST AS EASILY AS IT RUNS ON A 128GB MINI PC Nous Research shipped the official Hermes Agent Desktop App this week. Someone pointed it at a local llama server running on an RTX 4060 with 16GB system RAM. The integration took two minutes The model behind it: Gemma 4 26B MoE, QAT quantized, running on 8GB of VRAM. A 60k token prompt held a stable 20 tokens a second, flat, no slowdown as context grew. The flags were nothing exotic, just -cmoe -c 248000 on llama.cpp What that 8GB setup does out of the box: reads and patches its own code, runs it in a terminal, debugs errors, manages GitHub repos, spawns sub-agents for parallel work. Browses the web with vision to debug a UI. Schedules cron jobs in plain language. Connects to Notion, Google Workspace, Linear, and Obsidian to manage tasks on its own That's the same agent layer running on a Minisforum MS-S1 MAX with 128GB of unified memory, 96GB of it to the GPU, holding a 120B model at 56 tokens a second instead of a 26B model at 20. Same software, same tool execution, same zero API key. The only thing that changes between an $800 laptop and a $2,000 mini PC is how big a model you can afford to run underneath it The barrier to running a real autonomous agent locally didn't just drop. It dropped all the way down to hardware most people already own

NO1ennn

40,079 views • 10 days ago

a new 8GB VRAM GPU dense Local LLM leader was born yesterday runs on: RTX 4060 / RTX 3070 / RTX 2080. any 8GB card Qwen 3.5 9B (dense) was the go to for 6-8GB VRAM builds. Gemma 4 12B QAT (dense) just changed that. same llama.cpp + cuda 13.2. i7 12700H. 16GB RAM. same -ngl 99 flags. same 48k context. unsloth gemma-4-12b-it-Q4_K_M.gguf → 15 tok/sec @ 48k ctx unsloth gemma-4-12B-it-qat-UD-Q4_K_XL.gguf → 32 tok/sec @ 48k ctx → 26 tok/sec @ 64k ctx 64k context is a big deal. Hermes 3 agent requires 64k minimum to run. you're now getting full hermes compatible context on a budget consumer GPU at 26 tok/sec locally. 2.1x faster on identical hardware. and here's the part that breaks your brain: the QAT-UD-Q4_K_XL is actually SMALLER than the Q4_K_M "XL" why? QAT = Quantization Aware Training Google didn't train the model first and compress it later they trained it to be quantized from day one the weights already know how to survive low precision that's why you get more quality per byte llamacpp flags: -m gemma-4-12B-it-qat-UD-Q4_K_XL.gguf -cnv -ngl 99 -c 48000 -v fits in 8GB VRAM clean. no API. no cloud. no subscription. and this isn't even the MTP variant yet Gemma-4-E2B QAT runs on 3GB RAM, E4B on 5GB, 12B on 7GB, 26-A4B on 15GB and 31B on 18GB. I have benchmarked the 26b and 31b qat as well on a single RTX 4090, checkout the comments for details. If you have a 6GB or 8GB VRAM GPU, post your numbers. more benchmarks and configs coming soon

a new 8GB VRAM GPU dense Local LLM leader was born yesterday runs on: RTX 4060 / RTX 3070 / RTX 2080. any 8GB card Qwen 3.5 9B (dense) was the go to for 6-8GB VRAM builds. Gemma 4 12B QAT (dense) just changed that. same llama.cpp + cuda 13.2. i7 12700H. 16GB RAM. same -ngl 99 flags. same 48k context. unsloth gemma-4-12b-it-Q4_K_M.gguf → 15 tok/sec @ 48k ctx unsloth gemma-4-12B-it-qat-UD-Q4_K_XL.gguf → 32 tok/sec @ 48k ctx → 26 tok/sec @ 64k ctx 64k context is a big deal. Hermes 3 agent requires 64k minimum to run. you're now getting full hermes compatible context on a budget consumer GPU at 26 tok/sec locally. 2.1x faster on identical hardware. and here's the part that breaks your brain: the QAT-UD-Q4_K_XL is actually SMALLER than the Q4_K_M "XL" why? QAT = Quantization Aware Training Google didn't train the model first and compress it later they trained it to be quantized from day one the weights already know how to survive low precision that's why you get more quality per byte llamacpp flags: -m gemma-4-12B-it-qat-UD-Q4_K_XL.gguf -cnv -ngl 99 -c 48000 -v fits in 8GB VRAM clean. no API. no cloud. no subscription. and this isn't even the MTP variant yet Gemma-4-E2B QAT runs on 3GB RAM, E4B on 5GB, 12B on 7GB, 26-A4B on 15GB and 31B on 18GB. I have benchmarked the 26b and 31b qat as well on a single RTX 4090, checkout the comments for details. If you have a 6GB or 8GB VRAM GPU, post your numbers. more benchmarks and configs coming soon

Alok

259,993 views • 24 days ago

Andrew Tate said OpenClaw is a scam. Developers are using it every day. So who’s right? Here’s my take: It’s free. You can plug in free APIs. You can test it yourself in minutes. Is it perfect? No. It breaks. A lot. That’s why I personally prefer Claude Code. It just doesn’t crash on me daily. But don’t take my word for it. Test both. Make up your own mind.

Andrew Tate said OpenClaw is a scam. Developers are using it every day. So who’s right? Here’s my take: It’s free. You can plug in free APIs. You can test it yourself in minutes. Is it perfect? No. It breaks. A lot. That’s why I personally prefer Claude Code. It just doesn’t crash on me daily. But don’t take my word for it. Test both. Make up your own mind.

Julian Goldie SEO

10,316 views • 4 months ago

have you played with Gemma 3-12B-IT on Hugging Face Spaces yet? 😏 you can also pull this and run locally if you feel like it 🤝 here's me asking Gemma 3 for styling tips (interleaved inference) 👒

have you played with Gemma 3-12B-IT on Hugging Face Spaces yet? 😏 you can also pull this and run locally if you feel like it 🤝 here's me asking Gemma 3 for styling tips (interleaved inference) 👒

merve

17,275 views • 1 year ago

now doing gemma 4 moe. it'a crazy how usable this is. just a little web search tool (wiggles its head on tool calls), and it can serve as a mostly capable, local voice assistant. tomorrow i shall replace google auto crap with this in my car. just need to give it access to spotify.

now doing gemma 4 moe. it'a crazy how usable this is. just a little web search tool (wiggles its head on tool calls), and it can serve as a mostly capable, local voice assistant. tomorrow i shall replace google auto crap with this in my car. just need to give it access to spotify.

Mario Zechner

27,265 views • 1 month ago

Stable Diffusion generates beautiful images, but can it be used for open-world recognition? Try Demo! Our #CVPR2023 paper shows that the pre-trained diffusion model indeed is a good image parser, allows for open-vocabulary segmentation and detection.

Stable Diffusion generates beautiful images, but can it be used for open-world recognition? Try Demo! Our #CVPR2023 paper shows that the pre-trained diffusion model indeed is a good image parser, allows for open-vocabulary segmentation and detection.

Xiaolong Wang

241,225 views • 3 years ago

Just found the dLLM library to create Diffusion Language Models It's still early but it's insanely fun to experiment with diffusion (training, inference, eval) dLLM has the potential of becoming the main library for diffusion LLMs

Just found the dLLM library to create Diffusion Language Models It's still early but it's insanely fun to experiment with diffusion (training, inference, eval) dLLM has the potential of becoming the main library for diffusion LLMs

Maxime Labonne

75,421 views • 7 months ago