Gemma 4 Diffusion landed in vLLM last week. Day 0. First diffusion LLM natively supported in vLLM. Instead of one token at a time, it predicts 256 tokens at once and iteratively denoises them in parallel. Result: 1,000+ tokens per second at batch size 1 on a single H100....

Red Hat AI

11,135 subscribers

17,524 views • 8 days ago • via X (Twitter)
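
The parallel denoising described in the post above can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not vLLM's or Gemma's actual sampler: the `denoiser` callable is a hypothetical stand-in for a model forward pass, the confidence-based commit schedule is one common choice for masked diffusion decoding rather than something the post confirms, and the step count of 16 is illustrative only.

```python
from typing import Callable, List, Optional, Tuple

MASK = None  # marks a position whose token has not been committed yet

# Hypothetical denoiser: sees the whole block and returns a
# (token, confidence) guess for every position in a single pass.
Denoiser = Callable[[List[Optional[str]]], List[Tuple[str, float]]]


def parallel_denoise(denoiser: Denoiser, block_size: int, steps: int) -> List[Optional[str]]:
    """Fill a block of tokens in `steps` parallel passes instead of one pass per token."""
    block: List[Optional[str]] = [MASK] * block_size
    per_step = -(-block_size // steps)  # ceiling division: tokens committed per pass
    for _ in range(steps):
        masked = [i for i, tok in enumerate(block) if tok is MASK]
        if not masked:
            break
        preds = denoiser(block)  # one pass predicts every position at once
        # Commit only the most confident guesses; the rest stay masked
        # and are re-predicted on the next pass with more context.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:per_step]:
            block[i] = preds[i][0]
    return block


def make_toy_denoiser(target: List[str]):
    """Toy denoiser that always guesses the target text and counts its calls."""
    calls = {"n": 0}

    def denoiser(block: List[Optional[str]]) -> List[Tuple[str, float]]:
        calls["n"] += 1
        return [(tok, 1.0 / (1 + i)) for i, tok in enumerate(target)]

    return denoiser, calls


target = [f"tok{i}" for i in range(256)]
denoiser, calls = make_toy_denoiser(target)
result = parallel_denoise(denoiser, block_size=256, steps=16)
print(calls["n"], "passes for", len(result), "tokens")
```

In this toy setup a 256-token block is filled in 16 passes, where token-by-token decoding would need 256. The real speedup depends on how many denoising steps the model needs, which the post does not state.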

Related videos

Gemma 4 running locally on a Nintendo Switch :) 1.5 tokens per second haha, but it runs! Google Gemma Google AI Developers Google DeepMind

Maddie D. Reese

188,407 views • 2 months ago

introducing simple-llm: a ~950 line, powerful & extensible inference engine that performs on par with vllm. enjoy :)
performance (gpt-oss-120b, on an h100):
- batch=1: 135 tok/s (vllm: 138)
- batch=64: 4,041 tok/s (vllm: 3,846)

naklecha

59,730 views • 5 months ago

What compression looks like on vLLM. Same Gemma 4 31B. Red Hat AI's quantized version runs at nearly 2x tokens/sec, half the memory, 99%+ accuracy retained. Open source. Quantized with LLM Compressor. Links in comments. 🙏 Sawyer Bowerman for the 2-minute demo.

Red Hat AI

34,136 views • 2 months ago

Today's autoregressive models generate one token at a time. Mercury 2 generates tokens in parallel. Over 1,000 tok/sec on standard GPUs, at comparable quality to speed-optimized models. Since launch, the community has been showing what diffusion LLMs can unlock. Thanks to the team at Clyep for the breakdown.

Inception

21,104 views • 1 month ago

Gemma 4 12B dropped today. Apache 2.0, multimodal: text, image, audio, and video. 256K context, built-in thinking, native tool calling. Running on Red Hat OpenShift AI with vLLM on Day 0:

Red Hat AI

15,902 views • 19 days ago

Researchers found a way to make LLMs 8.5x faster! (without compromising accuracy)

Speculative decoding is quite an effective way to address the single-token bottleneck in traditional LLM inference. A small "draft" model first generates the next several tokens, then the large model verifies all of them at once in a single forward pass. If a token at any position is wrong, you keep everything before it and restart from there. This never does worse than normal decoding.

But current drafters in speculative decoding still guess one token at a time. That makes the drafting step itself a bottleneck, capping real-world speedups at 2-3x.

DFlash is a new technique that swaps the autoregressive drafter for a lightweight block diffusion model that guesses all tokens in one parallel shot. Drafting cost stays flat no matter how many tokens you speculate. On top of that, the drafter is conditioned on hidden features pulled from multiple layers of the target model and injected into every draft layer, so it makes significantly better guesses than a drafter working from scratch.

In the side-by-side demo below, vanilla decoding runs at 48.5 tokens/sec. DFlash hits 415 tokens/sec on the same model, with zero quality loss.

It's already integrated with vLLM, SGLang, and Transformers, with draft models on HuggingFace for several models like Qwen3, Qwen3.5, Llama 3.1, Kimi-K2.5, gpt-oss, and many more. I have shared the GitHub repo in the replies!

KV caching is another must-know technique to boost LLM inference. I recently wrote an article about it. Read it below.

👉 Over to you: What use case are you working on that can benefit from this new technique?

Avi Chawla

157,137 views • 1 month ago
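
The draft-then-verify loop that Avi Chawla's post describes can be sketched as follows. This is a minimal greedy version under stated assumptions, not DFlash and not any engine's actual implementation: the `target` and `draft` callables are hypothetical toy models that map a token prefix to the next token, and verification is simulated with a Python loop where a real engine would score all drafted positions in one batched forward pass.

```python
from typing import Callable, List

# Toy stand-in for a model: maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Baseline: the target model emits one token per step."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], max_new: int, k: int = 4) -> List[int]:
    """Draft k tokens cheaply, then keep the prefix the target agrees with."""
    out = list(prompt)
    goal = len(prompt) + max_new
    while len(out) < goal:
        # 1. The draft model proposes k tokens.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2. The target checks each proposed position (simulated sequentially).
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            accepted.append(expected)  # equals tok when the draft was right
            if tok != expected:
                break  # first mismatch: keep the correction, drop the rest
        else:
            # Every draft token matched, so the target's next token comes for free.
            accepted.append(target(out + accepted))
        out.extend(accepted)
    return out[:goal]


def target(prefix: List[int]) -> int:
    return (prefix[-1] + 1) % 10  # toy "large" model: count upward mod 10


def draft(prefix: List[int]) -> int:
    return 0 if prefix[-1] == 4 else (prefix[-1] + 1) % 10  # wrong after a 4


same = speculative_decode(target, draft, [0], 25) == greedy_decode(target, [0], 25)
print("matches plain greedy decoding:", same)
```

Because every kept token is the one the target itself would have chosen, the output matches plain greedy decoding of the target; the draft only changes how many target steps are needed. Sampling-based variants use a probabilistic acceptance rule instead of the exact-match check shown here.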

got gemma 4 31B with MTP running on my DGX Spark. Hermes Agent did most of the legwork.

baseline vs MTP on GB10:
• c=1: 3.65 → 6.37 tok/s (1.74x)
• c=4: 14.34 → 23.59 tok/s (1.65x)
• c=8: 14.37 → 24.18 tok/s (1.68x)

google says "up to 2x" - we're not quite there but it's real, not vapor.

stack: DGX Spark / GB10 + gemma-4-31b-it + gemma-4-31b-it-assistant (MTP drafter) + vLLM built from PR 41745

MTP is basically a lightweight draft model that predicts multiple tokens while the big model verifies them all at once. smaller model does the busywork, bigger model just says yes/no. simple idea, weird to implement.

next: tune the draft block size and see if we can push past 2x. also want to try it with Hermes Agent feeding prompts end to end.

p.s: this was all done from telegram. Google DeepMind NVIDIA AI Developer

Joey

22,855 views • 1 month ago

🎉 Day-0 support for in vLLM, available today in v0.23.0! Congrats to Z.ai on GLM-5.2, a flagship model built for long-horizon coding agents.
✨ 1M-token context, built to hold project-scale engineering work in a single run
✨ Tuned for long-horizon coding: large-scale implementation, automated research, and performance optimization
✨ One task can carry a full dev workflow, from requirements to a deployable product across platforms
✨ Client-side and mobile engineering, including an on-device debugging loop
Try running it on vLLM today: 🔗

vLLM

34,952 views • 7 days ago

Can Google Gemma DiffusionGemma help fix broken OCR? In theory, denoising tokens in parallel could work better for OCR correction since context is seen upfront? Pointed it at 19th-century newspaper OCR. It corrected better than the autoregressive baseline — at ~8x the speed.

Daniel van Strien

36,105 views • 12 days ago

🚨In our NeurIPS paper, we bring encoder-decoders back.. for diffusion language models! ⚡️Encoder-decoders make diffusion sampling fast: a small (fast) decoder denoises tokens progressively and a large (slower) encoder represents clean context.

Marianne Arriola

32,003 views • 7 months ago

This video is at normal speed. Gemma 4 12B MLX version running locally at 50 tokens/sec. Thank you Google DeepMind team. This model feels really solid for a lot of small local tasks and everyday AI workflows.

AshutoshShrivastava

19,782 views • 19 days ago

Qwen3.5-35B-A3B running locally on an M4 chip at 49.5 tokens per second. A 35B model. On a laptop. In real time. LOCAL AI IS GETTING SCARY FAST.

0xMarioNawfal

477,891 views • 3 months ago

🚨 One orchestrator. 10 parallel agents. 100+ tokens a second. All local.
The Google Gemma team just dropped a MASSIVE demo for Gemma 4 26B. They built a concurrent workflow that lets the 26B model coordinate an entire team of sub-agents on your machine.
Out of the box, the cookbook lets you run 10 parallel agents to:
→ Code an entire SVG art gallery in seconds
→ Translate text simultaneously
→ Generate ASCII art
→ Write parallel code
Spinning up multi-agent systems locally has never looked this fast or this accessible. 100% free and open-source. repo link in 🧵↓

Charly Wargnier

27,691 views • 5 days ago

Autoregressive LLMs are officially on notice. run Gemma 4 26B diffusion gguf with llama.cpp

Google just dropped DiffusionGemma-26B, and it completely flips how we generate text. instead of predicting words one by one, it generates 256 tokens in parallel using bi-directional attention. it's like stable diffusion, but for language. the model starts with random text "noise" and iteratively refines and self-corrects the entire block in real-time to fix formatting and reasoning errors on the fly.

since it's a Mixture of Experts (MoE) that only activates 3.8B parameters during inference, it fits perfectly on consumer hardware. You can run the Q4_K_M quant with an 18GB VRAM budget on a single RTX 3090 or RTX 4090 with exceptional throughput. Tested on Ubuntu 22 with CUDA 13.1 using the cutting edge experimental llama.cpp branch.

Here is how to compile and run it with the live terminal denoising visualizer:

# 1. Clone & check out the experimental PR (#24423)
git clone && cd llama.cpp
git fetch origin pull/24423/head:diffusiongemma && git checkout diffusiongemma

# 2. Build with CUDA support
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=native
cmake --build build -j $(nproc) --config Release --target llama-diffusion-cli

# 3. Run with live visual denoising (llama.cpp flags)
./build/bin/llama-diffusion-cli \
  -m /path/to/diffusiongemma-26B-A4B-it-Q4_K_M.gguf \
  -ngl 99 -cnv -n 2048 --diffusion-visual

Watch the video below to see the live --diffusion-visual canvas iteratively denoising the prompt output in real time. guide and unsloth's hugging face GGUF model links are in the comments below!

Is autoregressive generation officially legacy tech? Let me know what you think.

Alok

52,656 views • 12 days ago

Diffusion Gemma is 4x faster, but makes 6x more mistakes!

We benchmarked the new diffusion LLM against its autoregressive twin on a single H100 (FP8). We gave each the same three tasks: write a Steve Jobs biography, the history of Tetris, and the story of BeOS - each topic less popular than the previous one. Then we fact-checked every claim in every answer.

Gemma4 got 45 facts right, 5 wrong. DiffusionGemma got 33 right, 28 wrong. The less popular the topic, the worse it got: 4 mistakes on Jobs, 12 on Tetris, 12 on BeOS. It named Clara Clley as Steve Jobs' mother, invented a colleague for Pajitnov named Geri Gulovik and priced the BeBox at $9,999. The real one cost $1,600.

Outputs:
Gemma4 26B A4B: 218 tok/s · 15.1s total · 45 facts · 5 mistakes
DiffusionGemma 26B A4B: 763 tok/s · 3.7s total · 33 facts · 28 mistakes

The reason is simple. DiffusionGemma throws 256 tokens on the screen at once and polishes them pass after pass until the text sounds smooth. Smooth is all it cares about: a fake name, date or number sounds just as smooth as a real one, so it stays. Regular Gemma4 meanwhile writes one word at a time and checks every new word against everything before it.

Google says it themselves in the launch post: quality is lower, use regular Gemma 4 when facts matter.

atomic.chat

74,917 views • 11 days ago
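
The headline ratios in the atomic.chat post above can be recomputed from the figures it reports. The snippet below is only a worked check of that arithmetic. The per-claim error rate is an added illustration that the post does not compute, and the dictionary layout and names are my own.

```python
# Figures exactly as reported in the post (single H100, FP8).
runs = {
    "Gemma4 26B A4B": {"tok_per_s": 218, "seconds": 15.1, "right": 45, "wrong": 5},
    "DiffusionGemma 26B A4B": {"tok_per_s": 763, "seconds": 3.7, "right": 33, "wrong": 28},
}


def error_rate(run: dict) -> float:
    """Share of fact-checked claims that were wrong."""
    return run["wrong"] / (run["right"] + run["wrong"])


ar = runs["Gemma4 26B A4B"]
diff = runs["DiffusionGemma 26B A4B"]

throughput_ratio = diff["tok_per_s"] / ar["tok_per_s"]  # tokens/sec speedup
wall_time_ratio = ar["seconds"] / diff["seconds"]       # end-to-end speedup
mistake_ratio = diff["wrong"] / ar["wrong"]             # raw mistake count ratio

print(f"throughput: {throughput_ratio:.2f}x, wall time: {wall_time_ratio:.2f}x")
print(f"mistakes: {mistake_ratio:.1f}x")
print(f"error rate: {error_rate(ar):.0%} vs {error_rate(diff):.0%}")
```

On these numbers the "4x faster" headline matches the wall-time ratio (about 4.1x) rather than the throughput ratio (3.5x), and "6x more mistakes" is 5.6x rounded up. Per claim, the error rate rises from 10% to roughly 46%.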

Laguna XS.2 from Poolside is a 33B MoE built for agentic coding. Red Hat AI trained a DFlash speculator for it: 0.6B drafter, 8 tokens per pass, no quality loss. FP8, NVFP4, and INT4 checkpoints via LLM Compressor. Models in comments. Speedup with vLLM:

Red Hat AI

20,828 views • 24 days ago

THIS DEVELOPER CONNECTED 8 NVIDIA DGX SPARKS INTO ONE CLUSTER - AND RAN AN 800GB MODEL THAT MADE HIM 10X MORE PRODUCTIVE

21:47 he says it straight - "this is a terabyte of VRAM - we ran Qwen 3.5, 800GB on disk, a model that doesn't even fit on a single Mac Studio - 24 tokens per second - I'd say that's a win"

8 Sparks connected through a $1,300 switch via RDMA over Ethernet - each node adding 128GB of memory into one unified pool of 1TB

started with one Spark at 3 tokens per second - every added node doubled the speed - and eight together deliver 24 tokens on a model that physically cannot run anywhere else

Kimi K2 at 600GB loaded in 15 minutes, 115GB per node, 13 tokens per second - a model that simply cannot run on anything smaller

Claude helped configure the entire cluster - SSH mesh across all 8 machines, network config, jumbo frames, QSFP port speeds - all from one terminal

most people rent cloud compute for models this size at $2,000+/month - he built the cluster once and now every token costs 20x less

Noisy

174,256 views • 3 days ago

Excited to share what my team has been working on lately - Gemini diffusion! We bring diffusion to language modeling, yielding more power and blazing speeds! 🚀🚀🚀 Gemini diffusion is especially strong at coding. In this example the model generates at 2000 tokens/sec, including overheads like tokenization, prefill, safety filters etc.

Brendan O'Donoghue

577,534 views • 1 year ago

What does it take to run 3, 5, or even 10 concurrent instances of Gemma 4 locally? We've open-sourced a demo letting you run multiple models side-by-side on your hardware. Gemma 4 26B A4B easily runs 10+ concurrent requests on a MacBook Pro M4 Max at 18 tokens/sec per request.

Google Gemma

912,070 views • 2 months ago

to run gemma 4 on your phone without any login or internet:
- download Google AI Edge Gallery
- select Gemma 4 E2B/E4B
- that's it
I wish they would add support for other edge models in future, like locallyAI
here is a video of gemma 4 E4B in action at 13 tps on iphone 15 pro max

Harveen Singh Chadha

13,535 views • 2 months ago