Red Hat AI

@RedHat_AI • 11,373 subscribers

Accelerating AI innovation with open platforms and community. The future of AI is open.

Videos

Gemma 4 Diffusion landed in vLLM last week. Day 0. First diffusion LLM natively supported in vLLM. Instead of one token at a time, it predicts 256 tokens at once and iteratively denoises them in parallel. Result: 1,000+ tokens per second at batch size 1 on a single H100. Built on Model Runner V2. Google Gemma

17,637 views • 1 month ago

What compression looks like on vLLM. Same Gemma 4 31B. Red Hat AI's quantized version runs at nearly 2x tokens/sec, with half the memory and 99%+ accuracy retained. Open source. Quantized with LLM Compressor. Links in comments. 🙏 Thanks to Sawyer Bowerman for the 2-minute demo.

34,199 views • 3 months ago

Laguna XS.2 from Poolside is a 33B MoE built for agentic coding. Red Hat AI trained a DFlash speculator for it: 0.6B drafter, 8 tokens per pass, no quality loss. FP8, NVFP4, and INT4 checkpoints via LLM Compressor. Models in comments. Speedup with vLLM:

21,221 views • 2 months ago

Gemma 4 12B dropped today. Apache 2.0, multimodal: text, image, audio, and video. 256K context, built-in thinking, native tool calling. Running on Red Hat OpenShift AI with vLLM on Day 0:

15,968 views • 1 month ago

Michael Goin walks through what's new in vLLM v0.17, v0.18, and v0.19 in ~8 minutes. Flash Attention 4, new performance modes, zero-bubble async scheduling, online MXFP4 quantization, Gemma 4, and a lot more. 1,592 commits. 682 contributors (163 new). 🎉 🚀

23,115 views • 3 months ago

InferenceMAX, vLLM TPU, compressed-tensors, MoE support via transformers, DeepSeek-OCR, and more. Here’s what’s new in the vLLM community over the past two weeks:

24,429 views • 9 months ago

A full year of vLLM in 30 minutes, presented by Simon Mo, vLLM lead from UC Berkeley. Model and hardware usage trends, model architectures, API evolution, V1 engine rebuild, multimodal progress, expanding hardware support, and more. Plus how we are thinking about 2026. Enjoy!

15,713 views • 7 months ago
