What compression looks like on vLLM. Same Gemma 4 31B. Red Hat AI's quantized version runs at nearly 2x tokens/sec, half the memory, 99%+ accuracy retained. Open source. Quantized with LLM Compressor. Links in comments. 🙏 Sawyer Bowerman for the 2-minute demo.

Red Hat AI

11,136 subscribers

34,136 views • 2 months ago • via X (Twitter)

Education • News & Politics • Science & Technology
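The "half the memory" claim in the post above can be sanity-checked with simple weight-size arithmetic. The post does not say which quantization scheme the checkpoint uses, so the sketch below assumes a 16-bit baseline and an 8-bit quantized checkpoint, and treats "31B" as exactly 31 billion parameters. It counts weights only and ignores KV cache, activations, and any per-group scale overhead.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes); weights only."""
    return num_params * bits_per_weight / 8 / 1e9


# "31B" is taken from the model name; the exact parameter count is an assumption.
params = 31e9

bf16_gb = weight_memory_gb(params, 16)  # assumed 16-bit baseline
quant_gb = weight_memory_gb(params, 8)  # assumed 8-bit quantized checkpoint

print(f"16-bit weights: {bf16_gb:.1f} GB")
print(f"8-bit weights:  {quant_gb:.1f} GB")
print(f"ratio:          {quant_gb / bf16_gb:.2f}")
```

Under these assumptions the weights shrink from about 62 GB to about 31 GB. A 4-bit scheme would shrink them further, so the real checkpoint may differ from this estimate.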

Related videos

Laguna XS.2 from Poolside is a 33B MoE built for agentic coding. Red Hat AI trained a DFlash speculator for it: 0.6B drafter, 8 tokens per pass, no quality loss. FP8, NVFP4, and INT4 checkpoints via LLM Compressor. Models in comments. Speedup with vLLM:

Red Hat AI

20,828 views • 1 month ago

Gemma 4 Diffusion landed in vLLM last week. Day 0. First diffusion LLM natively supported in vLLM. Instead of one token at a time, it predicts 256 tokens at once and iteratively denoises them in parallel. Result: 1,000+ tokens per second at batch size 1 on a single H100. Built on Model Runner V2. Google Gemma

Red Hat AI

17,524 views • 15 days ago

Running Qwen3 8B thinking on an iPhone Air with MLX. The model is quantized to 4-bit and runs pretty well.

Awni Hannun

215,529 views • 9 months ago

A perfect coding model for MLX on Apple silicon. Qwen delivered again. Runs quite fast on an M3 Ultra. Running the 4-bit quantized model with mlx-lm:

Awni Hannun

186,641 views • 11 months ago

We open-sourced QeRL (Quantization-enhanced Reinforcement Learning)!
🧠 4-bit quantized RL training
💪 Train a 32B LLM on a single H100 GPU
⚙️ 1.7× faster overall training
🎯 Accuracy on par with bfloat16
🔥 Supports NVFP4 quantization format
Moreover, we show that quantization helps exploration in RL training.
Paper: Code:
#NVIDIA #AIResearch #ReinforcementLearning #Quantization #LLM #EfficientAI

Yukang Chen

69,747 views • 8 months ago

What does it take to run 3, 5, or even 10 concurrent instances of Gemma 4 locally? We've open-sourced a demo letting you run multiple models side-by-side on your hardware. Gemma 4 26B A4B easily runs 10+ concurrent requests on a MacBook Pro M4 Max at 18 tokens/sec per request.

Google Gemma

912,192 views • 2 months ago

We’re thrilled to open-source TriAttention! 🚀
🦞 Deploy OpenClaw (32B LLM) on a single 24GB RTX 4090 locally
💻 Full code open-source & vLLM-ready for one-click deployment
⚡️ 2.5× faster inference speed & 10.7× less KV cache memory usage
TriAttention is a novel KV cache compression method built on rigorous trigonometric analysis in the Pre-RoPE space for efficient LLM long reasoning.
Github Repo: Paper Link: Homepage:

Yukang Chen

197,268 views • 2 months ago

GLM 4.6 runs quite fast on an M3 Ultra with mlx-lm even at higher precision. Pretty remarkable that it benchmarks competitively with the just-released Sonnet 4.5. Hope those benchmarks hold up in day-to-day use. Here's a run using a 5.5 bpw quantized model, generating 5.3k tokens at 17+ tok/sec using 244 GB. What prompts should I test?

Awni Hannun

68,539 views • 9 months ago

a new 8GB VRAM GPU dense Local LLM leader was born yesterday
runs on: RTX 4060 / RTX 3070 / RTX 2080. any 8GB card
Qwen 3.5 9B (dense) was the go-to for 6-8GB VRAM builds. Gemma 4 12B QAT (dense) just changed that.
same llama.cpp + cuda 13.2. i7 12700H. 16GB RAM. same -ngl 99 flags. same 48k context.
unsloth gemma-4-12b-it-Q4_K_M.gguf → 15 tok/sec @ 48k ctx
unsloth gemma-4-12B-it-qat-UD-Q4_K_XL.gguf → 32 tok/sec @ 48k ctx → 26 tok/sec @ 64k ctx
64k context is a big deal. Hermes 3 agent requires 64k minimum to run. you're now getting full hermes compatible context on a budget consumer GPU at 26 tok/sec locally.
2.1x faster on identical hardware.
and here's the part that breaks your brain: the QAT-UD-Q4_K_XL is actually SMALLER than the Q4_K_M "XL"
why? QAT = Quantization Aware Training
Google didn't train the model first and compress it later
they trained it to be quantized from day one
the weights already know how to survive low precision
that's why you get more quality per byte
llamacpp flags: -m gemma-4-12B-it-qat-UD-Q4_K_XL.gguf -cnv -ngl 99 -c 48000 -v
fits in 8GB VRAM clean. no API. no cloud. no subscription.
and this isn't even the MTP variant yet
Gemma-4-E2B QAT runs on 3GB RAM, E4B on 5GB, 12B on 7GB, 26-A4B on 15GB and 31B on 18GB.
I have benchmarked the 26b and 31b qat as well on a single RTX 4090, check out the comments for details.
If you have a 6GB or 8GB VRAM GPU, post your numbers. more benchmarks and configs coming soon

Alok

259,993 views • 24 days ago
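The "2.1x faster" figure in the post above follows from the two 48k-context throughput numbers it quotes. A minimal check, using only the tok/sec values from the post (the function and variable names are illustrative):

```python
def speedup(baseline_tok_s: float, new_tok_s: float) -> float:
    """Throughput ratio of the new run over the baseline run."""
    return new_tok_s / baseline_tok_s


# Throughput figures quoted in the post, both at 48k context.
q4_k_m_tok_s = 15.0      # gemma-4-12b-it-Q4_K_M.gguf
qat_ud_q4_tok_s = 32.0   # gemma-4-12B-it-qat-UD-Q4_K_XL.gguf

ratio = speedup(q4_k_m_tok_s, qat_ud_q4_tok_s)
print(f"speedup at 48k ctx: {ratio:.2f}x")
```

The 26 tok/sec figure at 64k context has no matching baseline in the post, so it is left out of the comparison.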

The era of 1-bit LLMs is here — now with WebGPU acceleration! 🤯 It's incredible to think that a quantized 1.7B model (just 290MB in size) can run at ~100 tokens per second entirely in your browser. Try the demo yourself 👇

Xenova

105,189 views • 2 months ago

Quantized Gemma 2B runs pretty fast on my iPhone 15 Pro in MLX Swift. Code & docs: Comparable to GPT 3.5 Turbo and Mixtral 8x7B in LMSYS Org benchmarks but runs efficiently on an iPhone. Pretty wild.

Awni Hannun

79,702 views • 1 year ago

Gemma 4 12B dropped today. Apache 2.0, multimodal: text, image, audio, and video. 256K context, built-in thinking, native tool calling. Running on Red Hat OpenShift AI with vLLM on Day 0:

Red Hat AI

15,902 views • 26 days ago

Running Ring 1T reasoning model on a single M3 Ultra with mlx-lm. It's quantized to 3.5 bits-per-weight. Uses 440GB and generated ~6k tokens at 18.2 toks/sec. Getting closer to GPT-5 at home.

Awni Hannun

55,131 views • 9 months ago

Gemma 4 31B is way too slow on my MacBook Pro M5 Max with 128GB of unified memory. 42 seconds to generate the first line of code. 21 tokens per second after that. Uses 26GB of memory when loaded. I sat there waiting nearly a minute before it even started writing code. The output? Actually decent. The lava lamp it generated looks good. But 42 seconds to first token is a dealbreaker. Claude Opus 4.7 starts generating in under 2 seconds. GPT 5.5 is instant. Good model. Unusable speed. Local inference on a $5,399 machine still can't match the cloud.

BridgeMind

40,042 views • 1 month ago

Qwen3-Coder-Flash runs quite fast on an M4 Max with mlx-lm. Running the 4-bit here, generated 4,467 tokens at >107 tokens/sec:

Awni Hannun

196,482 views • 11 months ago

Someone built a free and better alternative to Claude that runs 100% locally.
→ works with any LLM (Claude, GPT, Gemini, vLLM)
→ beats it on deep research
→ has Cowork-like capabilities
→ 50+ connectors out of the box
→ deploy in literally one command
100% open source. MIT license. 28k stars.

How To Prompt

39,043 views • 1 month ago

got gemma 4 31B with MTP running on my DGX Spark. Hermes Agent did most of the legwork.
baseline vs MTP on GB10:
• c=1: 3.65 → 6.37 tok/s (1.74x)
• c=4: 14.34 → 23.59 tok/s (1.65x)
• c=8: 14.37 → 24.18 tok/s (1.68x)
google says "up to 2x". we're not quite there but it's real, not vapor.
stack: DGX Spark / GB10 + gemma-4-31b-it + gemma-4-31b-it-assistant (MTP drafter) + vLLM built from PR 41745
MTP is basically a lightweight draft model that predicts multiple tokens while the big model verifies them all at once. smaller model does the busywork, bigger model just says yes/no. simple idea, weird to implement.
next: tune the draft block size and see if we can push past 2x. also want to try it with Hermes Agent feeding prompts end to end.
p.s: this was all done from telegram.
Google DeepMind NVIDIA AI Developer

Joey

22,855 views • 1 month ago
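The draft-then-verify idea described in the post above can be illustrated with a toy greedy speculative decoding loop. This is a sketch under stated assumptions, not vLLM's or Gemma's implementation: `target_next` and `draft_next` are hypothetical stand-ins for the big model and the drafter, tokens are small integers, and the per-position verification loop stands in for what a real engine would do in one batched forward pass.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function


def target_next(ctx: List[Token]) -> Token:
    """Hypothetical 'big model': counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def draft_next(ctx: List[Token]) -> Token:
    """Hypothetical drafter: agrees with the target except after token 4."""
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


def speculative_step(prefix: List[Token], draft: NextToken,
                     target: NextToken, k: int) -> List[Token]:
    """One draft-and-verify pass; returns the tokens accepted in this pass."""
    # 1. The drafter proposes k tokens autoregressively.
    ctx = list(prefix)
    proposal: List[Token] = []
    for _ in range(k):
        tok = draft(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # 2. The target checks each proposed position in order.
    ctx = list(prefix)
    accepted: List[Token] = []
    for tok in proposal:
        expected = target(ctx)
        if expected != tok:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. Every draft token matched, so the target adds one extra token.
    accepted.append(target(ctx))
    return accepted


def generate(prefix: List[Token], draft: NextToken, target: NextToken,
             k: int, n_new: int) -> Tuple[List[Token], int]:
    """Generate n_new tokens; also report how many verify passes were used."""
    out = list(prefix)
    passes = 0
    while len(out) - len(prefix) < n_new:
        out.extend(speculative_step(out, draft, target, k))
        passes += 1
    return out[: len(prefix) + n_new], passes


def greedy(prefix: List[Token], target: NextToken, n_new: int) -> List[Token]:
    """Plain one-token-at-a-time decoding with the target, for comparison."""
    out = list(prefix)
    for _ in range(n_new):
        out.append(target(out))
    return out


spec_out, passes = generate([0], draft_next, target_next, k=5, n_new=12)
plain_out = greedy([0], target_next, n_new=12)
print(spec_out == plain_out, passes)
```

Because every accepted token is either confirmed or supplied by the target, the output matches plain greedy decoding while needing fewer target passes than tokens generated. How much faster this is in practice depends on how often the drafter agrees with the target, which this toy does not model realistically.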

this dude created a Gemma 4 that is 6X faster than Google's version!
Gemma 4 is notorious for:
• being the most optimized free open source local AI
• that works with even old laptops and smartphones
• and now it's insanely fast
6X Dflash Gemma 4:

Meta Alchemist

277,426 views • 1 month ago

DFlash speculative decoding on Apple Silicon
Qwen3.5-9B bf16 · M5 Max · greedy exact match
▸ 85 tok/s, 3.3× at 1024 tokens (runtime)
▸ ~70 tok/s, 2.6× in the video (terminal I/O overhead)
▸ 80 tok/s, 3.1× at 2048 tokens (runtime)
Currently working on:
→ Long context (speedup degrades past 4K tokens, KV cache growth)
→ Int4 quantized models (27B class)
Built on MLX, no CUDA, single machine. Draft generates 16 tokens in parallel, target verifies in one forward pass. Will open source when ready.

bstn 👁️

36,942 views • 2 months ago
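The throughput and speedup pairs in the post above imply a baseline decoding speed that the post does not state directly. The sketch below back-calculates it by dividing each tok/s figure by its speedup. These are derived estimates, not measurements, and the "~70 tok/s" entry is approximate in the source.

```python
def implied_baseline(tok_s: float, speedup: float) -> float:
    """Baseline tok/s implied by a reported throughput and speedup factor."""
    return tok_s / speedup


# (reported tok/s, reported speedup) pairs taken from the post.
runs = {
    "1024 tokens (runtime)": (85.0, 3.3),
    "video (terminal I/O overhead)": (70.0, 2.6),
    "2048 tokens (runtime)": (80.0, 3.1),
}

baselines = {label: implied_baseline(t, s) for label, (t, s) in runs.items()}
for label, value in baselines.items():
    print(f"{label}: about {value:.1f} tok/s without the drafter")
```

All three figures land between 25 and 27 tok/s, so the reported numbers are at least consistent with a single non-speculative baseline.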