The leading model runnable on a single cloud H100 GPU now fits on a single home GPU! 🔥 We've optimized Gemma 3 27B with QAT so you can run our best-in-class open model on your desktop RTX 3090 or similar. See how easy it is to try via ollama! 👇

Glenn Cameron Jr

9,909 subscribers

81,427 views • 1 year ago • via X (Twitter)
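A rough back-of-the-envelope sketch of why quantization can move a 27B model from an H100 to an RTX 3090. It assumes the QAT checkpoint stores weights at roughly 4 bits each (the post does not state the bit width) and counts the weights only, ignoring KV cache and activations. The 24 GB figure is the RTX 3090's VRAM.

```python
def weight_gb(params: int, bits_per_weight: int) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9


GEMMA_3_27B_PARAMS = 27_000_000_000  # nominal count taken from the model name
RTX_3090_VRAM_GB = 24                # VRAM on the desktop card named in the post

# Compare 16-bit weights with an assumed 4-bit quantized format.
for label, bits in (("bf16", 16), ("4-bit (assumed QAT format)", 4)):
    size = weight_gb(GEMMA_3_27B_PARAMS, bits)
    verdict = "fits" if size < RTX_3090_VRAM_GB else "does not fit"
    print(f"{label}: {size:.1f} GB of weights, {verdict} in {RTX_3090_VRAM_GB} GB")
```

Real files will differ somewhat, since some tensors are usually kept at higher precision and the runtime needs headroom for context.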

Gaming Science & Technology Education

Related Videos

Trained a simple world model for my robot arm. It predicts the future over 20000 times faster than real time on a single NVIDIA RTX 3090 GPU (128 batch -> 160x faster each).

Alexander Koch

39,700 views • 2 years ago

MotionStream Real-Time Video Generation with Interactive Motion Controls model runs in real time on a single NVIDIA H100 GPU (29 FPS, 0.4s Latency)

AK

22,304 views • 8 months ago

Claude fable 5 is cooked. Claude fable 5 vs Gemma 4 26b a4b qat MTP. Gemma 4 26b running locally on my 8GB vram single RTX 4060 built this using three.js in a single session and 3 prompts. no cloud, no subscription, 100% private, unlimited use. how long until it can catch up completely?

Alok

30,782 views • 24 days ago

I love how open the Codex app is, you can run any model you want, even local. These are the only configs you need to use the app with Gemma 4 via Ollama.

Pietro Schirano

119,366 views • 1 month ago

it's open source time, with a real leap for world models 🎉 NVIDIA's SANA-WM: a camera-conditioned world model that fits on one GPU. 60s of 720p in 34s on a single 5090 - 2.6B params and Apache 2.0!

Victor M

34,386 views • 1 month ago

In what ways are you using #AI on your RTX GPU in your personal or professional life? 🤔 Reply with your answer and use #AIonRTX for a chance to win a MSI Gaming GeForce RTX 4080 GAMING X TRIO GPU. 👇

NVIDIA AI Developer

49,468 views • 2 years ago

Stanford dropped FramePack. This AI can run on a 6 GB laptop GPU to generate minute long 30fps video from a single image. No distillation, open source. 10 wild examples & how to try it: 👇

Min Choi

633,607 views • 1 year ago

#RTXRemix can leverage AI on your RTX GPU to help remaster classic games! 🎮 What game would you like to see remastered with #RTXRemix? 🤔 Repost & tell us below with #AIonRTX for a chance to win a GeForce RTX 4090 GPU! 👇

NVIDIA Studio

59,305 views • 2 years ago

Did you know that AI chatbots like ChatRTX can run on your local PC with a GeForce RTX GPU? 🙌 How would you use ChatRTX to elevate your gaming experience? Let us know below and use #AIonRTX for a chance to WIN an RTX ON Keycap or RTX 4090 poster! 👇

NVIDIA GeForce

66,743 views • 2 years ago

Fine-tune DeepSeek-OCR on your own language! (100% local)
DeepSeek-OCR is a 3B-parameter vision model that achieves 97% precision while using 10× fewer vision tokens than text-based LLMs. It handles tables, papers, and handwriting without killing your GPU or budget.
Why it matters: Most vision models treat documents as massive sequences of tokens, making long-context processing expensive and slow. DeepSeek-OCR uses context optical compression to convert 2D layouts into vision tokens, enabling efficient processing of complex documents.
The best part? You can easily fine-tune it for your specific use case on a single GPU. I used Unsloth to run this experiment on Persian text and saw an 88.26% improvement in character error rate.
↳ Base model: 149% character error rate (CER)
↳ Fine-tuned model: 60% CER (57% more accurate)
↳ Training time: 60 steps on a single GPU
Persian was just the test case. You can swap in your own dataset for any language, document type, or specific domain you're working with.
I've shared the complete guide in the next tweet - all the code, notebooks, and environment setup ready to run with a single click. Everything is 100% open-source!

Akshay 🚀

126,077 views • 7 months ago
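For readers unfamiliar with the metric in the DeepSeek-OCR post above, here is a minimal sketch of character error rate as it is commonly defined: edit distance between the model output and the reference, divided by the reference length. This is not the post's evaluation code, and the function names are illustrative. Because insertions count as errors, CER can exceed 100%, which is how a base-model figure such as 149% is possible.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: fewest insertions, deletions and substitutions."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference chars into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate as a fraction of the reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


print(cer("kitten", "sitting"))  # three edits over six reference characters
print(cer("ab", "abcdef"))       # four inserted characters over two: above 1.0
```

The sketch compares Python code points; for scripts with combining marks, such as Persian, an evaluation would normally normalize the text first.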

i found a way to make UNCENSORED AI AGENT on a RTX 4090 GPU (!!!) with LOCAL 30B model weights
this is GLM-4.7-Flash with abliteration, need 24GB VRAM, safety alignment surgically removed from the weights, the model has native tool calling, it actually executes bash, edits files, runs git
(1) use ollama to pull weights of GLM
> ollama pull huihui_ai/glm-4.7-flash-abliterated:q4_K
(2) proxy it to any coding agent via ollama
> ollama launch claude --model huihui_ai/glm-4.7-flash-abliterated:q4_K
> ollama launch codex --model huihui_ai/glm-4.7-flash-abliterated:q4_K
> ollama launch opencode --model huihui_ai/glm-4.7-flash-abliterated:q4_K
(3) have fun

chiefofautism

340,975 views • 4 months ago

NVIDIA Nemotron 3 Nano Omni, a new multimodal reasoning model, is now live on Jetson AI Lab and unifies vision, audio, and language into a single reasoning loop. 🙌 Power your NemoClaws by running this model with Ollama, vLLM and other inference frameworks on NVIDIA Jetson hardware. Try it ➡️

NVIDIA Robotics

16,031 views • 2 months ago

BOOM! STANFORD LAUNCHES FRAMEPACK A FREE OPEN SOURCE AI THAT CAN RUN ON 6 GB LAPTOP GPU TO GENERATE MINUTE LONG 30FPS VIDEO FROM SINGLE IMAGE. It is game changing…

Brian Roemmele

534,742 views • 1 year ago

Run Gemma 4 26B MoE on 8GB VRAM with 250k context at 20+ tokens/sec
If you own any 8GB VRAM graphics card, stop what you are doing. Local AI just had its absolute "Holy Shit" moment for budget hardware.
Yesterday, I benchmarked Unsloth Gemma 4 12B Q4_K_XL on an 8GB card. The community went wild but immediately demanded more: "Can we run a 25B+ model on budget GPUs?"
Today, I’m delivering exactly that. I am running a massive 26B parameter Mixture of Experts (MoE) model locally on a standard 8GB VRAM setup with 250k full native context! If you own an RTX 3060, 3070, 4060, or any budget GPU with 8GB of VRAM, the local AI paradigm has completely changed.
The performance metrics are astonishing:
- 20 tokens/sec flat decode throughput.
- Stable, flat decode speed even with massive prompts.
- I threw a 60k token prompt at it, and it still clocked in at 20 TPS without dropping a single frame.
# What about prefill?
Yes, Time To First Token (TTFT) is slightly high when swallowing massive contexts. But with a solid 200 tokens/sec prefill speed, the wait is barely noticeable and highly usable. And this is running completely without Multi Token Prediction (MTP) active.
How is this possible? It’s the magic of Google's new QAT (Quantization Aware Training) quants for Gemma 4. The model weight file (unsloth gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf) is only 13.2 GB, making it the ultimate local powerhouse.
# The Test Setup:
CPU: Intel Core i7
RAM: 16GB System RAM
GPU: NVIDIA GeForce RTX 4060 Laptop GPU (8GB VRAM)
# The Secret Sauce (The -cmoe Flag)
To make this work properly on any 8GB card, you must use the -cmoe (CPU MoE) flag in llama.cpp. This flag isolates the heavy MoE expert weights directly to system memory (CPU/RAM) while letting your GPU focus strictly on the Attention layers and the KV Cache. It prevents VRAM spillage and holds the throughput rock solid.
# The flags:
-m "gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf" -cmoe -c 248000 -v
Once running, just open the UI on localhost and toggle the new reasoning lightbulb icon in the text input box to watch the model perform multi step thinking.
Are you still running smaller models, or are you ready to scale up your budget local setups? Let's discuss in the replies

Alok

291,095 views • 27 days ago
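To put the throughput figures from the 8GB VRAM post above in wall-clock terms, here is a small sketch that converts tokens/sec into seconds. It assumes prefill and decode each run at a constant rate and that no part of the prompt is cached; the 500-token reply length is a hypothetical value, not a number from the post.

```python
def prefill_seconds(prompt_tokens: int, prefill_tps: float) -> float:
    """Time to first token if the whole prompt is processed at a constant rate."""
    if prefill_tps <= 0:
        raise ValueError("prefill_tps must be positive")
    return prompt_tokens / prefill_tps


def decode_seconds(output_tokens: int, decode_tps: float) -> float:
    """Time to generate the reply at a constant decode rate."""
    if decode_tps <= 0:
        raise ValueError("decode_tps must be positive")
    return output_tokens / decode_tps


PROMPT_TOKENS = 60_000  # the 60k-token prompt mentioned in the post
PREFILL_TPS = 200.0     # prefill speed reported in the post
DECODE_TPS = 20.0       # decode speed reported in the post
REPLY_TOKENS = 500      # hypothetical reply length

ttft = prefill_seconds(PROMPT_TOKENS, PREFILL_TPS)
reply = decode_seconds(REPLY_TOKENS, DECODE_TPS)
print(f"time to first token: {ttft:.0f} s")
print(f"reply generation:    {reply:.0f} s")
print(f"total:               {ttft + reply:.0f} s")
```

Swapping in your own measured rates gives a quick estimate of whether a long-context workflow is interactive or better run as a background job.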

BlackBird now runs on 8GB RAM Macs. No GPU. No cloud. Just fast, private AI agents - right on your MacBook Air. We optimized memory, speed, and thermal performance so anyone can build with AI. Try it: Next Stop: Windows Beta Drops This Week! DM Me if you want to try it. #OnDeviceAI #BlackBird #AIforEveryone #macOS

Hina Dixit

1,233,525 views • 1 year ago

my 8 GB VRAM gaming laptop is absolutely going to hate me for this. but I still did it. ran a 31b dense model (Gemma 4 31b Q4) with only 8 GB VRAM
last week I ran Gemma 4 26B A4B, a mixture of experts model, on my RTX 4060 and hit 25–28 tokens/sec using llama.cpp's new MTP support. smooth. snappy.
but MoE has a secret: it only activates 4B parameters per token despite having 26B total. that's why it flies.
so the real question started haunting me. what if I throw a full, no tricks, every parameter fires on every token, 31B DENSE model at the same machine?
# Hardware:
GPU: NVIDIA RTX 4060, 8 GB VRAM
RAM: 16 GB
CPU: Intel Core i7 H
Laptop. Gaming. Modest.
The model: gemma-4-31B-it-qat-UD-Q4_K_XL.gguf (model's unsloth huggingface link in the comments)
This is Google DeepMind's flagship dense model in the Gemma 4 family that can run on a single consumer GPU. It packs a hybrid attention architecture, supports up to 256K context natively, and is QAT (Quantization Aware Training) optimized, meaning it retains far more quality than standard post training quants at the same bit depth.
This is NOT the MoE. This is 31 BILLION dense parameters, every single one of them loaded.
# the flags I used:
-m gemma-4-31B-it-qat-UD-Q4_K_XL.gguf -cnv --spec-type draft-mtp --spec-draft-model mtp-gemma-4-31B-it.gguf --spec-draft-n-max 8 --spec-draft-p-min 0.6 -c 6000 -v
Multi Token Prediction (MTP) is still active here. Separate draft GGUF required, same as the 26B setup.
# Results:
→ Decode: ~3 tokens/sec
→ Prefill: ~2 tokens/sec
→ Context: 6000 tokens
→ Hardware crying quietly in the corner: yes
so is 3 tps actually usable? For real time back and forth chat? Not ideal. You're not having a fluid conversation at 3 tps.
but slow ≠ useless. And this is where it gets genuinely interesting.
think about how devs actually work in a real team. juniors handle the routine work. But when something is architectural, deeply complex, or needs serious reasoning? they walk down the hall and escalate to the senior.
That's exactly the local AI agent architecture this unlocks:
→ Fast orchestrator model (Gemma 4 26B MoE at 25+ tps) handles routing, simple queries, tool calls, memory. The junior dev.
→ Gemma 4 31B dense is the senior, called only when the fast model genuinely hits a wall. Hard multi step reasoning. Complex code generation. Deep architectural decisions.
The agentic loop stays fast. Only the hard hops touch the 31B. That's a legitimate production grade local AI architecture on budget hardware. (requires 2 8gb gpus)
other workflows where 3 tps is completely fine:
- overnight batch jobs. summarize documents, extract structured data, review code. Fire it off. Sleep. wake up to results.
- One shot deep reasoning
- Silent code audit loops, you write and test, the 31B reviews diffs and flags issues in the background between your sprints
- Any workflow where output quality > output speed
A few weeks ago, nobody was running a 30B+ dense model on a single consumer GPU with 8 GB VRAM. At all. Now we're doing it on an Intel i7-H gaming laptop with an NVIDIA RTX 4060, thanks to llama.cpp + QAT quants + MTP speculative drafting.
Google DeepMind said the Gemma 4 31B targets "consumer GPUs and workstations." They were not exaggerating.
The hardware bar to run serious frontier class models locally keeps dropping. the tools are here. the models are here. you just have to be willing to abuse your laptop a little.
what workflows would you actually run on a local 3 tps 31B dense model? genuinely curious. drop it below.

Alok

63,095 views • 17 days ago
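The junior/senior escalation idea in the 31B post above can be sketched as a small router. This is one possible shape, not the author's setup: the `ESCALATE` sentinel, the `route` function, and the two stub models are hypothetical stand-ins for calls to a fast MoE model and a slow dense model.

```python
from dataclasses import dataclass
from typing import Callable

ESCALATE = "ESCALATE"  # hypothetical sentinel the fast model returns when it is stuck


@dataclass
class Reply:
    text: str
    model: str  # which tier produced the answer: "fast" or "slow"


def route(prompt: str,
          fast: Callable[[str], str],
          slow: Callable[[str], str]) -> Reply:
    """Try the fast model first; call the slow model only if the fast one escalates."""
    draft = fast(prompt)
    if draft.strip() == ESCALATE:
        return Reply(text=slow(prompt), model="slow")
    return Reply(text=draft, model="fast")


# Stub models so the sketch runs without any inference backend.
def fast_stub(prompt: str) -> str:
    # Pretend the fast model gives up on anything mentioning architecture.
    if "architecture" in prompt.lower():
        return ESCALATE
    return f"quick answer to: {prompt}"


def slow_stub(prompt: str) -> str:
    return f"careful answer to: {prompt}"


for question in ("rename this variable", "review the architecture of this service"):
    reply = route(question, fast_stub, slow_stub)
    print(f"[{reply.model}] {reply.text}")
```

Letting the fast model decide when to escalate keeps the common path cheap; a separate classifier or a rule on prompt length would be alternative triggers.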

Run Gemma 4 26b MTP on 8 GB VRAM GPUs at 25+ tokens/second. Flags included!
local llm space is moving at terminal velocity. only 3 days ago google released gemma 4 26b a4b qat quants. more efficient than before, ran on 8gb vram at 20 tok/sec. and now just a few hours ago, mainline llama.cpp merged a massive update and we just shattered our own record. decode throughput went 25-40% up on the same 8 GB VRAM setup!
Before MTP: 20 tps -> After MTP: 28 tps!
llama.cpp just officially merged PR #23398 ("add Gemma4 MTP"), bringing native Multi-Token Prediction (MTP) support to Gemma 4 models. By running speculative drafting on the same 8GB VRAM RTX 4060 setup, my decode throughput on a 64k context instantly leaped to a blistering 25–27 tokens/sec. that's a 25-35% increase with the same hardware.
Here is the architectural catch you need to know: Unlike the Qwen 3.5 and 3.6 series, which bake the MTP heads directly into the base GGUF, the Gemma 4 MTP head is not built in. You must download a separate, specialized MTP drafter GGUF (the assistant model) to act as the speculator. (I've dropped the download link in the replies).
copy and try the exact flags:
-m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf --spec-type draft-mtp --spec-draft-n-max 6 --spec-draft-p-min 0.7 --spec-draft-model gemma-4-26b-A4B-it-assistant-Q4_0.gguf -c 64000 -v
n-max 4 and p-min 0.7 is also worth checking out. benchmark on your setup and workflow.
if you have a single 8 gb vram nvidia rtx 4060, 3060, 3070, 2080, 2070, grab the MTP drafter GGUF link in the comments and try it yourself. Check it out even if you have a smaller or a larger gpu, such as a single rtx 3090, 4090, 3060, 2060.
MTP works for all gemma 4 sizes such as gemma 4 12b, gemma 4 31b etc. but remember to grab the correct mtp draft assistant models respectively.
what are you benchmarking today

Alok

200,913 views • 26 days ago
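A toy sketch of the draft-then-verify loop behind the speculative drafting described in the MTP post above. It is not llama.cpp's implementation. It assumes that `n_max` caps how many tokens the drafter proposes per step and that `p_min` stops drafting once the drafter's confidence drops below it, which is how the flag names in the post read. The word-list "models" are hypothetical stand-ins, and verification is greedy.

```python
from typing import Callable, List, Sequence, Tuple

DraftFn = Callable[[Sequence[str]], Tuple[str, float]]  # context -> (token, confidence)
TargetFn = Callable[[Sequence[str]], str]               # context -> greedy next token


def speculative_step(context: Sequence[str], draft_next: DraftFn,
                     target_next: TargetFn, n_max: int, p_min: float) -> List[str]:
    """Return the tokens accepted in one draft-then-verify step."""
    # Draft phase: the small model proposes up to n_max tokens,
    # stopping early when its confidence falls below p_min.
    drafted: List[str] = []
    ctx = list(context)
    for _ in range(n_max):
        token, confidence = draft_next(ctx)
        if confidence < p_min:
            break
        drafted.append(token)
        ctx.append(token)

    # Verify phase: the large model checks each proposal in order. A real
    # engine does this in one batched pass; the loop here is for clarity.
    accepted: List[str] = []
    ctx = list(context)
    for token in drafted:
        expected = target_next(ctx)
        if expected != token:
            accepted.append(expected)  # keep the target's token, drop the rest
            return accepted
        accepted.append(token)
        ctx.append(token)
    accepted.append(target_next(ctx))  # every draft matched: one extra target token
    return accepted


# Toy models: the drafter agrees with the target except at one position.
TARGET_TEXT = "the quick brown fox jumps over the lazy dog".split()
DRAFT_TEXT = "the quick brown cat jumps over the lazy dog".split()


def toy_target(context: Sequence[str]) -> str:
    i = len(context)
    return TARGET_TEXT[i] if i < len(TARGET_TEXT) else "<eos>"


def toy_draft(context: Sequence[str]) -> Tuple[str, float]:
    i = len(context)
    return (DRAFT_TEXT[i] if i < len(DRAFT_TEXT) else "<eos>"), 0.9


print(speculative_step([], toy_draft, toy_target, n_max=6, p_min=0.7))
```

In this toy run one step yields four tokens for a single verification pass, which is where the speedup comes from. The output always matches what the target model alone would have produced, so a poor drafter costs speed, not quality.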

Ready to prove Ethereum in seconds on a single GPU? Introducing ZKsync Airbender: The world’s fastest open-source RISC-V zkVM ⚡️

ZKsync

716,505 views • 1 year ago

ServiceNow-AI/Apriel-1.5-15b-Thinker running on a single GPU using `transformers serve` 🔥 great to have some very nice reasoning models that can run locally! next step, trying it on mps 👀

Lysandre

14,770 views • 9 months ago