Video yükleniyor...

Video Yüklenemedi

Ana Sayfaya Dön

A $351 MINI PC IS RUNNING 26-BILLION-PARAMETER AI MODELS AT 20 TOKENS/SEC AND HERMES AGENT ON TOP OF IT This is the Minisforum UM790 Pro. AMD Ryzen 9 7940HS, Radeon 780M iGPU, 48GB DDR5. The BIOS reports the GPU has 4GB of VRAM Here's the part people get wrong....

35,742 görüntüleme • 15 gün önce •via X (Twitter)

0 Yorum

Yorum bulunmuyor

Orijinal gönderinin yorumları burada görünecek

Benzer Videolar

a new 8GB VRAM GPU dense Local LLM leader was born yesterday runs on: RTX 4060 / RTX 3070 / RTX 2080. any 8GB card Qwen 3.5 9B (dense) was the go to for 6-8GB VRAM builds. Gemma 4 12B QAT (dense) just changed that. same llama.cpp + cuda 13.2. i7 12700H. 16GB RAM. same -ngl 99 flags. same 48k context. unsloth gemma-4-12b-it-Q4_K_M.gguf → 15 tok/sec @ 48k ctx unsloth gemma-4-12B-it-qat-UD-Q4_K_XL.gguf → 32 tok/sec @ 48k ctx → 26 tok/sec @ 64k ctx 64k context is a big deal. Hermes 3 agent requires 64k minimum to run. you're now getting full hermes compatible context on a budget consumer GPU at 26 tok/sec locally. 2.1x faster on identical hardware. and here's the part that breaks your brain: the QAT-UD-Q4_K_XL is actually SMALLER than the Q4_K_M "XL" why? QAT = Quantization Aware Training Google didn't train the model first and compress it later they trained it to be quantized from day one the weights already know how to survive low precision that's why you get more quality per byte llamacpp flags: -m gemma-4-12B-it-qat-UD-Q4_K_XL.gguf -cnv -ngl 99 -c 48000 -v fits in 8GB VRAM clean. no API. no cloud. no subscription. and this isn't even the MTP variant yet Gemma-4-E2B QAT runs on 3GB RAM, E4B on 5GB, 12B on 7GB, 26-A4B on 15GB and 31B on 18GB. I have benchmarked the 26b and 31b qat as well on a single RTX 4090, checkout the comments for details. If you have a 6GB or 8GB VRAM GPU, post your numbers. more benchmarks and configs coming soon

Alok

259,993 görüntüleme • 29 gün önce

six months ago this wasn't happening on 8gb vram. running unsloth's Q4_K_XL quant of gemma 4 26b-a4b-it-qat, a sparse MoE model with only 4b active params on a single rtx 4060 laptop gpu, 8gb vram, 20+ tok/s decode. no cloud, no api, no offload hacks. just a gaming laptop on battery. what makes it fit: google's QAT (quantization aware training), plus MTP (multi token prediction) support in the latest llama.cpp builds. that combo is the single biggest unlock for local inference on low vram. rtx 3060, rtx 3070, gtx 1070, gtx 1080, rtx 4050, rtx 4060, rtx 5050, rtx 5060 — any 6-8gb consumer gpu, old or new — this model runs on it. world cup season, so i told it to build a soccer themed flappy bird clone. one shot, zero iteration, fully playable. six months ago an 8gb model could barely clone vanilla flappy bird. now it's shipping a themed game from a sparse MoE model running locally on a laptop battery. inference benchmarks: - decode throughput: 30 tok/s - context: 64k. this is the real unlock. 64k ctx is what makes a hermes agent loop viable locally on this model, not just single-turn chat. llama.cpp flags: -m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf -c 64000 -cmoe --port 8080 game's deployed on my own site, built and shipped end to end with open source llm, zero closed source api dependency in the pipeline. link in the description. gguf weights on huggingface, link in the comments. pull it down, run it on whatever 8gb card is sitting in your rig. try the game and tell me your score and what you want in v2. local llms on consumer gpus stopped being a meme.

Alok

59,908 görüntüleme • 13 gün önce

I'm running Llama 4 Maverick at 620 t/s! I'm living in the future! Honestly, a large language model running this fast is something straight out of a sci-fi movie. Speeds like this will enable a whole new world of applications that aren't possible today. For reference, GPT-4o, which is probably the most popular OpenAI model, runs between 60 and 110 t/s. The secret here: I'm not running AI at Meta's Llama 4 Maverick on a GPU. I'm using the SambaNova Cloud (my sponsor) and their custom SN40L chips. They are optimized from the ground up for running AI workflows. Right now, SambaNova Cloud runs DeepSeek, Qwen, Whisper, and the entire family of Llama models on these chips. You can check the speed of each of these models using SambaNova Cloud's Playground (see the attached video). It's completely free, and that's how I'm measuring their speeds. For example, I also tried DeepSeek R1 (the latest version from May) and, oh boy! DeepSeek R1 is a huge 671B parameter model. It's probably the best open reasoning model in the world, and it runs at 140 tokens per second! !!! Inference time on an SN40L is night and day from what you'll get from a GPU. Here is why this is big: If you are running an agentic workflow that uses multiple models simultaneously on a GPU, it will need to swap models in and out of memory (because not every model fits). A single SNL40 chip can simultaneously hold over 100 models (trillions of parameters) in memory. If you are using open models, try the SambaCloud API to see what lightning speed looks like. Here is how: 1. Create a free account at: 2. Check the QuickStart guide: If you try the playground, check the speed you're getting with Llama 4 and DeepSeek, and post the results below. I've seen much higher numbers than I posted here, so I'm curious to see whether geography affects the speed.

Santiago

34,148 görüntüleme • 1 yıl önce