Alok

@analogalok • 2,683 subscribers

Mechatronics Engineer AI belongs on your device. • Offline inference • No subscriptions. Teaching you to own your AI Intelligence Stack

Shorts

This is the most hilarious thing I saw and did today Ran gemma-4-12B-coder-fable5-composer2.5-v1-GGUF locally with 8 GB VRAM at 20+ tok/sec Anthropic's Claude Fable 5 launched June 9. By June 12 it was banned. I can't access it. You can't either. But here's the twist: I'm running a model trained on its chain of thought at 20 tok/s on my RTX 4060 8GB. Locally. Offline. No cloud. No export control. Enter: Gemma4-12B-Coder GGUF (Q4_K_M) Base: Google's gemma-4-12B-it Fine-tuned on verifiable Python CoT data: - Primary: Composer 2.5 real reasoning traces (only passing solutions kept) - Auxiliary: Fable 5 used to redo the hard cases Composer missed. Every training example's reasoning led to code that actually ran. No hallucinated logic. Llama.cpp flags: -m gemma4-coding-Q4_K_M.gguf -cnv -ngl 44 -c 64000 -v (huggingface model link in comments) Flag breakdown: -ngl 44 → offload 44 layers to GPU (tune this for your VRAM) -c 64000 → 64K context window -cnv → conversation/chat mode -v → verbose output The irony writes itself. Anthropic spent weeks telling the world Fable 5 (mythos) is too powerful to release. Then released it. Then got banned from serving it, including their own researchers. Meanwhile: a Gemma 4 12B fine tune, trained on Fable 5's reasoning, runs fully offline on my mid range consumer GPU No API. No cloud. Just me and llama.cpp. This is why local AI matters. Check out the model's link in the comments. How's your experience been with this model?

571,707 просмотров

I told you to claim your free 16GB NVIDIA GPU for learning Local LLMs. Now I’m going to show you how to double its inference speed without touching the hardware. Google Colab gives you an enterprise grade NVIDIA Tesla T4 GPU for free, roughly 4 hours every single day. It is the absolute perfect sandbox for learning AI engineering, testing inference flags, and pushing massive context windows. The local AI timeline is moving way too fast. If you aren't using Multi Token Prediction (MTP) yet, you are leaving massive performance on the table. I just pushed DeepMind’s Gemma 4 26B to 64.9 t/s on this exact free tier. Let's look at the raw benchmark data running on an Ubuntu Linux environment with the latest compiled llama.cpp binaries and quantized GGUFs from Unsloth via HuggingFace: # Qwen 3.5 9B (Dense): Base: [ Prompt: 626.7 t/s | Generation: 21.0 t/s ] With MTP: [ Prompt: 539.1 t/s | Generation: 24.8 t/s ] # Gemma 4 26B QAT (MoE): Base: [ Prompt: 634.2 t/s | Generation: 48.3 t/s ] With MTP: [ Prompt: 572.1 t/s | Generation: 64.9 t/s ] If you are paying attention, this single Colab notebook reveals 3 massive observations about the current state of local LLMs: # 1. The MTP Speedup (Software Overclocking) Standard autoregressive decoding guesses one token at a time. MTP acts like a highly optimized, built in speculative decoder. It predicts multiple future tokens at once and the main model verifies them in parallel. The result? Zero accuracy loss and a massive throughput increase. Gemma jumped from 48 to 65 t/s just by flipping a flag. # 2. The MoE Paradox (Bigger is Faster) How does a 26B parameter model absolutely destroy a 9B model in raw speed on the exact same hardware? Architecture. Qwen 3.5 9B is a dense model. it activates all 9 billion parameters for every single token. Gemma 4 26B is a Mixture of Experts (MoE) model. It routes data efficiently, activating only 4B parameters per token. You get the reasoning capabilities of a 26B model with the compute cost of a 4B model. 3. Thinking Efficiency When I ran the exact same complex prompt on both models, the larger MoE spent significantly fewer "thinking" tokens to arrive at the correct answer. A smarter model doesn't just give better answers; it gets to the point faster, saving you compute cycles and preserving your context window. # Want to run this yourself? Here are the exact llama.cpp CLI commands. For Qwen (MTP is baked into the main model): ./llama-cli -m Qwen3.5-9B-UD-Q4_K_XL.gguf -p "Explain quantum computing." -n 2000 -c 8000 -ngl 99 -fa on --spec-type draft-mtp --spec-draft-n-max 4 --spec-draft-p-min 0.7 For Gemma (Using a separate lightweight draft model): ./llama-cli -m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf --model-draft mtp-gemma-4-26B-A4B-it.gguf -p "Explain quantum computing." -n 2000 -c 8000 -ngl 99 -fa on --spec-type draft-mtp --spec-draft-n-max 4 --spec-draft-p-min 0.7 Stop waiting for a $3,000 rig. Boot up Colab, pull these models, and start building your stack. I’ve put together a completely free, cell by cell Google Colab notebook that automates this entire workflow so you can test it yourself in 5 minutes and learn. Link to the notebook is in the comments below. Experiemt with different MTP parameters, context windows and post your results in the comments.

170,442 просмотров

Free NVIDIA GPU with 16 GB VRAM GPU for Running Local LLMs! If you want to master local LLMs but you're waiting until you can afford a $1,500 GPU, you're honestly not going to make it. The open source AI ecosystem is moving way too fast for you to wait on your budget to catch up. Especially when you can build a bleeding edge inference engine from scratch right now, completely for free. You don't need a heavy local rig to start. Google is literally letting you use an enterprise grade NVIDIA Tesla T4 GPU for $0/hour. At standard cloud computing rates (~$0.20/hr), Google Colab’s 4 hour daily free tier hands you roughly $24 worth of data center tier GPU compute every single month. And most people just waste it. Let’s talk about the hardware you get access to for free. The NVIDIA Tesla T4 is an absolute workhorse: - Architecture: NVIDIA Turing (TU104) - VRAM: 16GB GDDR6 (320 GB/s bandwidth) - Compute: 320 Tensor Cores | 2560 CUDA Cores - Performance: 130 TOPS INT8 | 8.1 TFLOPS FP32 - Power: Sipping energy at a max 70W TDP This is the exact same hardware I used to run DeepMind's Gemma 4 26B A4B QAT MoE at a 250,000 context window without a single Out Of Memory (OOM) crash. If you have a web browser and 10 minutes, you have everything you need. I’ve put together a fully documented, cell by cell Google Colab notebook that teaches you exactly how to do this. Here is what the notebook actually teaches you: - How to provision an Ubuntu Linux environment with CUDA 13.0 and verify your driver stack. - How to pull the source code and compile the latest llama.cpp C++ binaries from scratch, specifically optimizing the build for your exact GPU using the -DCMAKE_CUDA_ARCHITECTURES=native flag. - How to directly download quantized local LLMs (GGUF format) straight from HuggingFace using the CLI. - How to manage 16GB VRAM limits, offload neural network layers to the GPU, and push massive context windows. Compile raw llama.cpp, ollama run a model, or spin up the LM Studio CLI. Pick whatever stack you are comfortable with. just start building. No hardware. No credit card. No excuses. Bookmark this post right now so you don't lose the tutorial. Even if you don't have time to run it today, you are going to want this workflow in your engineering toolkit. The link to the free Colab Notebook is in the comments below. Lemme know if you need more tutorials like this.

178,744 просмотров

Run Gemma 4 26B MoE on 8GB VRAM with 250k context at 20+ tokens/sec If you own any 8GB VRAM graphics card, stop what you are doing. Local AI just had its absolute "Holy Shit" moment for budget hardware. Yesterday, I benchmarked Unsloth Gemma 4 12B Q4_K_XL on an 8GB card. The community went wild but immediately demanded more: "Can we run a 25B+ model on budget GPUs?" Today, I’m delivering exactly that. I am running a massive 26B parameter Mixture of Experts (MoE) model locally on a standard 8GB VRAM setup with 250k full native context!. If you own an RTX 3060, 3070, 4060, or any budget GPU with 8GB of VRAM, the local AI paradigm has completely changed. The performance metrics are astonishing: - 20 tokens/sec flat decode throughput. - Stable, flat decode speed even with massive prompts. - I threw a 60k token prompt at it, and it still clocked in at 20 TPS without dropping a single frame. # What about prefill? Yes, Time To First Token (TTFT) is slightly high when swallowing massive contexts. But with a solid 200 tokens/sec prefill speed, the wait is barely noticeable and highly usable. And this is running completely without Multi Token Prediction (MTP) active. How is this possible? It’s the magic of Google's new QAT (Quantization Aware Training) quants for Gemma 4. The model weight file (unsloth gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf) is only 13.2 GB, making it the ultimate local powerhouse. # The Test Setup: CPU: Intel Core i7 RAM: 16GB System RAM GPU: NVIDIA GeForce RTX 4060 Laptop GPU (8GB VRAM) # The Secret Sauce (The -cmoe Flag) To make this work properly on any 8GB card, you must use the -cmoe (CPU MoE) flag in llama.cpp. This flag isolates the heavy MoE expert weights directly to system memory (CPU/RAM) while letting your GPU focus strictly on the Attention layers and the KV Cache. It prevents VRAM spillage and holds the throughput rock solid. # The flags: -m "gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf" -cmoe -c 248000 -v Once running, just open the UI on localhost and toggle the new reasoning lightbulb icon in the text input box to watch the model perform multi step thinking. Are you still running smaller models, or are you ready to scale up your budget local setups? Let's discuss in the replies

292,770 просмотров

i just ran Google's brand new Unsloth Gemma4 12B dense GGUF on my RTX 4060 using llama.cpp + CUDA 13.2 21 tokens per second. on a budget consumer GPU. locally. no API. no cloud. no subscription. and the benchmarks are absolutely cooked # first let's talk architecture because this is genuinely different every multimodal model you've used has a frozen vision encoder + frozen audio encoder + LLM backbone glued together Gemma 4 12B is different it's a single decoder only transformer. that's it. vision? raw 48×48 pixel patches → one matmul → projected directly into the LLM audio? raw 16kHz signal sliced into 40ms frames → linear projection → same LLM input space no encoder tax. no latency penalty. no fragmented memory to put the encoder savings in perspective: old Gemma 4 26B approach: - 550M param vision encoder (frozen) - 300M param audio encoder (frozen) - LLM backbone Gemma 4 12B: - 35M param vision embedder (a single matmul) - no audio encoder at all - LLM backbone handles EVERYTHING 550M → 35M for vision alone. that's a 15x reduction this is why the gemma-4-12b-it-Q4_K_M.gguf is just 6.6 GBs!!! and it has 256K native context context # Benchmarks: AIME 2026 (math olympiad): 77.5% GPQA Diamond (expert science): 78.8% LiveCodeBench v6 (real code): 72% Codeforces ELO: 1659 MMLU Pro: 77.2% MATH-Vision: 79.7% BigBench Extra Hard: 53% inference → llama.cpp, LM Studio, vLLM, SGLang llamacpp flags: -m "gemma-4-12b-it-Q4_K_M.gguf" -ngl 99 -c 8000 -v --port 8080 Available on huggingface now! Link below

279,768 просмотров

Run Gemma 4 26b MTP on 8 GB VRAM GPUs at 25+ tokens/second. Flags included! local llm space is moving at terminal velocity. only 3 days ago google released gemma 4 26b a4b qat quants. more efficient than before, ran on 8gb vram at 20 tok/sec. and now just a few hours ago, mainline llama.cpp merged a massive update and we just shattered our own record. decode throughput went 25-40% up on the same 8 GB VRAM setup! Before MTP: 20 tps -> After MTP: 28 tps! llama.cpp just officially merged PR #23398 ("add Gemma4 MTP"), bringing native Multi-Token Prediction (MTP) support to Gemma 4 models. By running speculative drafting on the same 8GB VRAM RTX 4060 setup, my decode throughput on a 64k context instantly leaped to a blistering 25–27 tokens/sec thats 25-30% increase with the same hardware. Here is the architectural catch you need to know: Unlike the Qwen 3.5 and 3.6 series, which bake the MTP heads directly into the base GGUF, the Gemma 4 MTP head is not built in. You must download a separate, specialized MTP drafter GGUF (the assistant model) to act as the speculator. (I've dropped the download link in the replies). copy and try the exact flags: -m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf --spec-type draft-mtp --spec-draft-n-max 6 --spec-draft-p-min 0.7 --spec-draft-model gemma-4-26b-A4B-it-assistant-Q4_0.gguf -c 64000 -v n-max 4 and p-min 0.7 is also worth checking out. benchmark on your setup and workflow. if you have a single 8 gb vram nvidia rtx 4060, 3060, 3070, 2080, 2070, grab the MTP drafter GGUF link in the comments and try it yourself. Check it out even if you have asmaller or a larger gpu, such as a single rtx 3090, 4090, 3060, 2060. MTP works for all gemma 4 sizes such as gemma 4 12b, gemma 4 31b etc. but remember to grab the correct mtp draft assistant models respectively. what are you benchmarking today

200,913 просмотров

gemma-4-12B-agentic-fable5-composer2.5 V2 is out. the agentic upgrade to the model trained on Fable 5's reasoning. Running it now with TurboQuant llama.cpp on a single RTX 4060( 8 GB VRAM) at 30 tokens/second with full 25000 context and reasoning: # The benchmarks v2 is built for coding + agentic work. writing code, running commands, using tools, debugging, multi step technical tasks. The clearest signal is tau2 bench telecom, an agentic tool use benchmark whose diagnose → fix → verify loop mirrors real terminal/debugging work: tau2 bench telecom numbers: base Gemma 4 12B: ~15% this finetune: ~55%. (Self reported) thats a huge jump # TheTom/llama-cpp-turboquant flags: llama-server.exe -m gemma4-v2-Q4_K_M.gguf -ngl 99 -c 25000 --cache-type-k q8_0 --cache-type-v turbo3 --port 8080 Flag breakdown: -ngl 99 → full GPU offload -c 25000 → 25K context --cache-type-k q8_0 --cache-type-v turbo3 → mixed-precision KV cache — K at 8-bit, V at ~3-bit via TurboQuant (Walsh Hadamard rotated polar quant, Google's own KV-compression research). Not even merged into mainline llama.cpp. running it off a fork. No API. No cloud. Just llama.cpp. well, a fork of it and any 6gb+ GPU. If you tried yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF, check this out and share your experience with the models

145,356 просмотров

90% of "AI developers" just download pre packaged GGUF files from Hugging Face, hit run, and call it a day. The top 10% know how to pull the raw safetensors, run the math, and quantize massive models into Q4_K_M themselves. If you think llama.cpp can only execute models, you’re missing the best part of the open source ecosystem. It’s a high performance optimization suite. Manually stripping 69% of the VRAM footprint off a brand new model architecture is where real infrastructure value is made. If you want to actually master local inference and deploy models like Google’s massive Gemma 4 12B it on consumer NVIDIA hardware using llama.cpp, you need to learn this pipeline. Let's build it. I just took the raw 22.7 GB Gemma 4 baseline and manually compressed it down to a 7.02 GB Q4_K_M GGUF artifact using llama.cpp. That is a 69% reduction in footprint. No quality loss. No VRAM bottlenecks. Just native, hardware accelerated C++ inference running a full 2,50,000 token context window on a dual NVIDIA Tesla T4 setup. Stop melting your VRAM on unoptimized weights and stop relying on other people's pipelines. Own your stack. I mapped this entire architecture from dynamic binary fetching to raw quantization and real time GPU streaming into a single, bulletproof notebook. Notebook link is in the comments below. Bookmark this blueprint for your next deployment and tell me which quantization works best for your workflow and model.

62,631 просмотров

I just got Gemma 4 26B A4B MoE model running fully locally with Hermes agent on an 8GB RTX 4060 and it's now backtesting trading strategies end to end, no hand holding. If you’re a trader or work on Wall Street, you don’t want to miss this. Yes. fully automated. No cloud. No APIs beyond market data. # Here's what I did: Setup: - Model: Gemma 4 26B-A4B QAT (MoE), Q4_K_XL Unsloth's quant (link in the comments) - Inference: llama.cpp (turboquant fork by Tom Turney link in the comments) - Hardware: RTX 4060, 8GB VRAM + 16GB RAM only (with 50 other chrome tabs open) - Context: 64K llama.cpp turboquant flags: -m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf -c 64000 --cache-type-k q8_0 --cache-type-v turbo3 --port 8080 turboquant helps achieve high prefill and decode throughput for interactive sessions. throughput with Hermes agent: decode: 25+ tokens/sec prefill: 250+ tokens/sec # Then I gave the agent one task: Backtest a strategy: - Buy when RSI crosses above 30 - Sell at +2% profit or -1% stoploss - No overlapping positions - Use Google stock via yfinance - Generate a full HTML report with candlestick charts + signals What happened next was wild. It didn't just write code, it ran the entire workflow itself: Audited the environment (pip list, dependency check) Hit a ModuleNotFoundError, multiple Python installs were conflicting Ran where python to map every interpreter on the system Manually selected the correct Python 3.13 path and re ran the script Wrote a clean statevmachine backtester (strict no overlapping trades logic) Patched a yfinance MultiIndex quirk that would've crashed the script Built Plotly candlestick + RSI charts with buy/sell markers Calculated win rate, PnL, and summary stats Exported a polished single file HTML report. check the report at the end of the video or in the comments. Biggest takeaway: local LLMs aren't just "chat assistants" anymore. They debug their own environment, write production code, and ship a finished deliverable on consumer hardware, for $0 in API costs. If you're still calling local models "toys," you're already behind. This is just the beginning. Hermes agent just surpassed 1 trillion tokens in a single day on OpenRouter. Think about the scale of total token generation happening right now. Disclaimer: This is not financial advice. Consult a professional before making any trading decisions.

105,094 просмотров

Right now, you may not have access to models like GPT‑5.6 Sol, GPT‑4.6 Terra, GPT‑5.6 Luna, Claude Mythos 5, or Claude Fable 5. But you can run something surprisingly powerful today, locally, and completely free. in the next 10 mins on your 8 GB VRAM gaming laptop. Gemma 4 26B A4B QAT (MoE) delivers strong performance on a standard 8 GB VRAM GPU using Ollama, with no API, no usage limits, and no external dependencies. Out of the box, it reaches around 20 tokens per second without any optimizations. Only one command in your terminal: Ollama run gemma4:26b This means: Full offline capability (privacy by default) Zero recurring cost Competitive performance for many real world tasks Fast enough for interactive use on cheap consumer hardware If you're waiting for cutting edge cloud models, you're missing what is already practical today: a capable, local LLM that runs entirely on your own machine.

65,251 просмотров

my 8 GB VRAM gaming laptop is absolutely going to hate me for this. but I still did it. ran a 31b dense model (Gemma 4 31b Q4) with only 8 GB VRAM last week I ran Gemma 4 26B A4B a mixture of experts model on my RTX 4060 and hit 25–28 tokens/sec using llama.cpp's new MTP support. smooth. snappy. but MoE has a secret: it only activates 4B parameters per token despite having 26B total. that's why it flies. so the real question started haunting me. what if I throw a full, no tricks, every parameter fires on every token, 31B DENSE model at the same machine? # Hardware: GPU: NVIDIA RTX 4060, 8 GB VRAM RAM: 16 GB CPU: Intel Core i7 H Laptop. Gaming. Modest. The model: gemma-4-31B-it-qat-UD-Q4_K_XL.gguf (model's unsloth huggingface link in the comments) This is Google DeepMind's flagship dense model in the Gemma 4 family that can run on single consumer GPU. It packs a hybrid attention architecture, supports up to 256K context natively, and is QAT (Quantization Aware Training) optimized, meaning it retains far more quality than standard post training quants at the same bit depth. This is NOT the MoE. This is 31 BILLION dense parameters, every single one of them loaded. # the flags I used: -m gemma-4-31B-it-qat-UD-Q4_K_XL.gguf -cnv --spec-type draft-mtp --spec-draft-model mtp-gemma-4-31B-it.gguf --spec-draft-n-max 8 --spec-draft-p-min 0.6 -c 6000 -v Multi Token Prediction (MTP) is still active here. Separate draft GGUF required, same as the 26B setup. # Results: → Decode: ~3 tokens/sec → Prefill: ~2 tokens/sec → Context: 6000 tokens → Hardware crying quietly in the corner: yes so is 3 tps actually usable? For real time back and forth chat? Not ideal. You're not having a fluid conversation at 3 tps. but slow ≠ useless. And this is where it gets genuinely interesting. think about how senior devs actually work in a real team. But when something is architectural, deeply complex, or needs serious reasoning? they walk down the hall and escalate to the senior. That's exactly the local AI agent architecture this unlocks: → Fast orchestrator model (Gemma 4 26B MoE at 25+ tps) handles routing, simple queries, tool calls, memory. The junior dev. → Gemma 4 31B dense is the senior, called only when the fast model genuinely hits a wall. Hard multi step reasoning. Complex code generation. Deep architectural decisions. The agentic loop stays fast. Only the hard hops touch the 31B. That's a legitimate production grade local AI architecture on a budget hardware. (requires 2 8gb gpus) other workflows where 3 tps is completely fine: - overnight batch jobs. summarize documents, extract structured data, review code. Fire it off. Sleep. wake up to results. - One shot deep reasoning - Silent code audit loops, you write and test, the 31B reviews diffs and flags issues in the background between your sprints - Any workflow where output quality > output speed A few weeks ago, nobody was running a 30B+ dense model on a single consumer GPU with 8 GB VRAM. At all. Now we're doing it on an Intel i7-H gaming laptop with a NVIDIA RTX 4060, thanks to llama.cpp + QAT quants + MTP speculative drafting. Google DeepMind said the Gemma 4 31B targets "consumer GPUs and workstations." They were not exaggerating. The hardware bar to run serious frontier class models locally keeps dropping. the tools are here. the models are here. you just have to be willing to abuse your laptop a little. what workflows would you actually run on a local 3 tps 31B dense model? genuinely curious. drop it below.

63,583 просмотров

you're paying $20/mo for something your $500 GPU can already do. Gemma 4 26B A4B QAT MoE + Hermes Agent running on a single RTX 4060 (8GB VRAM). Built a vision capable, 100% free, 100% local, private AI assistant that lives in my Chrome browser. No API keys. No cloud. No subscriptions. 100% vibe coded. 0% handholding. It has full context of whatever's on my screen can answer questions, summarize pages, extract data, and see images. Same local model handles everything, no external calls, ever. keep reading for the model and hermes agent tips i learnt while building this locally. Here's the exact setup for anyone running local LLMs on 6-8 GB VRAM: llama.cpp server flags (on my NVIDIA RTX 4060 8gb VRAM): -m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf --cache-type-k q8_0 --cache-type-v q8_0 -c 150000 --port 8080 Throughput with quantization: Prefill: 200-250 tokens/sec Decode: 20-25 tokens/sec reduce context if oom on 6 gb vram card. Key learnings: - Quantize KV cache to q8 for faster prefill/decode. Prefill goes from 100-150 (unquantized) to 200-250 tok/s (q8). - But watch out, once actual context grows past ~50k tokens on high entropy workloads, q8 KV quantization can cause hallucinations. Low entropy workloads are mostly unaffected. If you see it happening, drop the quantization. This is common across all local models. - In Hermes Agent settings -> Memory & Context, bump compression threshold from default 0.5 to 0.7. Default triggers way too frequent context compression and eats time. Up next: add persistent memory, web search, tool calling, streaming output and whatever you suggest. Running a 26B MoE with vision + 150k context window on 8GB VRAM would've sounded impossible 6 months ago. Works the same on the NVIDIA RTX 3060 Ti, 3070, 4060 Ti, 5060, 2080, or any 8GB card. VRAM is the only requirement. Local AI agents are closer than people think. You just need to know where the knobs are. Model's Unsloth quant hugging face link in the comments. Have you tried Hermes agent by Nous Research yet? What are you building with local LLMs? Drop it below, let's see what this community is shipping.

36,031 просмотров

Gemma 4 12B QAT (dense) achieves 1000+ tokens/sec prefill on 8GB VRAM with 120k context Gemma 4 12B QAT (dense), TurboQuant (Without MTP), RTX 4060 8GB VRAM: Prefill: 1000+ tok/s (42% increase) Decode: 25+ tok/s (25% increase) Context: 120k (150% increase) prefill was 700 tok/sec and decode 20 tok/sec with only 48k context without turbo quant (older test with mtp link in the comments) llama.cpp TurboQuant flags: -m gemma-4-12B-it-qat-UD-Q4_K_XL.gguf -c 120000 --cache-type-k q8_0 --cache-type-v turbo3 -ngl 99 --port 8080 tested with a 27k prompt, 120k context loaded. -ngl 99 here isn't a typo, full 12B dense, every layer on GPU, on an 8GB card. that's the part worth sitting with. The model has vision, audio input, thinking/reasoning and fits your 8GB card. TurboQuant's KV cache savings are what free up the room to do that at 120k context. side by side with yesterday: 26B A4B MoE got 320+ tok/s prefill. this dense 12B is clearing 1000+ rig: RTX 4060 8GB · i7H · 16GB RAM same two flags as yesterday, different model size: --cache-type-k q8_0 --cache-type-v turbo3 thanks to TheTom/llama-cpp-turboquant, TurboQuant fork of llama.cpp by Tom Turney (Tom Turney) to make this work. unsloth's model quant huggingface and the llama.cpp fork github link in the comments Do you prefer a dense or a MoE for your 8GB card?

34,500 просмотров

Claude fable 5 is cooked Claude fable 5 Vs Gemma 4 26b a4b qat MTP Gemma 4 26b running locally on my 8GB vram single RTX 4060 built this using three.js in a single session and 3 prompts. no cloud, no subscription 100% private unlimited use. how long until it can catch up completely?

30,802 просмотров

Testing the new Gemma 4 12B (QAT) vision and OCR capabilities locally with LM Studio. # The setup: - GPU: NVIDIA RTX 4060 (8GB VRAM) - CPU: Intel i7 - Runner: LM Studio - Config: 32k context, 38 layers offloaded, Flash Attention enabled - Speed: ~14 tokens/sec decode throughput # The test: I gave it a screenshot of Google AI Studio. Prompt: "clone this. give me a single html file" # The result: A solid one shot replication. It successfully mapped out the layout, recognized the UI text, and structured the divs correctly, with only minor differences from the original. Results available at the end of the video. Quite capable for a 12B model running on budget consumer hardware. A gpu that costs only $300. # Why the architecture under the hood is notable: Unlike traditional models that rely on heavy, separate vision and audio encoders, Gemma 4 12B uses a unified, encoder free architecture. It bypasses separate multi stage encoders. Uses a 35M parameter vision embedder to project raw 48x48 pixel patches directly to the LLM hidden dimension. Local multimodal development is becoming highly accessible on standard hardware. If you've spun up Gemma 4 12B locally, what setup are you using and what kind of throughput are you seeing?

25,717 просмотров

Open source AI is actually moving at an unhinged pace right now. I literally hadn't even finished typing up my last Gemma 4 12b benchmark notes before Google went ahead and dropped the official Quantization Aware Training (QAT) checkpoints on Hugging Face. If you missed the news, QAT basically bakes the compression directly into the training process. Instead of standard post training quantization degrading the model's reasoning capabilities, QAT trains the model with compression in mind. Unsloth is reporting near original performance at 4-bit with ~72% lower memory footprint. Details in the comments. Naturally, had to instantly pull the new GGUFs to see what a single RTX 4090 card (24 GB VRAM, Cuda 12.8, ubuntu 22) could do. i fired up llama.cpp engine again Look at these numbers: 1. Unsloth Gemma 4 26B-A4B IT (QAT Q4_K_XL) flags: ./build/bin/llama-cli -m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf -cnv -ngl 99 -c 250000 -fa on -v VRAM Used: 19.5 GB context: 250,000 tokens decode throughput: 193 tps 2. Unsloth Gemma 4 31B IT (QAT Q4_K_XL) flags: Command: ./build/bin/llama-cli -m gemma-4-31B-it-qat-UD-Q4_K_XL.gguf -cnv -ngl 99 -c 60000 -fa on -v - VRAM Used: 23 GB (Tight, but zero system RAM spillover) - context: 60,000 tokens - decode throughput: 47 tps We are essentially watching hardware bottlenecks evaporate in real time. An update literally drops before you can finish benchmarking the previous one. What a time to be running local hardware. If you have a single rtx 3090, rtx 4090, these are the latest gemma models to try this week.

26,841 просмотров

I just ran Gemma 4 31B on @CerebrasSystems at 1,800+ tokens/sec and it's multimodal. For context: that's 35x faster than a typical GPU endpoint, and the first token (reasoning included) lands in 1.5 seconds. This isn't a benchmark slide, I recorded the inference live. Prompt I used: "Create a simulation of an iPhone. Include at least one working dummy note taking app, a functional notification pulldown, high quality graphics, single HTML file, any libs via CDN." - Generation time: 3 seconds. - Notes app worked. - Notification panel worked. - Rendered first try. This is what wafer-scale inference unlocks, not just "faster," but a different category of product. When generation is this fast, you stop waiting and start iterating in real time. Why this matters: Gemma 4 31B is Google DeepMind's flagship open weight model, Apache 2.0 licensed, dense (not MoE), and built for efficiency over raw parameter count. It scores close to Claude Haiku 4.5 on the Artificial Analysis Intelligence Index (30 vs 29) but runs ~18x faster on Cerebras. It's also the first multimodal model on Cerebras's platform, meaning you can now feed it screenshots, documents, charts, and UI states at wafer scale speed. # Applications I'm most excited about: - Screenshot → Insight: Drop in a dashboard or document screenshot, get structured findings back instantly. no waiting, no batching. - Live UI generation: Full interactive interfaces (like my iPhone sim) generated and rendered in under 2 seconds. - Screenshot -> Patch: Feed it a broken UI + console error, get a minimal code fix and verification steps back. - Computer use & agentic loops: See -> reason -> act - verify, fast enough to keep a human in the loop instead of waiting on the model. - Long context summarization: Full research reports condensed into decision ready summaries you can read and requery in one sitting. The bigger unlock isn't the speed number itself, it's that agentic and multimodal loops (see -> reason -> output -> tool call -> verify -> retry) finally run in real time instead of feeling sluggish. As Logan Kilpatrick (Logan Kilpatrick) put it: "If every model was doing 2,000 tokens per second, you wouldn't build the same product and just have it be faster, you'd build different products." Gemma 4 31B is live now on Cerebras Inference Cloud in public preview. If you're building multimodal, agentic, or real time apps, this is worth testing today. What would you build with such insane inference throughput?

12,962 просмотров

Videos

LIVE

1.2k

Anya Rossi

sweetdream.ai

SweetDream.ai•Sponsored•Livecam

Streaming Now

Watch Anya Live

Anya is streaming live right now! Join her private show and enjoy exclusive content.

HD live stream

Exclusive private shows

1.2k viewers online

Current Status

Live

Private Show

Join now for exclusive access

Free preview available • Premium content