Alok

@analogalok • 1,226 subscribers

Mechatronics Engineer. AI belongs on your device. • Offline inference • No subscriptions. Teaching you to own your AI Intelligence Stack.

Shorts

This is the most hilarious thing I saw and did today.

Ran gemma-4-12B-coder-fable5-composer2.5-v1-GGUF locally with 8 GB VRAM at 20+ tok/sec.

Anthropic's Claude Fable 5 launched June 9. By June 12 it was banned. I can't access it. You can't either.

But here's the twist: I'm running a model trained on its chain of thought at 20 tok/s on my RTX 4060 8GB. Locally. Offline. No cloud. No export control.

Enter: Gemma4-12B-Coder GGUF (Q4_K_M)
Base: Google's gemma-4-12B-it
Fine-tuned on verifiable Python CoT data:
- Primary: Composer 2.5 real reasoning traces (only passing solutions kept)
- Auxiliary: Fable 5 used to redo the hard cases Composer missed

Every training example's reasoning led to code that actually ran. No hallucinated logic.

Llama.cpp flags:
-m gemma4-coding-Q4_K_M.gguf -cnv -ngl 44 -c 64000 -v
(Hugging Face model link in comments)

Flag breakdown:
-ngl 44 → offload 44 layers to GPU (tune this for your VRAM)
-c 64000 → 64K context window
-cnv → conversation/chat mode
-v → verbose output

The irony writes itself. Anthropic spent weeks telling the world Fable 5 (mythos) is too powerful to release. Then released it. Then got banned from serving it, including to their own researchers.

Meanwhile: a Gemma 4 12B fine-tune, trained on Fable 5's reasoning, runs fully offline on my mid-range consumer GPU. No API. No cloud. Just me and llama.cpp.

This is why local AI matters. Check out the model's link in the comments. How's your experience been with this model?
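The flags above can be wrapped in a small script so that the layer count is easy to tune. This is only a sketch: it assumes the llama.cpp binary is `llama-cli` and is on your PATH, and the variable names `MODEL`, `NGL`, and `CTX` and the function `llama_cmd` are made up for illustration. It prints the command rather than running it, so you can check it first.

```shell
# Sketch: assemble the llama.cpp command line from the flags in the post.
# Assumptions: binary name llama-cli; MODEL/NGL/CTX are hypothetical variable names.
MODEL="${MODEL:-gemma4-coding-Q4_K_M.gguf}"
NGL="${NGL:-44}"     # layers offloaded to the GPU; lower this if VRAM runs out
CTX="${CTX:-64000}"  # context window in tokens

llama_cmd() {
  # Print the command instead of executing it, so it can be inspected first.
  echo "llama-cli -m $MODEL -cnv -ngl $NGL -c $CTX -v"
}

llama_cmd
```

If the model fails to load with an out-of-memory error, lowering `NGL` a few layers at a time (or shrinking `CTX`) is the usual first thing to try.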

401,076 views

Videos

Autoregressive LLMs are officially on notice. Run the Gemma 4 26B diffusion GGUF with llama.cpp.

Google just dropped DiffusionGemma-26B, and it completely flips how we generate text. Instead of predicting tokens one by one, it generates 256 tokens in parallel using bidirectional attention. It's like Stable Diffusion, but for language. The model starts with random text "noise" and iteratively refines and self-corrects the entire block in real time to fix formatting and reasoning errors on the fly.

Since it's a Mixture of Experts (MoE) that only activates 3.8B parameters during inference, it fits on consumer hardware. You can run the Q4_K_M quant with an 18GB VRAM budget on a single RTX 3090 or RTX 4090 with exceptional throughput.

Tested on Ubuntu 22 with CUDA 13.1 using the cutting-edge experimental llama.cpp branch. Here is how to compile and run it with the live terminal denoising visualizer:

# 1. Clone & check out the experimental PR (#24423)
git clone <llama.cpp repo URL> && cd llama.cpp
git fetch origin pull/24423/head:diffusiongemma && git checkout diffusiongemma

# 2. Build with CUDA support
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=native
cmake --build build -j $(nproc) --config Release --target llama-diffusion-cli

# 3. Run with live visual denoising (llama.cpp flags)
./build/bin/llama-diffusion-cli \
  -m /path/to/diffusiongemma-26B-A4B-it-Q4_K_M.gguf \
  -ngl 99 -cnv -n 2048 --diffusion-visual

Watch the video below to see the live --diffusion-visual canvas iteratively denoising the prompt output in real time.

The guide and Unsloth's Hugging Face GGUF model links are in the comments below!

Is autoregressive generation officially legacy tech? Let me know what you think.
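As a rough sanity check on the 18GB figure, here is a back-of-the-envelope sketch. It assumes Q4_K_M averages about 4.9 bits per weight (an assumption; the real number varies by model) and counts all 26B parameters, since an MoE normally keeps every expert in memory even though only a few are active per token. The function name `weights_gb` is made up for illustration.

```shell
# Rough size of the quantized weights in GB (decimal).
# $1 = parameters in billions, $2 = average bits per weight (assumed, not measured).
weights_gb() {
  LC_ALL=C awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

weights_gb 26 4.9
```

Under these assumptions the weights come to about 15.9 GB, which would leave roughly 2 GB of the 18GB budget for the KV cache and compute buffers.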

Alok

52,656 views • 5 days ago
