🔥Apple MLX first 6bit model is on Hugging Face!🔥 Qwen2.5-Coder-32B-Instruct-6bit! 3bit conversion and test in progress!

Video 8x below on M4 Max 40GPU:
- Prompt: 38 tokens, 61.731 tokens-per-sec
- Generation: 1181 tokens, 16.939 tokens-per-sec
- Peak memory: 25.122 GB
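As a rough sanity check on the figures in this post, the sketch below estimates weight storage for a 6-bit quantized model and the wall-clock time implied by the generation rate. It assumes a nominal 32B parameters (taken from the model name) and exactly 6 bits per weight; real quantized files also store per-group scales, and the KV cache adds more, so the reported 25.122 GB peak is plausibly a little above this estimate. The function names are illustrative only.

```python
def quantized_weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes), ignoring scales and other overhead."""
    return n_params * bits_per_weight / 8 / 1e9


def generation_seconds(n_tokens: int, tokens_per_sec: float) -> float:
    """Wall-clock time implied by a token count and a steady generation rate."""
    return n_tokens / tokens_per_sec


# Assumption: 32e9 parameters is the nominal size from the model name.
weights_gb = quantized_weight_gb(32e9, 6)
# Figures quoted in the post: 1181 generated tokens at 16.939 tokens-per-sec.
gen_s = generation_seconds(1181, 16.939)

print(f"{weights_gb:.1f} GB weights, {gen_s:.1f} s generation")
# prints "24.0 GB weights, 69.7 s generation"
```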

Ivan Fioravanti ᯅ

21,712 subscribers

46,494 views • 1 year ago • via X (Twitter)

Science & Technology, News & Politics

Related videos

MLX + M3 Ultra 512GB + Qwen3-235B-A22B-8bit = 🔥 "write a beautiful p5js particles animation that reacts to mouse clicks movements" Prompt: 22 tokens, 64.471 tokens-per-sec Generation: 6197 tokens, 18.916 tokens-per-sec Peak memory: 251.077 GB

Ivan Fioravanti ᯅ

24,160 views • 1 year ago

MLX MiniMax 2.5 running LOCALLY on a single M3 Ultra 512GB! Writing a poem on LLMs at 6bit quantization! 🔥 Let's start some coding, context and distributed tests! Generation: 40.2 tokens-per-sec Peak memory: 186 GB

Ivan Fioravanti ᯅ

226,126 views • 5 months ago

DeepSeek R1 Qwen 7B 4bit M2 Ultra vs M4 Max on Apple MLX 🤫 Let them think... (video 4x in center part) M2 Ultra: 114.9 tokens per sec M4 Max (14"): 88.3 tokens per sec

Ivan Fioravanti ᯅ

60,200 views • 1 year ago

/1 Gemma 4 31B just crushed Qwen 3.6 27B in a local LLM gamedev contest inside atomic.chat (prompt is below)

Device: MacBook Pro M5 Max, 64GB RAM

Results:
Qwen 3.6 27B: 32 tokens/sec · 18m 04s · 33,946 tokens
Gemma 4 31B: 27 tokens/sec · 3m 51s · 6,209 tokens

So what is more important: tokens per second, or the quality of the final answer? Qwen made a very long response and showed more creativity and visual style. But Gemma gave a shorter, clearer, and more logical answer in much less time.

In this one-shot Pac-Man gamedev contest, Gemma 4 31B was the clear winner. Its game logic was stronger: click reactions were smoother, and it handled interactions with elements like walls, ghosts, and particle effects better.

But this was only one test. Maybe Qwen 3.6 27B can show better results with better settings. Open the comments, try our prompt, and share your result below.
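The results above make the point that total time depends on how many tokens a model emits, not only on its rate. A minimal sketch that recomputes wall-clock time from the quoted token counts and rates is shown here; it assumes a constant decode rate and ignores prompt processing, which may be why it lands slightly under the reported 18m 04s and 3m 51s. The helper name is illustrative.

```python
def wall_clock(tokens: int, tokens_per_sec: float) -> str:
    """Format the time needed to emit `tokens` at a constant rate as 'Xm YYs'."""
    seconds = round(tokens / tokens_per_sec)
    return f"{seconds // 60}m {seconds % 60:02d}s"


# (output tokens, tokens/sec) as quoted in the post.
runs = {
    "Qwen 3.6 27B": (33_946, 32),
    "Gemma 4 31B": (6_209, 27),
}

for name, (tokens, tps) in runs.items():
    print(f"{name}: {wall_clock(tokens, tps)}")
# prints "Qwen 3.6 27B: 17m 41s" then "Gemma 4 31B: 3m 50s"
```

The slower model finishes in roughly a fifth of the time because it produces about 5.5x fewer tokens.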

Chubby♨️

72,566 views • 2 months ago

Fuck yeah! MaskGCT - New open SoTA Text to Speech model! 🔥
> Zero-shot voice cloning
> Emotional TTS
> Trained on 100K hours of data
> Long form synthesis
> Variable speed synthesis
> Bilingual - Chinese & English
> Available on Hugging Face

Fully non-autoregressive architecture:
> Stage 1: Predicts semantic tokens from text, using tokens extracted from a speech self-supervised learning (SSL) model
> Stage 2: Predicts acoustic tokens conditioned on the semantic tokens.

Synthesised: "Would you guys personally like to have a fake fireplace, an electric one, in your house? Or would you rather have a real fireplace? Let me know down below. Okay everybody, that's all for today's video and I hope you guys learned a bunch of furniture vocabulary!"

TTS scene keeps getting lit! 🐐

Vaibhav (VB) Srivastav

139,105 views • 1 year ago

AN AWS ENGINEER QUIETLY BUILT A 2 PETABYTE HOME SERVER FOR $9/MONTH THAT KILLS A $3,400/MONTH CLOUD STORAGE BILL the lenovo thinkstation pgx ships nvidia's gb10 grace blackwell superchip and 128gb of unified memory in a box the size of a mac mini at 1.2kg it runs an 80b qwen3 coder model at 25 to 40 tokens per second and a 196b step-3.5-flash moe model at 20 tokens per second locally the gb10 packs 6,144 cuda cores, 192 fifth-generation tensor cores and rates at 1 petaflop of fp4 with sparsity from a single 240 watt usb-c power supply fine tuning qwen 2.5 7b with lora took 18 minutes and 41gb of unified memory while the gpu pulled 65 watts and peaked at 77 degrees the box pulls a docker container from nvidia's registry and serves a frontier model on your local network with tool calling and zero data leaving your desk bookmark this and read the article below

starmex

192,758 views • 1 month ago

MTP speeds up Qwen by 2.5x in Atomic Chat

Dense vs MoE models on 2x RTX 5090:
Qwen3.6 27B: 51 → 117 tps, +137%
Qwen3.6 35B-A3B: 218 → 267 tps, +25%

MTP drafts several tokens ahead and verifies them in one pass. The speedup depends on memory moved per pass. Dense 27B reads all 27B params per token, MoE 35B-A3B only reads 3B active. Dense had way more to save by batching.

The baseline tps also differ (218 vs 51) for the same reason from the other side. Token generation is memory-bandwidth bound, and MoE moves ~8x less memory per token, so its baseline is already 4x ahead.

~80% draft acceptance. Zero accuracy loss. ~1 GB extra VRAM.

Open-source code and local AI app – in the comments 👇
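To make the "draft several tokens, verify in one pass" idea concrete, here is a small sketch of the usual expected-yield estimate for speculative decoding. It assumes each drafted token is accepted independently with the same probability, which is a simplification, and the draft length of 3 is a hypothetical value since the post does not state one. It also ignores the cost of producing the draft, so it is an upper bound on the speedup, not a prediction of the measured numbers.

```python
def expected_tokens_per_pass(accept_prob: float, draft_len: int) -> float:
    """Expected tokens emitted per verification pass.

    Counts the accepted prefix of the draft plus the one token the main
    model always contributes, assuming independent per-token acceptance.
    """
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if accept_prob == 1.0:
        return float(draft_len + 1)
    # Geometric series: 1 + a + a^2 + ... + a^draft_len
    return (1.0 - accept_prob ** (draft_len + 1)) / (1.0 - accept_prob)


# ~80% acceptance is quoted in the post; draft_len=3 is an assumption.
print(round(expected_tokens_per_pass(0.8, 3), 3))  # prints 2.952
```

Under these assumptions one pass over the weights yields close to three tokens instead of one, which is where a bandwidth-bound dense model gets its gain.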

atomic.chat

170,704 views • 2 months ago

Most recent diffusion language model research (that I’ve seen) seems to be using masking as the noising process. It looks like, however, most closed-source models (Google Gemini Diffusion and possibly Inception Labs’ Mercury) use a different noising process, where instead of masking tokens, they replace them with different tokens (either with a random token or a semantically similar token). I wondered how they were getting such high throughput with the latter noising process, since I believed that optimizing inference with KVCache approximation would be more difficult (for various reasons). I visualized this noising process with tiny-diffusion and compared it to normal unmasking, and was very surprised to see how fast the generation “settles” into a reasonable output, and then only slightly refines afterwards, requiring much fewer steps in total. Unmasking (where tokens are never remasked, the typical implementation) is inherently limited in generation speed by the fact that an increase in tokens decoded per step leads to more errors due to the mismatch between individual and marginal token probability distributions we sample from. The token replacement noising process seems to have a much different set of characteristics. Because we sample each token per step, every token makes “progress” towards the final output each iteration (in addition to *potentially* giving other tokens more information in future steps). Generally, masking has outperformed other noising processes, which is probably why most research focused on it (using smaller models). But the paper referred to in the retweet shows that random replacement as a noising process may scale better as model size increases. Big labs might have noticed these results much earlier (due to having drastically more training resources and being able to test larger models), which may explain the discrepancy in the choice of noising process. 
I’m gonna test this with larger models, since tiny-diffusion only has 10M parameters.
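The two noising processes contrasted above can be sketched in a few lines. This is only the forward (corruption) step on a toy token list, not the code from tiny-diffusion or from any of the models named; the `[MASK]` string, the uniform choice over the vocabulary, and the function names are assumptions made for illustration.

```python
import random

MASK = "[MASK]"  # placeholder mask token, chosen for this sketch


def mask_noise(tokens, noise_level, rng):
    """Masking process: each token is hidden with probability `noise_level`."""
    return [MASK if rng.random() < noise_level else tok for tok in tokens]


def replace_noise(tokens, noise_level, vocab, rng):
    """Replacement process: each token is swapped for a uniformly random
    vocabulary token with probability `noise_level`."""
    return [rng.choice(vocab) if rng.random() < noise_level else tok
            for tok in tokens]


rng = random.Random(0)
sentence = ["the", "cat", "sat", "on", "the", "mat"]
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran", "hat"]

print(mask_noise(sentence, 0.5, rng))
print(replace_noise(sentence, 0.5, vocab, rng))
```

With masking, the model can see exactly which positions are corrupted. With replacement, every position still holds a plausible-looking token, so the denoiser has to re-predict all of them at each step, which matches the post's observation that every token makes progress per iteration.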

nathan (in sf)

40,440 views • 6 months ago

This Chinese developer launched Llama 70B locally on a MacBook on a plane and for a full 11 hours without internet ran client projects. He was sitting by the window on a transatlantic flight with a MacBook Pro M4 with 64 GB of memory. WiFi on board cost $25 for the flight. He declined. No cloud API, no connection to Anthropic or OpenAI servers, no internet at all. Just a local Llama 3.3 70B on bf16 and his own orchestrator script. The model runs through llama.cpp. Generation speed, 71 tokens per second. Context around 60,000 tokens. Memory usage, 48.6 GiB out of 64. Battery at takeoff, 3 hours 21 minutes. And he gave the orchestrator this system prompt before takeoff: "You are an offline orchestrator running on a single MacBook. There is no network. The only resources you have are local files in /Users/dev/work, the Llama 70B inference server at localhost:8080, and a battery budget of 3 hours 21 minutes. Process the queue at /Users/dev/work/queue.jsonl (one client task per line). For each task: draft → run local evals → save artefact to /Users/dev/work/done/. Save context checkpoints every 12 tasks so you can resume after a battery swap. Stop only on empty queue or when battery drops below 5%." So the system knows exactly what resources it is running on. It knows it has no connection to the outside world for the next 11 hours. It knows it has finite memory and a finite battery. It knows the human will not intervene until the plane lands. The system runs in 1 loop. Takes a task from the queue, runs it through inference, saves the artifact, writes a checkpoint. Task after task, just like that. And only when the battery drops below 5% does the orchestrator automatically pause, waits for the laptop to switch to the backup power bank, and continues from the last checkpoint. 
Here is what the system actually writes in his log during the flight: "saved context checkpoint 8 of 12 (pos_min = 488, pos_max = 50118, size = 62.813 MiB)" "restored context checkpoint (pos_min = 488, pos_max = 50118)" "prompt processing progress: n_tokens = 50 / 60 818" "task 37016 done | tps = 71 s tokens text → /Users/dev/work/done/proposal_westside.md" Outside the window, clouds, blue sky, and no WiFi. On the tray, 1 MacBook, an open terminal on 2 screens, and an inference server on localhost. From what I have observed, this is the cleanest offline AI workflow I have seen in the past year: 11 hours of flight, $0 for WiFi, and the entire client queue closed before landing.
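The loop described in the post (take a task from the queue, run inference, save the artifact, checkpoint periodically, resume later) might look roughly like the sketch below. This is not the developer's script: the task schema with `id` and `prompt` fields, the `checkpoint.json` file, and the `infer` callback are all hypothetical, the checkpoint here records queue progress only (not model context), and the battery check is left out.

```python
import json
from pathlib import Path
from typing import Callable


def process_queue(queue_path: Path, done_dir: Path,
                  infer: Callable[[str], str],
                  checkpoint_every: int = 12) -> int:
    """Process a JSONL task queue, resuming from the last saved checkpoint.

    Returns the number of tasks processed in this call.
    """
    done_dir.mkdir(parents=True, exist_ok=True)
    checkpoint = done_dir / "checkpoint.json"  # hypothetical file name

    # Resume from the line index stored by a previous run, if any.
    start = 0
    if checkpoint.exists():
        start = json.loads(checkpoint.read_text())["next_line"]

    lines = queue_path.read_text().splitlines()
    processed = 0
    for i in range(start, len(lines)):
        if not lines[i].strip():
            continue  # skip blank lines in the queue
        task = json.loads(lines[i])        # assumed schema: {"id", "prompt"}
        artifact = infer(task["prompt"])   # e.g. a call to a local server
        (done_dir / f"{task['id']}.md").write_text(artifact)
        processed += 1
        if processed % checkpoint_every == 0:
            checkpoint.write_text(json.dumps({"next_line": i + 1}))

    # Queue exhausted: record that nothing is left to do.
    checkpoint.write_text(json.dumps({"next_line": len(lines)}))
    return processed
```

In practice `infer` would wrap an HTTP request to the inference server the post places at localhost:8080; passing it in as a callback keeps the loop testable without a model running.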

Blaze

1,839,572 views • 2 months ago

you're paying $20/mo for something your $500 GPU can already do.

Gemma 4 26B A4B QAT MoE + Hermes Agent running on a single RTX 4060 (8GB VRAM).

Built a vision capable, 100% free, 100% local, private AI assistant that lives in my Chrome browser. No API keys. No cloud. No subscriptions. 100% vibe coded. 0% handholding.

It has full context of whatever's on my screen: it can answer questions, summarize pages, extract data, and see images. Same local model handles everything, no external calls, ever.

keep reading for the model and hermes agent tips i learnt while building this locally.

Here's the exact setup for anyone running local LLMs on 6-8 GB VRAM:

llama.cpp server flags (on my NVIDIA RTX 4060 8gb VRAM):
-m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf --cache-type-k q8_0 --cache-type-v q8_0 -c 150000 --port 8080

Throughput with quantization:
Prefill: 200-250 tokens/sec
Decode: 20-25 tokens/sec
reduce context if oom on 6 gb vram card.

Key learnings:
- Quantize KV cache to q8 for faster prefill/decode. Prefill goes from 100-150 (unquantized) to 200-250 tok/s (q8).
- But watch out, once actual context grows past ~50k tokens on high entropy workloads, q8 KV quantization can cause hallucinations. Low entropy workloads are mostly unaffected. If you see it happening, drop the quantization. This is common across all local models.
- In Hermes Agent settings -> Memory & Context, bump compression threshold from default 0.5 to 0.7. Default triggers way too frequent context compression and eats time.

Up next: add persistent memory, web search, tool calling, streaming output and whatever you suggest.

Running a 26B MoE with vision + 150k context window on 8GB VRAM would've sounded impossible 6 months ago. Works the same on the NVIDIA RTX 3060 Ti, 3070, 4060 Ti, 5060, 2080, or any 8GB card. VRAM is the only requirement.

Local AI agents are closer than people think. You just need to know where the knobs are.

Model's Unsloth quant hugging face link in the comments.

Have you tried Hermes agent by Nous Research yet? What are you building with local LLMs? Drop it below, let's see what this community is shipping.
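To see why the KV cache type matters at a 150k context, the sketch below estimates cache size for a plain full-attention transformer. The layer count, KV head count, and head dimension used in the example are placeholders and not Gemma 4's actual configuration; models with hybrid or sliding-window attention need less than this estimate. The 8.5 bits per value for q8_0 reflects its block layout (32 int8 values plus one 16-bit scale).

```python
F16_BITS = 16.0
Q8_0_BITS = 8.5  # 34 bytes per block of 32 values


def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 n_ctx: int, bits_per_value: float) -> float:
    """Approximate KV cache size in GiB for full attention on every layer."""
    values = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # keys and values
    return values * bits_per_value / 8 / 2**30


# Placeholder dimensions for illustration only.
dims = dict(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=150_000)

f16 = kv_cache_gib(**dims, bits_per_value=F16_BITS)
q8 = kv_cache_gib(**dims, bits_per_value=Q8_0_BITS)
print(f"q8_0 cache is {q8 / f16:.0%} of the f16 cache")
```

Whatever the real dimensions are, the ratio is fixed: a q8_0 cache takes a little over half the memory of an f16 cache, which is the headroom the post relies on when it suggests reducing context on smaller cards.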

Alok

36,031 views • 26 days ago

Run Gemma 4 26B MoE on 8GB VRAM with 250k context at 20+ tokens/sec

If you own any 8GB VRAM graphics card, stop what you are doing. Local AI just had its absolute "Holy Shit" moment for budget hardware.

Yesterday, I benchmarked Unsloth Gemma 4 12B Q4_K_XL on an 8GB card. The community went wild but immediately demanded more: "Can we run a 25B+ model on budget GPUs?"

Today, I’m delivering exactly that. I am running a massive 26B parameter Mixture of Experts (MoE) model locally on a standard 8GB VRAM setup with 250k full native context! If you own an RTX 3060, 3070, 4060, or any budget GPU with 8GB of VRAM, the local AI paradigm has completely changed.

The performance metrics are astonishing:
- 20 tokens/sec flat decode throughput.
- Stable, flat decode speed even with massive prompts.
- I threw a 60k token prompt at it, and it still clocked in at 20 TPS without dropping a single frame.

# What about prefill?
Yes, Time To First Token (TTFT) is slightly high when swallowing massive contexts. But with a solid 200 tokens/sec prefill speed, the wait is barely noticeable and highly usable. And this is running completely without Multi Token Prediction (MTP) active.

How is this possible? It’s the magic of Google's new QAT (Quantization Aware Training) quants for Gemma 4. The model weight file (unsloth gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf) is only 13.2 GB, making it the ultimate local powerhouse.

# The Test Setup:
CPU: Intel Core i7
RAM: 16GB System RAM
GPU: NVIDIA GeForce RTX 4060 Laptop GPU (8GB VRAM)

# The Secret Sauce (The -cmoe Flag)
To make this work properly on any 8GB card, you must use the -cmoe (CPU MoE) flag in llama.cpp. This flag keeps the heavy MoE expert weights in system memory (CPU/RAM) while letting your GPU focus strictly on the Attention layers and the KV Cache. It prevents VRAM spillage and holds the throughput rock solid.

# The flags:
-m "gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf" -cmoe -c 248000 -v

Once running, just open the UI on localhost and toggle the new reasoning lightbulb icon in the text input box to watch the model perform multi step thinking.

Are you still running smaller models, or are you ready to scale up your budget local setups? Let's discuss in the replies

Alok

292,770 views • 1 month ago

my 8 GB VRAM gaming laptop is absolutely going to hate me for this. but I still did it. ran a 31b dense model (Gemma 4 31b Q4) with only 8 GB VRAM

last week I ran Gemma 4 26B A4B, a mixture of experts model, on my RTX 4060 and hit 25–28 tokens/sec using llama.cpp's new MTP support. smooth. snappy. but MoE has a secret: it only activates 4B parameters per token despite having 26B total. that's why it flies.

so the real question started haunting me. what if I throw a full, no tricks, every parameter fires on every token, 31B DENSE model at the same machine?

# Hardware:
GPU: NVIDIA RTX 4060, 8 GB VRAM
RAM: 16 GB
CPU: Intel Core i7 H
Laptop. Gaming. Modest.

The model: gemma-4-31B-it-qat-UD-Q4_K_XL.gguf (model's unsloth huggingface link in the comments)

This is Google DeepMind's flagship dense model in the Gemma 4 family that can run on a single consumer GPU. It packs a hybrid attention architecture, supports up to 256K context natively, and is QAT (Quantization Aware Training) optimized, meaning it retains far more quality than standard post training quants at the same bit depth. This is NOT the MoE. This is 31 BILLION dense parameters, every single one of them loaded.

# the flags I used:
-m gemma-4-31B-it-qat-UD-Q4_K_XL.gguf -cnv --spec-type draft-mtp --spec-draft-model mtp-gemma-4-31B-it.gguf --spec-draft-n-max 8 --spec-draft-p-min 0.6 -c 6000 -v

Multi Token Prediction (MTP) is still active here. Separate draft GGUF required, same as the 26B setup.

# Results:
→ Decode: ~3 tokens/sec
→ Prefill: ~2 tokens/sec
→ Context: 6000 tokens
→ Hardware crying quietly in the corner: yes

so is 3 tps actually usable? For real time back and forth chat? Not ideal. You're not having a fluid conversation at 3 tps. but slow ≠ useless. And this is where it gets genuinely interesting.

think about how senior devs actually work in a real team. But when something is architectural, deeply complex, or needs serious reasoning? they walk down the hall and escalate to the senior.

That's exactly the local AI agent architecture this unlocks:
→ Fast orchestrator model (Gemma 4 26B MoE at 25+ tps) handles routing, simple queries, tool calls, memory. The junior dev.
→ Gemma 4 31B dense is the senior, called only when the fast model genuinely hits a wall. Hard multi step reasoning. Complex code generation. Deep architectural decisions.

The agentic loop stays fast. Only the hard hops touch the 31B. That's a legitimate production grade local AI architecture on budget hardware. (requires 2 8gb gpus)

other workflows where 3 tps is completely fine:
- overnight batch jobs. summarize documents, extract structured data, review code. Fire it off. Sleep. wake up to results.
- One shot deep reasoning
- Silent code audit loops, you write and test, the 31B reviews diffs and flags issues in the background between your sprints
- Any workflow where output quality > output speed

A few weeks ago, nobody was running a 30B+ dense model on a single consumer GPU with 8 GB VRAM. At all. Now we're doing it on an Intel i7-H gaming laptop with an NVIDIA RTX 4060, thanks to llama.cpp + QAT quants + MTP speculative drafting.

Google DeepMind said the Gemma 4 31B targets "consumer GPUs and workstations." They were not exaggerating. The hardware bar to run serious frontier class models locally keeps dropping.

the tools are here. the models are here. you just have to be willing to abuse your laptop a little.

what workflows would you actually run on a local 3 tps 31B dense model? genuinely curious. drop it below.

Alok

63,583 views • 1 month ago
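
The junior/senior escalation pattern from the post above can be sketched in a few lines. This is a minimal illustration, not the author's setup: the `Reply.confident` flag, the keyword heuristic, and the `fake_fast` / `fake_slow` stand-ins are all hypothetical and replace real calls to the 26B MoE and 31B dense models.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Reply:
    text: str
    confident: bool  # hypothetical signal that the fast model can handle the task


def route(prompt: str,
          fast: Callable[[str], Reply],
          slow: Callable[[str], str]) -> tuple[str, str]:
    """Try the fast model first; escalate only when it signals low confidence."""
    first = fast(prompt)
    if first.confident:
        return "fast", first.text
    return "slow", slow(prompt)


# Stand-in models so the sketch runs without any inference server.
def fake_fast(prompt: str) -> Reply:
    # Assumed heuristic: treat a few keywords as "hard" work for the senior model.
    hard = any(k in prompt.lower() for k in ("architecture", "refactor", "prove"))
    return Reply(text=f"[26B MoE] {prompt}", confident=not hard)


def fake_slow(prompt: str) -> str:
    return f"[31B dense] {prompt}"


easy = route("What time zone is UTC+1?", fake_fast, fake_slow)
hard = route("Design the architecture for a sync engine", fake_fast, fake_slow)
print(easy[0], hard[0])
```

In a real agent the confidence signal would have to come from somewhere concrete (a tool-call failure, a retry budget, an explicit "escalate" tool), which is the main design decision this sketch leaves open.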

MiniMax M3 just dropped, their first natively multimodal model. So I ran it through my form-filling test. (The model has to place each element at the right pixel position on a blank form image, not type into a field.)

Verdict: it got everything on the paper.
> Name, DOB, ID, gender, marital status, nationality, email, phone, address, postal code: all there.
> Best character spacing I've seen yet: it actually calculates the gap between each character, clean across the DOB and number boxes.
> A few fields slightly misaligned, but every piece of data made it onto the form.

The reasoning chain is the interesting part: it does the easy fields first, then works into the tight one-char-per-box fields, reasoning through y-coordinates, baselines, and label clearance in obsessive detail.

The cost: 40:33 and 126.7k output tokens. That's a long think, but it's MiniMax's first multimodal model, and it nailed the content.

stevibe

27,383 views • 1 month ago

I just ran Gemma 4 31B on @CerebrasSystems at 1,800+ tokens/sec, and it's multimodal. For context: that's 35x faster than a typical GPU endpoint, and the first token (reasoning included) lands in 1.5 seconds. This isn't a benchmark slide; I recorded the inference live.

Prompt I used: "Create a simulation of an iPhone. Include at least one working dummy note taking app, a functional notification pulldown, high quality graphics, single HTML file, any libs via CDN."
- Generation time: 3 seconds.
- Notes app worked.
- Notification panel worked.
- Rendered first try.

This is what wafer-scale inference unlocks: not just "faster," but a different category of product. When generation is this fast, you stop waiting and start iterating in real time.

Why this matters: Gemma 4 31B is Google DeepMind's flagship open-weight model, Apache 2.0 licensed, dense (not MoE), and built for efficiency over raw parameter count. It scores close to Claude Haiku 4.5 on the Artificial Analysis Intelligence Index (30 vs 29) but runs ~18x faster on Cerebras. It's also the first multimodal model on Cerebras's platform, meaning you can now feed it screenshots, documents, charts, and UI states at wafer-scale speed.

# Applications I'm most excited about:
- Screenshot → Insight: Drop in a dashboard or document screenshot, get structured findings back instantly. No waiting, no batching.
- Live UI generation: Full interactive interfaces (like my iPhone sim) generated and rendered in under 2 seconds.
- Screenshot → Patch: Feed it a broken UI + console error, get a minimal code fix and verification steps back.
- Computer use & agentic loops: See → reason → act → verify, fast enough to keep a human in the loop instead of waiting on the model.
- Long-context summarization: Full research reports condensed into decision-ready summaries you can read and requery in one sitting.

The bigger unlock isn't the speed number itself; it's that agentic and multimodal loops (see → reason → output → tool call → verify → retry) finally run in real time instead of feeling sluggish. As Logan Kilpatrick put it: "If every model was doing 2,000 tokens per second, you wouldn't build the same product and just have it be faster, you'd build different products."

Gemma 4 31B is live now on Cerebras Inference Cloud in public preview. If you're building multimodal, agentic, or real-time apps, this is worth testing today. What would you build with such insane inference throughput?

Alok

12,962 views • 29 days ago
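
To make the "agentic loops finally run in real time" point from the post above concrete, here is a rough back-of-the-envelope sketch. Only the 1,800 tokens/sec, 35x, and 1.5-second figures come from the post; the step count, tokens per step, and the choice to apply the same time-to-first-token to the baseline are assumptions made purely for illustration.

```python
def loop_seconds(steps: int, tokens_per_step: int, tps: float, ttft_s: float) -> float:
    """Wall time for a loop of model calls: each call pays TTFT plus decode time."""
    return steps * (ttft_s + tokens_per_step / tps)


FAST_TPS = 1800.0             # throughput quoted in the post
BASELINE_TPS = FAST_TPS / 35  # "35x faster" implies a baseline of roughly 51 tokens/sec

# Assumed workload: a 6-step loop emitting 500 tokens per step.
fast = loop_seconds(steps=6, tokens_per_step=500, tps=FAST_TPS, ttft_s=1.5)
slow = loop_seconds(steps=6, tokens_per_step=500, tps=BASELINE_TPS, ttft_s=1.5)
print(f"fast loop: {fast:.1f} s, baseline loop: {slow:.1f} s")
```

Under these assumptions the time-to-first-token dominates the fast loop, so past a certain throughput the latency per call matters more than the decode rate.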

Run Gemma 4 26B MTP on 8 GB VRAM GPUs at 25+ tokens/second. Flags included!

The local LLM space is moving at terminal velocity. Only 3 days ago Google released the Gemma 4 26B A4B QAT quants. More efficient than before, they ran on 8 GB VRAM at 20 tok/sec. And now, just a few hours ago, mainline llama.cpp merged a massive update and we just shattered our own record. Decode throughput went up 25-40% on the same 8 GB VRAM setup!

Before MTP: 20 tps → After MTP: 28 tps!

llama.cpp just officially merged PR #23398 ("add Gemma4 MTP"), bringing native Multi-Token Prediction (MTP) support to Gemma 4 models. By running speculative drafting on the same 8 GB VRAM RTX 4060 setup, my decode throughput on a 64k context instantly leaped to a blistering 25–27 tokens/sec. That's a 25–35% increase over the 20 tps baseline on the same hardware.

Here is the architectural catch you need to know: unlike the Qwen 3.5 and 3.6 series, which bake the MTP heads directly into the base GGUF, the Gemma 4 MTP head is not built in. You must download a separate, specialized MTP drafter GGUF (the assistant model) to act as the speculator. (I've dropped the download link in the replies.)

Copy and try the exact flags:
-m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf --spec-type draft-mtp --spec-draft-n-max 6 --spec-draft-p-min 0.7 --spec-draft-model gemma-4-26b-A4B-it-assistant-Q4_0.gguf -c 64000 -v

n-max 4 with p-min 0.7 is also worth checking out. Benchmark on your own setup and workflow.

If you have a single 8 GB VRAM NVIDIA RTX 4060, 3060, 3070, 2080, or 2070, grab the MTP drafter GGUF link in the comments and try it yourself. Check it out even if you have a smaller or a larger GPU, such as a single RTX 3090, 4090, 3060, or 2060. MTP works for all Gemma 4 sizes, such as Gemma 4 12B and Gemma 4 31B, but remember to grab the matching MTP draft assistant model for each.

What are you benchmarking today?

Alok

200,913 views • 1 month ago
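
The percentage claims in the post above are easy to check against its own throughput figures. A small sketch, using only the 20 tps baseline and the 25, 27, and 28 tps results quoted there:

```python
def speedup_pct(before_tps: float, after_tps: float) -> float:
    """Relative decode speedup in percent."""
    return (after_tps / before_tps - 1.0) * 100.0


# Baseline and post-MTP figures as quoted in the post.
for after in (25, 27, 28):
    print(f"20 -> {after} tps: +{speedup_pct(20, after):.0f}%")
```

The 25–27 tps range therefore corresponds to roughly 25–35%, and the 28 tps peak to about 40%.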

🧑‍🚀 Day 5 of the Cursor #vibejam

Proudly sponsored by Cursor + bolt.new + GLIF

Today a little bit about Cursor 3. It's a completely rebuilt interface designed around agents: you can run multiple AI agents in parallel, hand off between local and cloud, and go from commit to merged PR. It's powered by Composer 2, their own frontier coding model that tops benchmarks at $0.50/M input tokens, and the Fast variant runs at 200 tokens/sec, more than 2x faster than other frontier models. Perfect for vibe coding your game jam entry. And edwin from Cursor is here to help you if you get stuck!

Lots of activity on here again today. The best games (or, well, previews) I saw today:
🍣 Sushi Belt by steeno
🌊 Surfing Simulator by Erik (this one is a basic demo but very promising, as it has realistic water physics)
🦆 Grand Quack Auto by kyzo (or, as I like to call it, Bali Simulator)
🏛️ Roman Combat Sim by terry trusner (continued from yesterday because it's had a lot of progress, I think)

YOU HAVE 24 DAYS LEFT!

Reply in this thread with updates on your current games to share your progress, and add the tag #vibejam so I see it and can include you in the daily tweet.

There's $35,000 in prizes for you to win; see the threads below for more info. The Gold prize is $20,000, bronze is $10,000 and silver is $5,000!

Want to participate? You can still start today and submit your game any time before May 1!

@levelsio

79,700 views • 3 months ago

I told you to claim your free 16GB NVIDIA GPU for learning local LLMs. Now I'm going to show you how to boost its inference speed by up to about a third without touching the hardware.

Google Colab gives you an enterprise-grade NVIDIA Tesla T4 GPU for free, roughly 4 hours every single day. It is the absolute perfect sandbox for learning AI engineering, testing inference flags, and pushing massive context windows.

The local AI timeline is moving way too fast. If you aren't using Multi Token Prediction (MTP) yet, you are leaving massive performance on the table. I just pushed DeepMind's Gemma 4 26B to 64.9 t/s on this exact free tier.

Let's look at the raw benchmark data, running on an Ubuntu Linux environment with the latest compiled llama.cpp binaries and quantized GGUFs from Unsloth via Hugging Face:

# Qwen 3.5 9B (Dense):
Base: [ Prompt: 626.7 t/s | Generation: 21.0 t/s ]
With MTP: [ Prompt: 539.1 t/s | Generation: 24.8 t/s ]

# Gemma 4 26B QAT (MoE):
Base: [ Prompt: 634.2 t/s | Generation: 48.3 t/s ]
With MTP: [ Prompt: 572.1 t/s | Generation: 64.9 t/s ]

If you are paying attention, this single Colab notebook reveals 3 massive observations about the current state of local LLMs:

# 1. The MTP Speedup (Software Overclocking)
Standard autoregressive decoding generates one token at a time. MTP acts like a highly optimized, built-in speculative decoder. It predicts multiple future tokens at once and the main model verifies them in parallel. The result? Zero accuracy loss and a massive throughput increase. Gemma jumped from 48 to 65 t/s (about 34%) just by flipping a flag; Qwen gained about 18%.

# 2. The MoE Paradox (Bigger is Faster)
How does a 26B parameter model absolutely destroy a 9B model in raw speed on the exact same hardware? Architecture. Qwen 3.5 9B is a dense model: it activates all 9 billion parameters for every single token. Gemma 4 26B is a Mixture of Experts (MoE) model. It routes data efficiently, activating only 4B parameters per token. You get the reasoning capabilities of a 26B model with the compute cost of a 4B model.

# 3. Thinking Efficiency
When I ran the exact same complex prompt on both models, the larger MoE spent significantly fewer "thinking" tokens to arrive at the correct answer. A smarter model doesn't just give better answers; it gets to the point faster, saving you compute cycles and preserving your context window.

# Want to run this yourself?
Here are the exact llama.cpp CLI commands.

For Qwen (MTP is baked into the main model):
./llama-cli -m Qwen3.5-9B-UD-Q4_K_XL.gguf -p "Explain quantum computing." -n 2000 -c 8000 -ngl 99 -fa on --spec-type draft-mtp --spec-draft-n-max 4 --spec-draft-p-min 0.7

For Gemma (using a separate lightweight draft model):
./llama-cli -m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf --model-draft mtp-gemma-4-26B-A4B-it.gguf -p "Explain quantum computing." -n 2000 -c 8000 -ngl 99 -fa on --spec-type draft-mtp --spec-draft-n-max 4 --spec-draft-p-min 0.7

Stop waiting for a $3,000 rig. Boot up Colab, pull these models, and start building your stack. I've put together a completely free, cell-by-cell Google Colab notebook that automates this entire workflow so you can test it yourself in 5 minutes and learn. The link to the notebook is in the comments below. Experiment with different MTP parameters and context windows, and post your results in the comments.

Alok

170,442 views • 16 days ago
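
The "draft several tokens, let the main model verify them" idea from the post above can be shown with a toy greedy speculative decoder. This is a simplified sketch, not llama.cpp's implementation: the two "models" are made-up arithmetic functions, verification is done token by token rather than in one batched pass, and the usual bonus token after a fully accepted draft is omitted. `n_draft` loosely plays the role of the post's `--spec-draft-n-max`.

```python
from typing import Callable, List

Model = Callable[[List[int]], int]  # maps a token context to the next greedy token


def greedy_decode(target: Model, ctx: List[int], n_new: int) -> List[int]:
    """Plain autoregressive decoding: one target call per new token."""
    out = list(ctx)
    for _ in range(n_new):
        out.append(target(out))
    return out


def speculative_decode(target: Model, draft: Model, ctx: List[int],
                       n_new: int, n_draft: int = 4) -> tuple[List[int], int]:
    """Greedy speculative decoding; returns (tokens, number of verification rounds)."""
    out = list(ctx)
    goal = len(ctx) + n_new
    rounds = 0
    while len(out) < goal:
        rounds += 1
        # 1) The cheap draft model proposes up to n_draft tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(n_draft, goal - len(out))):
            proposal.append(draft(out + proposal))
        # 2) The target checks each position; a real engine batches this into one pass.
        for tok in proposal:
            expected = target(out)
            out.append(expected)   # always keep the target's own token
            if expected != tok:
                break              # first mismatch ends the round
    return out, rounds


# Toy stand-ins: the draft agrees with the target except after tokens divisible by 4.
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] * 3 + 1) % 11


def toy_draft(ctx: List[int]) -> int:
    guess = toy_target(ctx)
    return (guess + 1) % 11 if ctx[-1] % 4 == 0 else guess


reference = greedy_decode(toy_target, [1], 12)
tokens, rounds = speculative_decode(toy_target, toy_draft, [1], 12)
print(tokens == reference, rounds)
```

Because every emitted token is the target's own greedy choice, the output matches plain decoding exactly; the gain is that several tokens can be confirmed per verification round when the draft guesses well.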

Open source AI is actually moving at an unhinged pace right now. I literally hadn't even finished typing up my last Gemma 4 12B benchmark notes before Google went ahead and dropped the official Quantization Aware Training (QAT) checkpoints on Hugging Face.

If you missed the news, QAT basically bakes the compression directly into the training process. Instead of standard post-training quantization degrading the model's reasoning capabilities, QAT trains the model with compression in mind. Unsloth is reporting near-original performance at 4-bit with a ~72% lower memory footprint. Details in the comments.

Naturally, I had to instantly pull the new GGUFs to see what a single RTX 4090 card (24 GB VRAM, CUDA 12.8, Ubuntu 22) could do. I fired up the llama.cpp engine again. Look at these numbers:

1. Unsloth Gemma 4 26B-A4B IT (QAT Q4_K_XL)
Command: ./build/bin/llama-cli -m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf -cnv -ngl 99 -c 250000 -fa on -v
- VRAM used: 19.5 GB
- Context: 250,000 tokens
- Decode throughput: 193 tps

2. Unsloth Gemma 4 31B IT (QAT Q4_K_XL)
Command: ./build/bin/llama-cli -m gemma-4-31B-it-qat-UD-Q4_K_XL.gguf -cnv -ngl 99 -c 60000 -fa on -v
- VRAM used: 23 GB (tight, but zero system RAM spillover)
- Context: 60,000 tokens
- Decode throughput: 47 tps

We are essentially watching hardware bottlenecks evaporate in real time. An update literally drops before you can finish benchmarking the previous one. What a time to be running local hardware. If you have a single RTX 3090 or RTX 4090, these are the latest Gemma models to try this week.

Alok

26,841 views • 1 month ago
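
As a rough illustration of what "training with compression in mind" means in the post above: QAT typically simulates low-bit rounding in the forward pass so the weights learn to tolerate it. The sketch below shows only that rounding step on a handful of made-up weights, using a simple symmetric per-tensor scale. It is an assumption-laden toy, not Google's or Unsloth's actual recipe, and it does no training.

```python
def fake_quantize(weights: list[float], bits: int = 4) -> tuple[list[float], float]:
    """Round weights to a symmetric signed integer grid, then map them back to floats."""
    qmax = 2 ** (bits - 1) - 1                        # 7 for 4-bit signed values
    scale = (max(abs(w) for w in weights) / qmax) or 1.0
    quantized = [
        max(-qmax - 1, min(qmax, round(w / scale))) * scale
        for w in weights
    ]
    return quantized, scale


# Made-up weights purely for illustration.
weights = [0.70, -0.35, 0.12, 0.02, -0.61]
quantized, scale = fake_quantize(weights, bits=4)
errors = [abs(w - q) for w, q in zip(weights, quantized)]
print(f"scale={scale:.4f}, max rounding error={max(errors):.4f}")
```

Post-training quantization applies this rounding once after training, whereas QAT exposes the model to the rounding error while it can still adapt its weights.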

I just got the Gemma 4 26B A4B MoE model running fully locally with Hermes agent on an 8GB RTX 4060, and it's now backtesting trading strategies end to end, no hand-holding. If you're a trader or work on Wall Street, you don't want to miss this. Yes, fully automated. No cloud. No APIs beyond market data.

# Here's what I did:
Setup:
- Model: Gemma 4 26B-A4B QAT (MoE), Q4_K_XL, Unsloth's quant (link in the comments)
- Inference: llama.cpp (turboquant fork by Tom Turney, link in the comments)
- Hardware: RTX 4060, 8GB VRAM + 16GB RAM only (with 50 other Chrome tabs open)
- Context: 64K

llama.cpp turboquant flags:
-m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf -c 64000 --cache-type-k q8_0 --cache-type-v turbo3 --port 8080

turboquant helps achieve high prefill and decode throughput for interactive sessions. Throughput with Hermes agent:
- Decode: 25+ tokens/sec
- Prefill: 250+ tokens/sec

# Then I gave the agent one task:
Backtest a strategy:
- Buy when RSI crosses above 30
- Sell at +2% profit or -1% stop-loss
- No overlapping positions
- Use Google stock via yfinance
- Generate a full HTML report with candlestick charts + signals

What happened next was wild. It didn't just write code, it ran the entire workflow itself:
- Audited the environment (pip list, dependency check)
- Hit a ModuleNotFoundError because multiple Python installs were conflicting
- Ran "where python" to map every interpreter on the system
- Manually selected the correct Python 3.13 path and re-ran the script
- Wrote a clean state-machine backtester (strict no-overlapping-trades logic)
- Patched a yfinance MultiIndex quirk that would've crashed the script
- Built Plotly candlestick + RSI charts with buy/sell markers
- Calculated win rate, PnL, and summary stats
- Exported a polished single-file HTML report

Check the report at the end of the video or in the comments.

Biggest takeaway: local LLMs aren't just "chat assistants" anymore. They debug their own environment, write production code, and ship a finished deliverable on consumer hardware, for $0 in API costs. If you're still calling local models "toys," you're already behind.

This is just the beginning. Hermes agent just surpassed 1 trillion tokens in a single day on OpenRouter. Think about the scale of total token generation happening right now.

Disclaimer: This is not financial advice. Consult a professional before making any trading decisions.

Alok

104,670 views • 1 month ago
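
The strategy rules in the post above (enter on an RSI cross above 30, exit at +2% or -1%, never hold two positions) map naturally onto a two-state machine. The sketch below is a minimal illustration under stated assumptions, not the code the agent produced: it takes precomputed RSI values instead of fetching data with yfinance, it evaluates exits on closing prices only, and the price and RSI series are synthetic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Trade:
    entry_idx: int
    exit_idx: int
    entry_price: float
    exit_price: float

    @property
    def pnl_pct(self) -> float:
        return (self.exit_price / self.entry_price - 1.0) * 100.0


def backtest(closes: list[float], rsi: list[float],
             take_profit: float = 0.02, stop_loss: float = 0.01) -> list[Trade]:
    """FLAT -> LONG on an RSI cross above 30; LONG -> FLAT at +2% or -1% (close-based)."""
    trades: list[Trade] = []
    entry_idx: Optional[int] = None  # None means FLAT; an index means LONG since that bar
    for i in range(1, len(closes)):
        if entry_idx is None:
            # Entry signal: RSI was at or below 30 on the previous bar and is above it now.
            if rsi[i - 1] <= 30 < rsi[i]:
                entry_idx = i
        else:
            change = closes[i] / closes[entry_idx] - 1.0
            if change >= take_profit or change <= -stop_loss:
                trades.append(Trade(entry_idx, i, closes[entry_idx], closes[i]))
                entry_idx = None  # back to FLAT; new signals are considered from the next bar
    return trades


# Synthetic series purely for illustration (not real market data).
closes = [100.0, 98.0, 99.0, 101.5, 101.0, 100.0, 98.5, 99.0]
rsi = [35.0, 28.0, 32.0, 50.0, 29.0, 31.0, 40.0, 45.0]

trades = backtest(closes, rsi)
wins = sum(1 for t in trades if t.pnl_pct > 0)
print(f"{len(trades)} trades, win rate {wins / len(trades):.0%}")
```

Keeping the position in a single `entry_idx` variable is what enforces the no-overlap rule: entry signals are simply ignored while a trade is open.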