Loading video...

Video Failed to Load

There was a problem loading this video. This could be due to a temporary network issue or the video might be unavailable.

Got continuous batching working with SSMs in mlx-lm. Here's four OpenCode agents simultaneously running Nvidia's Nemotron Nano on 64GB M4 Max. This is a nice model for smaller machines since it's MoE + hybrid attention (small cache).

Awni Hannun

43,926 subscribers

35,078 views • 6 months ago •via X (Twitter)

Anya Rossi• Live Now

Private livecam show

0 Comments

No comments available

Comments from the original post will appear here

Related Videos

Day 11/90 of Inference Engineering How does vLLM work and how is it used in production? Before we discuss how vLLM works internally, it helps to understand what vLLM is. At a high level, vLLM is an inference engine that is designed to serve LLMs to thousands of concurrent users efficiently while managing scarce compute and memory. The goal for vLLM is to maximize throughput and minimize latency; optimizing for the best inference economics and experience for end users. With every request from the end user, it eventually ends up in the engine core, gets scheduled alongside other requests from other concurrent users, executes on the GPU, and updates the KV cache with the new key and value vectors, and streams the tokens back to the user. The Scheduler decides what requests should execute next while continuously batching requests together to maximize GPU utilization. Continuous batching is an inference optimization that allows new requests to join a running batch as other requests finish generating tokens. This helps with keeping the GPU utilization high instead of letting it sit idle waiting for an entire batch to complete generating. After the scheduler dispatches the selected batch to the Model Executor, the Model Executor prepares the tensors and metadata required for inference, retrieves each request’s block table from KV Cache Manager, launches the optimized transformer forward pass on the GPU, computes the logits, updates the KV cache with the new key and value vectors, and finally returns the results for sampling and streaming. The KV Cache Manager uses the PagedAttention memory layout to allocate fixed-size cache blocks on demand and maintains a Free Block Queue on the CPU that tracks which blocks in the GPU’s Paged KV Cache are currently free. When a request needs additional KV cache space, the KV Cache manager takes a free block from the queue and assigns it to that request, thus avoiding an expensive search through GPU memory for available cache blocks. All of these components form the core of vLLM’s inference engine. The Scheduler determines what requests are executed, the Model Executor determines how those requests are executed, the KV Cache Manager determines where each request’s KV cache lives using the PagedAttention Memory Layout. This architecture enables vLLM to serve thousands of concurrent requests with high throughput, low latency, and efficient GPU memory utilization. Heres a little animation that visualizes everything! - I've also completed the forward pass for my mnist.c project. I had a nice chat with shrey birmiwal, such a knowledgeable guy. Excited to learn more about vLLM and implement a tiny-vLLM one day.

Day 11/90 of Inference Engineering How does vLLM work and how is it used in production? Before we discuss how vLLM works internally, it helps to understand what vLLM is. At a high level, vLLM is an inference engine that is designed to serve LLMs to thousands of concurrent users efficiently while managing scarce compute and memory. The goal for vLLM is to maximize throughput and minimize latency; optimizing for the best inference economics and experience for end users. With every request from the end user, it eventually ends up in the engine core, gets scheduled alongside other requests from other concurrent users, executes on the GPU, and updates the KV cache with the new key and value vectors, and streams the tokens back to the user. The Scheduler decides what requests should execute next while continuously batching requests together to maximize GPU utilization. Continuous batching is an inference optimization that allows new requests to join a running batch as other requests finish generating tokens. This helps with keeping the GPU utilization high instead of letting it sit idle waiting for an entire batch to complete generating. After the scheduler dispatches the selected batch to the Model Executor, the Model Executor prepares the tensors and metadata required for inference, retrieves each request’s block table from KV Cache Manager, launches the optimized transformer forward pass on the GPU, computes the logits, updates the KV cache with the new key and value vectors, and finally returns the results for sampling and streaming. The KV Cache Manager uses the PagedAttention memory layout to allocate fixed-size cache blocks on demand and maintains a Free Block Queue on the CPU that tracks which blocks in the GPU’s Paged KV Cache are currently free. When a request needs additional KV cache space, the KV Cache manager takes a free block from the queue and assigns it to that request, thus avoiding an expensive search through GPU memory for available cache blocks. All of these components form the core of vLLM’s inference engine. The Scheduler determines what requests are executed, the Model Executor determines how those requests are executed, the KV Cache Manager determines where each request’s KV cache lives using the PagedAttention Memory Layout. This architecture enables vLLM to serve thousands of concurrent requests with high throughput, low latency, and efficient GPU memory utilization. Heres a little animation that visualizes everything! - I've also completed the forward pass for my mnist.c project. I had a nice chat with shrey birmiwal, such a knowledgeable guy. Excited to learn more about vLLM and implement a tiny-vLLM one day.

max fu

69,270 views • 8 days ago

Before the week ends, let's acknowledge one of the most INSANE week ever for open AI, with 25+ notable open-weight drops across every modality: 🧠 LLMs → NVIDIA Nemotron 3 Ultra: 550B hybrid Mamba-MoE, only 55B active, 1M context, MMLU 89.1. NVFP4 variant claims ~5x throughput on Blackwell. First openly-weighted 550B hybrid Mamba-Transformer, closing the gap with frontier closed models. → Google Gemma 4 12B: fully open dense any-to-any (text/image/audio/video), 256k context, encoder-free, 140+ languages, AIME 2026 at 77.5. Shipped with a 23-checkpoint QAT wave (mobile ONNX + MLX). Most deployable model of the week. → StepFun Step-3.7-Flash: 198B sparse MoE VLM, ~11B active, SWE-Bench PRO 56.3. Apache 2.0. → Liquid AI LFM2.5-8B-A1B: edge MoE, just 1.5B active, 128k ctx, MATH500 88.8, MLX-ready. Best on-device option this week. → JetBrains Mellum2-12B-A2.5B-Thinking: their first open MoE, near-Qwen3-14B coding at 2.5B active. Apache 2.0. 🎨 Image gen (the surprise of the week) → Ideogram 4: their FIRST-EVER open weights. 9.3B flow-matching DiT trained from scratch. #2 overall behind GPT Image 2, top open-weight model on Design Arena + LMArena. Strongest open checkpoint for text-rich images, full stop. It has taste. Still can't believe this is open weights. 🔊 Audio & Speech (a breakout week for open TTS, 4 labs shipped) → Boson Higgs Audio v3 4B: 102 languages, 21 emotions, singing/whispering/shouting, sub-second TTFA. → RedNote dots.tts: the only fully continuous (no codec) open TTS pipeline, Apache 2.0. → Google Magenta RealTime 2: real-time music gen, <200ms latency, text+audio+MIDI. multimodalart ported it to PyTorch within hours with live ZeroGPU demos. → NVIDIA Nemotron-3.5 ASR: 600M streaming, 17x more concurrent streams vs Parakeet RNNT 1.1B. 👁️ Vision & VLMs → PaddleOCR-VL-1.6: SOTA document parsing at 1B params, Apache 2.0. → Baidu NAVA: 6.3B joint audio-video gen, best-in-class A/V sync, Apache 2.0. 🎬 Video, 3D & World Models → NVIDIA Cosmos3-Super: 64B omnimodal world model coupling action trajectories with video+audio gen, for Physical AI. → JD JoyAI-Echo: up to 5-min multi-shot text-to-video on LTX-2.3. → ByteDance Bernini-R + VAST TripoSplat (single-image-to-3D Gaussian splats, MIT).

Before the week ends, let's acknowledge one of the most INSANE week ever for open AI, with 25+ notable open-weight drops across every modality: 🧠 LLMs → NVIDIA Nemotron 3 Ultra: 550B hybrid Mamba-MoE, only 55B active, 1M context, MMLU 89.1. NVFP4 variant claims ~5x throughput on Blackwell. First openly-weighted 550B hybrid Mamba-Transformer, closing the gap with frontier closed models. → Google Gemma 4 12B: fully open dense any-to-any (text/image/audio/video), 256k context, encoder-free, 140+ languages, AIME 2026 at 77.5. Shipped with a 23-checkpoint QAT wave (mobile ONNX + MLX). Most deployable model of the week. → StepFun Step-3.7-Flash: 198B sparse MoE VLM, ~11B active, SWE-Bench PRO 56.3. Apache 2.0. → Liquid AI LFM2.5-8B-A1B: edge MoE, just 1.5B active, 128k ctx, MATH500 88.8, MLX-ready. Best on-device option this week. → JetBrains Mellum2-12B-A2.5B-Thinking: their first open MoE, near-Qwen3-14B coding at 2.5B active. Apache 2.0. 🎨 Image gen (the surprise of the week) → Ideogram 4: their FIRST-EVER open weights. 9.3B flow-matching DiT trained from scratch. #2 overall behind GPT Image 2, top open-weight model on Design Arena + LMArena. Strongest open checkpoint for text-rich images, full stop. It has taste. Still can't believe this is open weights. 🔊 Audio & Speech (a breakout week for open TTS, 4 labs shipped) → Boson Higgs Audio v3 4B: 102 languages, 21 emotions, singing/whispering/shouting, sub-second TTFA. → RedNote dots.tts: the only fully continuous (no codec) open TTS pipeline, Apache 2.0. → Google Magenta RealTime 2: real-time music gen, <200ms latency, text+audio+MIDI. multimodalart ported it to PyTorch within hours with live ZeroGPU demos. → NVIDIA Nemotron-3.5 ASR: 600M streaming, 17x more concurrent streams vs Parakeet RNNT 1.1B. 👁️ Vision & VLMs → PaddleOCR-VL-1.6: SOTA document parsing at 1B params, Apache 2.0. → Baidu NAVA: 6.3B joint audio-video gen, best-in-class A/V sync, Apache 2.0. 🎬 Video, 3D & World Models → NVIDIA Cosmos3-Super: 64B omnimodal world model coupling action trajectories with video+audio gen, for Physical AI. → JD JoyAI-Echo: up to 5-min multi-shot text-to-video on LTX-2.3. → ByteDance Bernini-R + VAST TripoSplat (single-image-to-3D Gaussian splats, MIT).

Victor M

539,522 views • 1 month ago

Max B Released From Prison After 16 Years After being locked up for over 15 years, Max B is finally free. He's hitting the ground running. He already linked up with French Montana, popped up on the field at the Jets game, and he's got a welcome home party in NYC tonight, followed by a string of other events on the East Coast. "I got one mission: to restore musical excellence," he told Okayplayer this week. "It's brilliant. That's all I'm going to tell you."

Max B Released From Prison After 16 Years After being locked up for over 15 years, Max B is finally free. He's hitting the ground running. He already linked up with French Montana, popped up on the field at the Jets game, and he's got a welcome home party in NYC tonight, followed by a string of other events on the East Coast. "I got one mission: to restore musical excellence," he told Okayplayer this week. "It's brilliant. That's all I'm going to tell you."

Pigeons & Planes

23,697 views • 8 months ago

Gemma 4 12B QAT (dense) achieves 1000+ tokens/sec prefill on 8GB VRAM with 120k context Gemma 4 12B QAT (dense), TurboQuant (Without MTP), RTX 4060 8GB VRAM: Prefill: 1000+ tok/s (42% increase) Decode: 25+ tok/s (25% increase) Context: 120k (150% increase) prefill was 700 tok/sec and decode 20 tok/sec with only 48k context without turbo quant (older test with mtp link in the comments) llama.cpp TurboQuant flags: -m gemma-4-12B-it-qat-UD-Q4_K_XL.gguf -c 120000 --cache-type-k q8_0 --cache-type-v turbo3 -ngl 99 --port 8080 tested with a 27k prompt, 120k context loaded. -ngl 99 here isn't a typo, full 12B dense, every layer on GPU, on an 8GB card. that's the part worth sitting with. The model has vision, audio input, thinking/reasoning and fits your 8GB card. TurboQuant's KV cache savings are what free up the room to do that at 120k context. side by side with yesterday: 26B A4B MoE got 320+ tok/s prefill. this dense 12B is clearing 1000+ rig: RTX 4060 8GB · i7H · 16GB RAM same two flags as yesterday, different model size: --cache-type-k q8_0 --cache-type-v turbo3 thanks to TheTom/llama-cpp-turboquant, TurboQuant fork of llama.cpp by Tom Turney (Tom Turney) to make this work. unsloth's model quant huggingface and the llama.cpp fork github link in the comments Do you prefer a dense or a MoE for your 8GB card?

Gemma 4 12B QAT (dense) achieves 1000+ tokens/sec prefill on 8GB VRAM with 120k context Gemma 4 12B QAT (dense), TurboQuant (Without MTP), RTX 4060 8GB VRAM: Prefill: 1000+ tok/s (42% increase) Decode: 25+ tok/s (25% increase) Context: 120k (150% increase) prefill was 700 tok/sec and decode 20 tok/sec with only 48k context without turbo quant (older test with mtp link in the comments) llama.cpp TurboQuant flags: -m gemma-4-12B-it-qat-UD-Q4_K_XL.gguf -c 120000 --cache-type-k q8_0 --cache-type-v turbo3 -ngl 99 --port 8080 tested with a 27k prompt, 120k context loaded. -ngl 99 here isn't a typo, full 12B dense, every layer on GPU, on an 8GB card. that's the part worth sitting with. The model has vision, audio input, thinking/reasoning and fits your 8GB card. TurboQuant's KV cache savings are what free up the room to do that at 120k context. side by side with yesterday: 26B A4B MoE got 320+ tok/s prefill. this dense 12B is clearing 1000+ rig: RTX 4060 8GB · i7H · 16GB RAM same two flags as yesterday, different model size: --cache-type-k q8_0 --cache-type-v turbo3 thanks to TheTom/llama-cpp-turboquant, TurboQuant fork of llama.cpp by Tom Turney (Tom Turney) to make this work. unsloth's model quant huggingface and the llama.cpp fork github link in the comments Do you prefer a dense or a MoE for your 8GB card?

Alok

34,500 views • 1 month ago

Run Gemma 4 26B MoE on 8GB VRAM with 250k context at 20+ tokens/sec If you own any 8GB VRAM graphics card, stop what you are doing. Local AI just had its absolute "Holy Shit" moment for budget hardware. Yesterday, I benchmarked Unsloth Gemma 4 12B Q4_K_XL on an 8GB card. The community went wild but immediately demanded more: "Can we run a 25B+ model on budget GPUs?" Today, I’m delivering exactly that. I am running a massive 26B parameter Mixture of Experts (MoE) model locally on a standard 8GB VRAM setup with 250k full native context!. If you own an RTX 3060, 3070, 4060, or any budget GPU with 8GB of VRAM, the local AI paradigm has completely changed. The performance metrics are astonishing: - 20 tokens/sec flat decode throughput. - Stable, flat decode speed even with massive prompts. - I threw a 60k token prompt at it, and it still clocked in at 20 TPS without dropping a single frame. # What about prefill? Yes, Time To First Token (TTFT) is slightly high when swallowing massive contexts. But with a solid 200 tokens/sec prefill speed, the wait is barely noticeable and highly usable. And this is running completely without Multi Token Prediction (MTP) active. How is this possible? It’s the magic of Google's new QAT (Quantization Aware Training) quants for Gemma 4. The model weight file (unsloth gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf) is only 13.2 GB, making it the ultimate local powerhouse. # The Test Setup: CPU: Intel Core i7 RAM: 16GB System RAM GPU: NVIDIA GeForce RTX 4060 Laptop GPU (8GB VRAM) # The Secret Sauce (The -cmoe Flag) To make this work properly on any 8GB card, you must use the -cmoe (CPU MoE) flag in llama.cpp. This flag isolates the heavy MoE expert weights directly to system memory (CPU/RAM) while letting your GPU focus strictly on the Attention layers and the KV Cache. It prevents VRAM spillage and holds the throughput rock solid. # The flags: -m "gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf" -cmoe -c 248000 -v Once running, just open the UI on localhost and toggle the new reasoning lightbulb icon in the text input box to watch the model perform multi step thinking. Are you still running smaller models, or are you ready to scale up your budget local setups? Let's discuss in the replies

Run Gemma 4 26B MoE on 8GB VRAM with 250k context at 20+ tokens/sec If you own any 8GB VRAM graphics card, stop what you are doing. Local AI just had its absolute "Holy Shit" moment for budget hardware. Yesterday, I benchmarked Unsloth Gemma 4 12B Q4_K_XL on an 8GB card. The community went wild but immediately demanded more: "Can we run a 25B+ model on budget GPUs?" Today, I’m delivering exactly that. I am running a massive 26B parameter Mixture of Experts (MoE) model locally on a standard 8GB VRAM setup with 250k full native context!. If you own an RTX 3060, 3070, 4060, or any budget GPU with 8GB of VRAM, the local AI paradigm has completely changed. The performance metrics are astonishing: - 20 tokens/sec flat decode throughput. - Stable, flat decode speed even with massive prompts. - I threw a 60k token prompt at it, and it still clocked in at 20 TPS without dropping a single frame. # What about prefill? Yes, Time To First Token (TTFT) is slightly high when swallowing massive contexts. But with a solid 200 tokens/sec prefill speed, the wait is barely noticeable and highly usable. And this is running completely without Multi Token Prediction (MTP) active. How is this possible? It’s the magic of Google's new QAT (Quantization Aware Training) quants for Gemma 4. The model weight file (unsloth gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf) is only 13.2 GB, making it the ultimate local powerhouse. # The Test Setup: CPU: Intel Core i7 RAM: 16GB System RAM GPU: NVIDIA GeForce RTX 4060 Laptop GPU (8GB VRAM) # The Secret Sauce (The -cmoe Flag) To make this work properly on any 8GB card, you must use the -cmoe (CPU MoE) flag in llama.cpp. This flag isolates the heavy MoE expert weights directly to system memory (CPU/RAM) while letting your GPU focus strictly on the Attention layers and the KV Cache. It prevents VRAM spillage and holds the throughput rock solid. # The flags: -m "gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf" -cmoe -c 248000 -v Once running, just open the UI on localhost and toggle the new reasoning lightbulb icon in the text input box to watch the model perform multi step thinking. Are you still running smaller models, or are you ready to scale up your budget local setups? Let's discuss in the replies

Alok

292,096 views • 1 month ago

AI in robotics gets all the attention right now, but sometimes the most interesting work is very practical. Viet built a small vision system that counts potatoes on a conveyor belt. No giant dataset. No huge model. Just a clear problem and a smart setup. He used Ultralytics’ ObjectCounter, trained a tiny YOLO11 nano model, and because there was no potato dataset, he annotated a single frame with SAM 2 and trained from that. One frame. Still works across the whole video. It is a good reminder that useful AI in industry often looks like this. Focused. Lightweight. Solves a real task. If you work in manufacturing or robotics, these small systems are usually the fastest wins. They save time, reduce errors, and do not need massive infrastructure. Nice work, Viet. His projects: —- Weekly robotics and AI insights. Subscribe free:

AI in robotics gets all the attention right now, but sometimes the most interesting work is very practical. Viet built a small vision system that counts potatoes on a conveyor belt. No giant dataset. No huge model. Just a clear problem and a smart setup. He used Ultralytics’ ObjectCounter, trained a tiny YOLO11 nano model, and because there was no potato dataset, he annotated a single frame with SAM 2 and trained from that. One frame. Still works across the whole video. It is a good reminder that useful AI in industry often looks like this. Focused. Lightweight. Solves a real task. If you work in manufacturing or robotics, these small systems are usually the fastest wins. They save time, reduce errors, and do not need massive infrastructure. Nice work, Viet. His projects: —- Weekly robotics and AI insights. Subscribe free:

Ilir Aliu

1,674,988 views • 8 months ago

IT'S MONDAY MORNING. YOU'RE ASLEEP Your trading agent has already opened three positions, adjusted risk exposure twice, and closed a profitable trade. Before you reached for your phone. $100,000 simulated capital. 72 hours of full autonomy. $6,000 prize pool. Zero coding required > no-code agent builder, if you can describe how you think about markets, Bloome builds the agent > four specialized agents running simultaneously: research, technical analysis, risk management, execution > scoring rewards discipline, a 7% return with clean risk management beats a 14% return with chaos > 1,000 invite slots, competition activates at 50 participants, the field is still small right now > first place: $3,000 · second: $2,000 · third: $1,000 · payouts go out June 12 this isn't a trading bot competition it's a test of whether you can think systematically about markets and encode that thinking into an autonomous system the old model: sit at a screen, make emotional decisions, compete against algorithms that never sleep the new model: build the algorithm → let it run → go to sleep

IT'S MONDAY MORNING. YOU'RE ASLEEP Your trading agent has already opened three positions, adjusted risk exposure twice, and closed a profitable trade. Before you reached for your phone. $100,000 simulated capital. 72 hours of full autonomy. $6,000 prize pool. Zero coding required > no-code agent builder, if you can describe how you think about markets, Bloome builds the agent > four specialized agents running simultaneously: research, technical analysis, risk management, execution > scoring rewards discipline, a 7% return with clean risk management beats a 14% return with chaos > 1,000 invite slots, competition activates at 50 participants, the field is still small right now > first place: $3,000 · second: $2,000 · third: $1,000 · payouts go out June 12 this isn't a trading bot competition it's a test of whether you can think systematically about markets and encode that thinking into an autonomous system the old model: sit at a screen, make emotional decisions, compete against algorithms that never sleep the new model: build the algorithm → let it run → go to sleep

Shadow Nick

29,119 views • 1 month ago

A CHINESE GUY PUT 4 MINISFORUM MS-S1 MAX MINI PCs IN HIS BEDROOM AND TURNED THEM INTO A 24/7 AI AGENT CLUSTER. TOTAL POWER BILL: ABOUT $44/MO. each box is a tiny local AI workstation built around the Ryzen AI Max+ 395. around $3,000 per unit gets him 128GB of unified memory, 2TB storage, dual 10GbE, and up to roughly 96GB usable as VRAM on Linux. one MS-S1 Max can already run serious open models without touching the cloud. Qwen3-Coder 30B for fast coding, Llama 3.3 70B for heavier reasoning, and larger research models overnight when speed matters less than free inference. four boxes in one room changes the whole game. he is not opening a chatbot, paying for every loop, or shutting agents down before sleep. this is private infrastructure that keeps working even when he is offline. the agents can sort inboxes, review code, summarize documents, monitor feeds, prep meetings, and read papers overnight. on cloud APIs, that kind of always-on stack can easily burn $800 to $1,200 a month if used aggressively. his setup is roughly a $12,000 hardware spend, but the monthly cost is basically electricity. a rack, a switch, a NAS, a small monitor, and four tiny MS-S1 Max boxes turning a bedroom corner into a private inference factory. this is what AI looks like when it stops being rented and starts becoming something you own.

A CHINESE GUY PUT 4 MINISFORUM MS-S1 MAX MINI PCs IN HIS BEDROOM AND TURNED THEM INTO A 24/7 AI AGENT CLUSTER. TOTAL POWER BILL: ABOUT $44/MO. each box is a tiny local AI workstation built around the Ryzen AI Max+ 395. around $3,000 per unit gets him 128GB of unified memory, 2TB storage, dual 10GbE, and up to roughly 96GB usable as VRAM on Linux. one MS-S1 Max can already run serious open models without touching the cloud. Qwen3-Coder 30B for fast coding, Llama 3.3 70B for heavier reasoning, and larger research models overnight when speed matters less than free inference. four boxes in one room changes the whole game. he is not opening a chatbot, paying for every loop, or shutting agents down before sleep. this is private infrastructure that keeps working even when he is offline. the agents can sort inboxes, review code, summarize documents, monitor feeds, prep meetings, and read papers overnight. on cloud APIs, that kind of always-on stack can easily burn $800 to $1,200 a month if used aggressively. his setup is roughly a $12,000 hardware spend, but the monthly cost is basically electricity. a rack, a switch, a NAS, a small monitor, and four tiny MS-S1 Max boxes turning a bedroom corner into a private inference factory. this is what AI looks like when it stops being rented and starts becoming something you own.

Gipp 🦅

24,836 views • 1 month ago

you're paying $20/mo for something your $500 GPU can already do. Gemma 4 26B A4B QAT MoE + Hermes Agent running on a single RTX 4060 (8GB VRAM). Built a vision capable, 100% free, 100% local, private AI assistant that lives in my Chrome browser. No API keys. No cloud. No subscriptions. 100% vibe coded. 0% handholding. It has full context of whatever's on my screen can answer questions, summarize pages, extract data, and see images. Same local model handles everything, no external calls, ever. keep reading for the model and hermes agent tips i learnt while building this locally. Here's the exact setup for anyone running local LLMs on 6-8 GB VRAM: llama.cpp server flags (on my NVIDIA RTX 4060 8gb VRAM): -m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf --cache-type-k q8_0 --cache-type-v q8_0 -c 150000 --port 8080 Throughput with quantization: Prefill: 200-250 tokens/sec Decode: 20-25 tokens/sec reduce context if oom on 6 gb vram card. Key learnings: - Quantize KV cache to q8 for faster prefill/decode. Prefill goes from 100-150 (unquantized) to 200-250 tok/s (q8). - But watch out, once actual context grows past ~50k tokens on high entropy workloads, q8 KV quantization can cause hallucinations. Low entropy workloads are mostly unaffected. If you see it happening, drop the quantization. This is common across all local models. - In Hermes Agent settings -> Memory & Context, bump compression threshold from default 0.5 to 0.7. Default triggers way too frequent context compression and eats time. Up next: add persistent memory, web search, tool calling, streaming output and whatever you suggest. Running a 26B MoE with vision + 150k context window on 8GB VRAM would've sounded impossible 6 months ago. Works the same on the NVIDIA RTX 3060 Ti, 3070, 4060 Ti, 5060, 2080, or any 8GB card. VRAM is the only requirement. Local AI agents are closer than people think. You just need to know where the knobs are. Model's Unsloth quant hugging face link in the comments. Have you tried Hermes agent by Nous Research yet? What are you building with local LLMs? Drop it below, let's see what this community is shipping.

you're paying $20/mo for something your $500 GPU can already do. Gemma 4 26B A4B QAT MoE + Hermes Agent running on a single RTX 4060 (8GB VRAM). Built a vision capable, 100% free, 100% local, private AI assistant that lives in my Chrome browser. No API keys. No cloud. No subscriptions. 100% vibe coded. 0% handholding. It has full context of whatever's on my screen can answer questions, summarize pages, extract data, and see images. Same local model handles everything, no external calls, ever. keep reading for the model and hermes agent tips i learnt while building this locally. Here's the exact setup for anyone running local LLMs on 6-8 GB VRAM: llama.cpp server flags (on my NVIDIA RTX 4060 8gb VRAM): -m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf --cache-type-k q8_0 --cache-type-v q8_0 -c 150000 --port 8080 Throughput with quantization: Prefill: 200-250 tokens/sec Decode: 20-25 tokens/sec reduce context if oom on 6 gb vram card. Key learnings: - Quantize KV cache to q8 for faster prefill/decode. Prefill goes from 100-150 (unquantized) to 200-250 tok/s (q8). - But watch out, once actual context grows past ~50k tokens on high entropy workloads, q8 KV quantization can cause hallucinations. Low entropy workloads are mostly unaffected. If you see it happening, drop the quantization. This is common across all local models. - In Hermes Agent settings -> Memory & Context, bump compression threshold from default 0.5 to 0.7. Default triggers way too frequent context compression and eats time. Up next: add persistent memory, web search, tool calling, streaming output and whatever you suggest. Running a 26B MoE with vision + 150k context window on 8GB VRAM would've sounded impossible 6 months ago. Works the same on the NVIDIA RTX 3060 Ti, 3070, 4060 Ti, 5060, 2080, or any 8GB card. VRAM is the only requirement. Local AI agents are closer than people think. You just need to know where the knobs are. Model's Unsloth quant hugging face link in the comments. Have you tried Hermes agent by Nous Research yet? What are you building with local LLMs? Drop it below, let's see what this community is shipping.

Alok

36,031 views • 23 days ago

$“Fire to the oligarchy!” implying the entire network of the sole oligarch in Georgia, the dictator himself. Day 70 of large-scale, continuous protests in Georgia. Day 100 overall since the rigged elections. Day 337 of various forms and stages of constant resistance against the regime, since their second tabling of the Russian law on April 3, 2024. The new elections demand is rapidly transforming into “finish with the regime” and most of our partners still have not voiced formal support for new elections that stems from four factors: fraudulent elections; U-turn in foreign policy, in a plot twist to the regime’s own supporters; systemic torture of citizens; an ongoing crisis that will not end. Democracies and even hybrid systems schedule elections for a fraction of this, don’t they? 📷 Nestan Nene Kvinikadze$

“Fire to the oligarchy!” implying the entire network of the sole oligarch in Georgia, the dictator himself. Day 70 of large-scale, continuous protests in Georgia. Day 100 overall since the rigged elections. Day 337 of various forms and stages of constant resistance against the regime, since their second tabling of the Russian law on April 3, 2024. The new elections demand is rapidly transforming into “finish with the regime” and most of our partners still have not voiced formal support for new elections that stems from four factors: fraudulent elections; U-turn in foreign policy, in a plot twist to the regime’s own supporters; systemic torture of citizens; an ongoing crisis that will not end. Democracies and even hybrid systems schedule elections for a fraction of this, don’t they? 📷 Nestan Nene Kvinikadze

Marika Mikiashvili 🇬🇪🇺🇦🇪🇺

28,394 views • 1 year ago

3 weeks since ml-intern launched and we just hit 1M messages exchanged. that's 3.3 agent-years of ML research in 21 days. 2 months worth of research every day. 17,383 training jobs total. talk about AI acceleration. here's some of what people built: Carlos Miguel Patiño replicated the full DeepSeek v4 architecture and pre+post trained a 100M MoE from scratch. → it landed a third place submission on Keller Jordan optimizer competition. autoresearch on SOTA territory. Lewis Tunstall Got the intern to convert Alec Radford's cool new talkie-lm 1930 model to work with transformers. tokenizer, chat template, model conversion etc all one-shotted by ml-intern. someone created entire PhD dissertation chapter on context-aware agentic cyber defense drafted with 16 research subagents. and someone used it to crack an Paul Jankura kernel optimization take-home. (we don't know how to feel about this one 👀 ) just getting started →

3 weeks since ml-intern launched and we just hit 1M messages exchanged. that's 3.3 agent-years of ML research in 21 days. 2 months worth of research every day. 17,383 training jobs total. talk about AI acceleration. here's some of what people built: Carlos Miguel Patiño replicated the full DeepSeek v4 architecture and pre+post trained a 100M MoE from scratch. → it landed a third place submission on Keller Jordan optimizer competition. autoresearch on SOTA territory. Lewis Tunstall Got the intern to convert Alec Radford's cool new talkie-lm 1930 model to work with transformers. tokenizer, chat template, model conversion etc all one-shotted by ml-intern. someone created entire PhD dissertation chapter on context-aware agentic cyber defense drafted with 16 research subagents. and someone used it to crack an Paul Jankura kernel optimization take-home. (we don't know how to feel about this one 👀 ) just getting started →

Aksel

35,091 views • 2 months ago

🚨 Anthropic committed up to 1M TPU chips for Claude. Openai is leasing TPUs for chatgpt inference. Here's How kernels work on TPUs (deep dive 2/6 by emi) pallas is Google's answer to kernel writing. a python kernel SDK built on JAX. still very experimental (jax.experimental.pallas). on TPU it compiles through mosaic; on GPU it lowers to triton. if you know CUDA, the syntax will feel familiar but the execution model is completely different. in CUDA, grid=(4,4) launches 16 blocks running simultaneously across SMs. in pallas, those 16 iterations run one after another in lexicographic order. no threads. no warps. no blocks. no occupancy tuning. a TPU is a sequential machine with a very wide vector register — more like a CPU than a GPU. performance comes from width: a 128x128 systolic array doing matmul and an 8x128 SIMD vector unit doing everything else. maximum parallelism on chip: 2, one per TensorCore in megacore mode. three concepts replace CUDA's thread/block/grid hierarchy. Refs are mutable memory references. because execution is sequential, each iteration safely accumulates without atomics. in CUDA you'd need atomics or a separate reduction pass. the memory model is also very different from NVIDIA's. zero hardware caches. VMEM is 32-128 MiB of software-managed scratchpad — 500-1000x larger than GPU shared memory per SM. all data must be explicitly DMA'd from HBM to VMEM before any computation touches it. four levels: HBM → VMEM → VREGs → MXU/VPU, plus SMEM for scalar control data. every byte of data movement is your responsibility. this is like CUDA shared memory except it's 500x bigger and there's no cache fallback. pipelining is mandatory. without double-buffering HBM→VMEM transfers, the MXU just stalls waiting for data. this is the single most important optimization on TPU. and because grid execution is sequential and deterministic, consecutive iterations that need the same input block skip the redundant HBM transfer automatically, impossible on GPU where block execution order is undefined. the compilation pipeline is unlike anything in this series: python → jaxpr → stableHLO → XLA HLO (71+ optimization passes) → LLO (78+ passes) → 322-bit VLIW bundles. the compiler packs instructions for scalar, vector, matrix, and DMA units into a single 322-bit word. everything in that bundle executes in parallel, with no runtime scheduling.

🚨 Anthropic committed up to 1M TPU chips for Claude. Openai is leasing TPUs for chatgpt inference. Here's How kernels work on TPUs (deep dive 2/6 by emi) pallas is Google's answer to kernel writing. a python kernel SDK built on JAX. still very experimental (jax.experimental.pallas). on TPU it compiles through mosaic; on GPU it lowers to triton. if you know CUDA, the syntax will feel familiar but the execution model is completely different. in CUDA, grid=(4,4) launches 16 blocks running simultaneously across SMs. in pallas, those 16 iterations run one after another in lexicographic order. no threads. no warps. no blocks. no occupancy tuning. a TPU is a sequential machine with a very wide vector register — more like a CPU than a GPU. performance comes from width: a 128x128 systolic array doing matmul and an 8x128 SIMD vector unit doing everything else. maximum parallelism on chip: 2, one per TensorCore in megacore mode. three concepts replace CUDA's thread/block/grid hierarchy. Refs are mutable memory references. because execution is sequential, each iteration safely accumulates without atomics. in CUDA you'd need atomics or a separate reduction pass. the memory model is also very different from NVIDIA's. zero hardware caches. VMEM is 32-128 MiB of software-managed scratchpad — 500-1000x larger than GPU shared memory per SM. all data must be explicitly DMA'd from HBM to VMEM before any computation touches it. four levels: HBM → VMEM → VREGs → MXU/VPU, plus SMEM for scalar control data. every byte of data movement is your responsibility. this is like CUDA shared memory except it's 500x bigger and there's no cache fallback. pipelining is mandatory. without double-buffering HBM→VMEM transfers, the MXU just stalls waiting for data. this is the single most important optimization on TPU. and because grid execution is sequential and deterministic, consecutive iterations that need the same input block skip the redundant HBM transfer automatically, impossible on GPU where block execution order is undefined. the compilation pipeline is unlike anything in this series: python → jaxpr → stableHLO → XLA HLO (71+ optimization passes) → LLO (78+ passes) → 322-bit VLIW bundles. the compiler packs instructions for scalar, vector, matrix, and DMA units into a single 322-bit word. everything in that bundle executes in parallel, with no runtime scheduling.

wafer

33,134 views • 17 days ago

The human brain is truly a marvel of nature. If you horribly reductive, and boiled it down to a language model, you'd be looking at roughly 100 trillon parameters running as a sparse MoE architecture Only about 1-5% of neurons fire at any given moment, meaning the brain "activates" maybe 1-5 trillion parameters per inference step. For context, the largest AI models we've built probably top out around 5 trillion parameters. The brain is roughly 100x larger. Even its active params at any given moment are larger than almost every model in existence today. Here's what melts my brain (pun intnended) though Your brain does all of this on about 20 watts of power, less than a dim light bulb. Training a frontier AI model consumes enough electricity to power small cities for months. Running inference across data centers pulls megawatts. Your brain runs 24/7 for 80+ years on the equivalent of a phone charger. We haven't come close to matching the brain's scale. And we're not even in the same universe when it comes to efficiency. Evolution spent 500 million yrs optimizing the most energy-efficient intelligence architecture ever known. we're trying to brute force our way there with compute and electricity. Nature is still the best engineer in the room.

The human brain is truly a marvel of nature. If you horribly reductive, and boiled it down to a language model, you'd be looking at roughly 100 trillon parameters running as a sparse MoE architecture Only about 1-5% of neurons fire at any given moment, meaning the brain "activates" maybe 1-5 trillion parameters per inference step. For context, the largest AI models we've built probably top out around 5 trillion parameters. The brain is roughly 100x larger. Even its active params at any given moment are larger than almost every model in existence today. Here's what melts my brain (pun intnended) though Your brain does all of this on about 20 watts of power, less than a dim light bulb. Training a frontier AI model consumes enough electricity to power small cities for months. Running inference across data centers pulls megawatts. Your brain runs 24/7 for 80+ years on the equivalent of a phone charger. We haven't come close to matching the brain's scale. And we're not even in the same universe when it comes to efficiency. Evolution spent 500 million yrs optimizing the most energy-efficient intelligence architecture ever known. we're trying to brute force our way there with compute and electricity. Nature is still the best engineer in the room.

am.will

130,733 views • 3 months ago

This guy built a mini AI farm out of 4 Nvidia boxes It does not look like a data center. It looks like a stack of small machines sitting next to a laptop. But each box is a DGX Spark with Grace Blackwell inside, 128GB unified memory, and enough room to run models normal gaming GPUs cannot even open. Using the launch price from the article, 4 of them is almost $12,000 of local AI compute on one desk. That sounds expensive until you compare it to cloud GPUs. A serious AI builder can burn $1,500 to $3,000 a month renting A100s and H100s for client work, fine-tunes, agents and 70B models. He basically moved that bill from the cloud into hardware he owns. 4 Nvidia boxes. 512GB unified memory. No hourly meter running in the background. No rented GPUs eating the margin every time an agent runs too long. The funny part is most people still think local AI means a slow laptop running a toy model. Meanwhile guys like this are stacking compute at home. Save this, local AI is turning into the new mining farm.

This guy built a mini AI farm out of 4 Nvidia boxes It does not look like a data center. It looks like a stack of small machines sitting next to a laptop. But each box is a DGX Spark with Grace Blackwell inside, 128GB unified memory, and enough room to run models normal gaming GPUs cannot even open. Using the launch price from the article, 4 of them is almost $12,000 of local AI compute on one desk. That sounds expensive until you compare it to cloud GPUs. A serious AI builder can burn $1,500 to $3,000 a month renting A100s and H100s for client work, fine-tunes, agents and 70B models. He basically moved that bill from the cloud into hardware he owns. 4 Nvidia boxes. 512GB unified memory. No hourly meter running in the background. No rented GPUs eating the margin every time an agent runs too long. The funny part is most people still think local AI means a slow laptop running a toy model. Meanwhile guys like this are stacking compute at home. Save this, local AI is turning into the new mining farm.

Gipp 🦅

590,100 views • 1 month ago

Day 12/90 of Inference Engineering What is chunked prefill within vLLM? In continuation of yesterday's post on the high level architecture of vLLM, I want to dive deeper into vLLM core engine starting with the mechanics of chunked prefill. In this post, I will closely follow the original blog on the anatomy of vLLM. To start, let's define chunked prefill. It's a runtime inference optimization technique that splits a long input request so that it doesn’t monopolize the whole GPU. Keep in mind this is all within the context of vLLM. And since vLLM is an inference engine that's meant to serve a model to multiple concurrent users, having a GPU that’s fully monopolized on a single user's request means other users' requests would be in queue waiting to be processed. It isn’t too good to have the whole GPU occupied on a single request when the GPU is meant to be shared! So the key idea behind chunked prefill is to break the long request into smaller chunks, so that each chunk along with other users' requests gets processed and written into the KV cache together. Suppose we split up the long request into chunks and each chunk has 8 tokens. Now each memory block can hold 4 tokens. Therefore, 8 tokens can fit into 2 blocks of memory. After the first forward pass, 2 blocks are occupied, and after the second forward pass, 4 blocks of memory are occupied and so forth. Each forward pass handles a small chunk of the long request so that there's room in the same pass to keep serving other users' requests. Here's a small animation that I made today to fully visualize the idea behind chunked prefill when learning this topic~

Day 12/90 of Inference Engineering What is chunked prefill within vLLM? In continuation of yesterday's post on the high level architecture of vLLM, I want to dive deeper into vLLM core engine starting with the mechanics of chunked prefill. In this post, I will closely follow the original blog on the anatomy of vLLM. To start, let's define chunked prefill. It's a runtime inference optimization technique that splits a long input request so that it doesn’t monopolize the whole GPU. Keep in mind this is all within the context of vLLM. And since vLLM is an inference engine that's meant to serve a model to multiple concurrent users, having a GPU that’s fully monopolized on a single user's request means other users' requests would be in queue waiting to be processed. It isn’t too good to have the whole GPU occupied on a single request when the GPU is meant to be shared! So the key idea behind chunked prefill is to break the long request into smaller chunks, so that each chunk along with other users' requests gets processed and written into the KV cache together. Suppose we split up the long request into chunks and each chunk has 8 tokens. Now each memory block can hold 4 tokens. Therefore, 8 tokens can fit into 2 blocks of memory. After the first forward pass, 2 blocks are occupied, and after the second forward pass, 4 blocks of memory are occupied and so forth. Each forward pass handles a small chunk of the long request so that there's room in the same pass to keep serving other users' requests. Here's a small animation that I made today to fully visualize the idea behind chunked prefill when learning this topic~

max fu

28,891 views • 7 days ago

‼️Copy Fail (CVE-2026-31431) is a Linux privilege escalation bug that lets any local user get root using a 732-byte Python script, and itworks on basically every major Linux distro shipped since 2017. Website: Write-up: GitHub: It's a logic flaw in the kernel's crypto code (authencesn via AF_ALG and splice()) that allows a small write into the page cache, which can be used to tamper with a setuid binary like /usr/bin/su. Think how bad this is going to be for shared environments like Kubernetes, CI runners, and cloud sandboxes, where it enables container escape and tenant-to-host compromise. Found by Theori's Xint Code scanner, patched in the mainline kernel, and publicly disclosed on April 29, 2026; if you can't patch right away, the recommended workaround is to disable the algif_aead module.

‼️Copy Fail (CVE-2026-31431) is a Linux privilege escalation bug that lets any local user get root using a 732-byte Python script, and itworks on basically every major Linux distro shipped since 2017. Website: Write-up: GitHub: It's a logic flaw in the kernel's crypto code (authencesn via AF_ALG and splice()) that allows a small write into the page cache, which can be used to tamper with a setuid binary like /usr/bin/su. Think how bad this is going to be for shared environments like Kubernetes, CI runners, and cloud sandboxes, where it enables container escape and tenant-to-host compromise. Found by Theori's Xint Code scanner, patched in the mainline kernel, and publicly disclosed on April 29, 2026; if you can't patch right away, the recommended workaround is to disable the algif_aead module.

Dark Web Informer

444,369 views • 2 months ago

BTW, for those wondering, the issue I talked about earlier was with dbrand's Killswitch for Switch 2 grip. They are a sponsor of the channel. So people naturally have questions when they see videos like this (came from their reddit). Because I am NOT a normal customer, and instead being sponsored by them, I have the unique opportunity to talk to them directly about the issue. They just got back to me. Here is what I can share from our communication: "Yep, we’re definitely working on it. Ultimately, this is a manufacturing tolerance issue on the small lip that keeps the Joy-Con Grips attached to the Joy-Cons (some were a hair too thick). We’ve got a replacement program set up already, where anyone who is affected is able to email robotsdbrand.com with “July Joy-Cons” in the subject to be added to the list for replacement stock in July." So they are aware of it, and they did tell me they can replicate it with a small number of units from their run. I won't share the exact number of units that have this issue - it's quite tiny compared to their total manufacturing run - but they have confirmed there is a certain run of units that do indeed have this issue. They are offering free replacements to be shipped in July for those affected.

BTW, for those wondering, the issue I talked about earlier was with dbrand's Killswitch for Switch 2 grip. They are a sponsor of the channel. So people naturally have questions when they see videos like this (came from their reddit). Because I am NOT a normal customer, and instead being sponsored by them, I have the unique opportunity to talk to them directly about the issue. They just got back to me. Here is what I can share from our communication: "Yep, we’re definitely working on it. Ultimately, this is a manufacturing tolerance issue on the small lip that keeps the Joy-Con Grips attached to the Joy-Cons (some were a hair too thick). We’ve got a replacement program set up already, where anyone who is affected is able to email robotsdbrand.com with “July Joy-Cons” in the subject to be added to the list for replacement stock in July." So they are aware of it, and they did tell me they can replicate it with a small number of units from their run. I won't share the exact number of units that have this issue - it's quite tiny compared to their total manufacturing run - but they have confirmed there is a certain run of units that do indeed have this issue. They are offering free replacements to be shipped in July for those affected.

Nintendo Prime

1,291,581 views • 1 year ago