Video wird geladen...

Video konnte nicht geladen werden

Beim Laden dieses Videos ist ein Problem aufgetreten. Dies könnte an einem vorübergehenden Netzwerkproblem liegen oder das Video ist möglicherweise nicht verfügbar.

THIS AMERICAN DEVELOPER SPENT WEEKS DEBUGGING TIMEOUT ERRORS IN OLLAMA. THEN HE LOOKED UNDER THE HOOD LM Studio is just llama.cpp Ollama is just llama.cpp so he cloned llama.cpp from source, pulled Qwen 3.6 35B off Hugging Face, set up asymmetric KV quantization and got a local server running... show more

leopardracer

9,668 subscribers

238,950 Aufrufe • vor 2 Monaten •via X (Twitter)

Wissenschaft & Technologie

Anya Rossi• Live Now

Private livecam show

0 Kommentare

Keine Kommentare verfügbar

Kommentare vom Original-Post werden hier angezeigt

Ähnliche Videos

THIS DEVELOPER HASN’T PAID AN API BILL IN 3 MONTHS. HIS AGENTS RAN 10,000 TIMES FOR FREE he built a local AI lab under his desk two GPUs 32GB VRAM zero rate limits his agents loop 400 times if they want to his coworkers are still watching the usage dashboard the only thing separating them: llama.cpp + llama-swap every prompt stays on his machine every experiment costs $0 zero invoices bookmark & like this before your next API bill hits

THIS DEVELOPER HASN’T PAID AN API BILL IN 3 MONTHS. HIS AGENTS RAN 10,000 TIMES FOR FREE he built a local AI lab under his desk two GPUs 32GB VRAM zero rate limits his agents loop 400 times if they want to his coworkers are still watching the usage dashboard the only thing separating them: llama.cpp + llama-swap every prompt stays on his machine every experiment costs $0 zero invoices bookmark & like this before your next API bill hits

leopardracer

542,412 Aufrufe • vor 2 Monaten

THIS ENGINEER RUNS CLAUDE CODE WITHOUT PAYING ANTHROPIC A SINGLE DOLLAR RTX 5090 under the desk, Qwen 3.5 35B, 140 tokens per second Claude Code pointed at localhost instead of Anthropic’s servers override two environment variables and your API bill drops to zero he built a full-stack Next.js app from scratch to prove it works then showed every bug and limitation other YouTubers hide same setup I run with two GPUs and Qwen 3.6 27B full breakdown ↓

THIS ENGINEER RUNS CLAUDE CODE WITHOUT PAYING ANTHROPIC A SINGLE DOLLAR RTX 5090 under the desk, Qwen 3.5 35B, 140 tokens per second Claude Code pointed at localhost instead of Anthropic’s servers override two environment variables and your API bill drops to zero he built a full-stack Next.js app from scratch to prove it works then showed every bug and limitation other YouTubers hide same setup I run with two GPUs and Qwen 3.6 27B full breakdown ↓

leopardracer

29,756 Aufrufe • vor 2 Monaten

testing Qwen3.5-35B-A3B latest optimized version by UnslothAI on a single RTX 3090. one detailed prompt. zero handholding. watch a 3B model scaffold an entire multifile game project autonomously. the setup: > model: Qwen3.5-35B-A3B (80B total, only 3B active per token) > quant: UD-Q4_K_XL by Unsloth (MXFP4 layers removed in latest update) > speed: 112 tok/s generation, ~130 tok/s prefill > context: 262K tokens > flags: -ngl 99 -c 262144 -np 1 --cache-type-k q8_0 --cache-type-v q8_0 > engine: llama.cpp > agent: Claude Code talk to localhost:8080 (llama.cpp now has native Anthropic API endpoint. no LiteLLM needed) q8_0 KV cache cuts VRAM usage in half vs f16 at 262K. -np 1 is default but worth noting. parallel slots multiply KV cache and at 262K that's an instant OOM. the prompt was more detailed than this but you get the idea: build a space shooter with parallax backgrounds, particle systems, procedural audio, 4 enemy types, boss fights, power-up system, and ship upgrades. 8 JavaScript modules. no libraries. game's called Octopus Invaders. gameplay footage dropping next.

testing Qwen3.5-35B-A3B latest optimized version by UnslothAI on a single RTX 3090. one detailed prompt. zero handholding. watch a 3B model scaffold an entire multifile game project autonomously. the setup: > model: Qwen3.5-35B-A3B (80B total, only 3B active per token) > quant: UD-Q4_K_XL by Unsloth (MXFP4 layers removed in latest update) > speed: 112 tok/s generation, ~130 tok/s prefill > context: 262K tokens > flags: -ngl 99 -c 262144 -np 1 --cache-type-k q8_0 --cache-type-v q8_0 > engine: llama.cpp > agent: Claude Code talk to localhost:8080 (llama.cpp now has native Anthropic API endpoint. no LiteLLM needed) q8_0 KV cache cuts VRAM usage in half vs f16 at 262K. -np 1 is default but worth noting. parallel slots multiply KV cache and at 262K that's an instant OOM. the prompt was more detailed than this but you get the idea: build a space shooter with parallax backgrounds, particle systems, procedural audio, 4 enemy types, boss fights, power-up system, and ship upgrades. 8 JavaScript modules. no libraries. game's called Octopus Invaders. gameplay footage dropping next.

Sudo su

167,035 Aufrufe • vor 4 Monaten

There's a free 120 billion parameter AI you can run on your own computer. Zero internet. Zero API costs. Zero monthly fees. It's called Nvidia Nemotron 3 Super. And it just changed what solopreneurs can do with AI agents. Here's the setup: → Download Ollama. Copy one command into your terminal. → Connect it to OpenClaw in under 10 minutes. → Your AI agent now runs 24/7 on WhatsApp, Discord, or Telegram. → It answers customers, runs tasks, and manages workflows while you sleep. → 256,000 token context window. It reads entire documents without forgetting. No subscriptions. No cloud. No one else touching your data. A full AI employee stack for $0. Save this. Set it up this weekend.

There's a free 120 billion parameter AI you can run on your own computer. Zero internet. Zero API costs. Zero monthly fees. It's called Nvidia Nemotron 3 Super. And it just changed what solopreneurs can do with AI agents. Here's the setup: → Download Ollama. Copy one command into your terminal. → Connect it to OpenClaw in under 10 minutes. → Your AI agent now runs 24/7 on WhatsApp, Discord, or Telegram. → It answers customers, runs tasks, and manages workflows while you sleep. → 256,000 token context window. It reads entire documents without forgetting. No subscriptions. No cloud. No one else touching your data. A full AI employee stack for $0. Save this. Set it up this weekend.

Julian Goldie SEO

29,325 Aufrufe • vor 4 Monaten

six months ago this wasn't happening on 8gb vram. running unsloth's Q4_K_XL quant of gemma 4 26b-a4b-it-qat, a sparse MoE model with only 4b active params on a single rtx 4060 laptop gpu, 8gb vram, 20+ tok/s decode. no cloud, no api, no offload hacks. just a gaming laptop on battery. what makes it fit: google's QAT (quantization aware training), plus MTP (multi token prediction) support in the latest llama.cpp builds. that combo is the single biggest unlock for local inference on low vram. rtx 3060, rtx 3070, gtx 1070, gtx 1080, rtx 4050, rtx 4060, rtx 5050, rtx 5060 — any 6-8gb consumer gpu, old or new — this model runs on it. world cup season, so i told it to build a soccer themed flappy bird clone. one shot, zero iteration, fully playable. six months ago an 8gb model could barely clone vanilla flappy bird. now it's shipping a themed game from a sparse MoE model running locally on a laptop battery. inference benchmarks: - decode throughput: 30 tok/s - context: 64k. this is the real unlock. 64k ctx is what makes a hermes agent loop viable locally on this model, not just single-turn chat. llama.cpp flags: -m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf -c 64000 -cmoe --port 8080 game's deployed on my own site, built and shipped end to end with open source llm, zero closed source api dependency in the pipeline. link in the description. gguf weights on huggingface, link in the comments. pull it down, run it on whatever 8gb card is sitting in your rig. try the game and tell me your score and what you want in v2. local llms on consumer gpus stopped being a meme.

six months ago this wasn't happening on 8gb vram. running unsloth's Q4_K_XL quant of gemma 4 26b-a4b-it-qat, a sparse MoE model with only 4b active params on a single rtx 4060 laptop gpu, 8gb vram, 20+ tok/s decode. no cloud, no api, no offload hacks. just a gaming laptop on battery. what makes it fit: google's QAT (quantization aware training), plus MTP (multi token prediction) support in the latest llama.cpp builds. that combo is the single biggest unlock for local inference on low vram. rtx 3060, rtx 3070, gtx 1070, gtx 1080, rtx 4050, rtx 4060, rtx 5050, rtx 5060 — any 6-8gb consumer gpu, old or new — this model runs on it. world cup season, so i told it to build a soccer themed flappy bird clone. one shot, zero iteration, fully playable. six months ago an 8gb model could barely clone vanilla flappy bird. now it's shipping a themed game from a sparse MoE model running locally on a laptop battery. inference benchmarks: - decode throughput: 30 tok/s - context: 64k. this is the real unlock. 64k ctx is what makes a hermes agent loop viable locally on this model, not just single-turn chat. llama.cpp flags: -m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf -c 64000 -cmoe --port 8080 game's deployed on my own site, built and shipped end to end with open source llm, zero closed source api dependency in the pipeline. link in the description. gguf weights on huggingface, link in the comments. pull it down, run it on whatever 8gb card is sitting in your rig. try the game and tell me your score and what you want in v2. local llms on consumer gpus stopped being a meme.

Alok

60,866 Aufrufe • vor 29 Tagen

i pointed hermes agent at nvidia's nemotron cascade 2 30B-A3B on a single RTX 3090 24GB. IQ4_XS quant by bartowski, 187 tok/s, 625K context. had it discover its own hardware, create an identity file, then build a full GPU marketplace UI from a single prompt. it one shotted it. first attempt no iteration. qwen 3.5 35B-A3B on the same hardware same 3090 24GB took an iteration to recover from a blank screen on the same type of build. 24 days between these two models releasing. same active parameters, completely different architectures and cascade 2 through hermes agent just keeps going. this model goes on and on. feast your eyes. more iterations and tests dropping soon. nvidia really cooked. no special flags needed. nvidia optimized this mamba MoE so well it just runs. flash attention auto enabled, context auto allocated. the model does the work not the config. but i compiled llama.cpp from source and i'm not sure how it performs on other engines. if you ran nemotron on any hardware drop your numbers below. RTX, AMD, Mac, whatever. model, quant, tok/s, engine. i want to see if it holds everywhere or just on llama.cpp.

i pointed hermes agent at nvidia's nemotron cascade 2 30B-A3B on a single RTX 3090 24GB. IQ4_XS quant by bartowski, 187 tok/s, 625K context. had it discover its own hardware, create an identity file, then build a full GPU marketplace UI from a single prompt. it one shotted it. first attempt no iteration. qwen 3.5 35B-A3B on the same hardware same 3090 24GB took an iteration to recover from a blank screen on the same type of build. 24 days between these two models releasing. same active parameters, completely different architectures and cascade 2 through hermes agent just keeps going. this model goes on and on. feast your eyes. more iterations and tests dropping soon. nvidia really cooked. no special flags needed. nvidia optimized this mamba MoE so well it just runs. flash attention auto enabled, context auto allocated. the model does the work not the config. but i compiled llama.cpp from source and i'm not sure how it performs on other engines. if you ran nemotron on any hardware drop your numbers below. RTX, AMD, Mac, whatever. model, quant, tok/s, engine. i want to see if it holds everywhere or just on llama.cpp.

Sudo su

70,791 Aufrufe • vor 3 Monaten

🚨Stop gambling on the wrong local LLM. Most people download a model, watch it crawl, then start over. LLM Hardware Scanner is a CLI tool that scans your actual specs and ranks hundreds of LLMs before you waste the bandwidth: ⤷ Reads your RAM, CPU, and GPU to score real compatibility ⤷ Accounts for MoE activation not just bloated total parameter counts ⤷ Auto-selects the optimal quantization for your exact setup ⤷ Ranks models across quality, speed, context length, and true memory fit ⤷ Works natively with Ollama, llama.cpp, and MLX ⤷ Shows a composite score plus a speed estimate before you download a single byte Not a benchmark site. Not a spec sheet. A hardware-aware ranker that tells you the truth about what will actually run on your machine. One scan. Instant ranking. Zero wasted downloads.

🚨Stop gambling on the wrong local LLM. Most people download a model, watch it crawl, then start over. LLM Hardware Scanner is a CLI tool that scans your actual specs and ranks hundreds of LLMs before you waste the bandwidth: ⤷ Reads your RAM, CPU, and GPU to score real compatibility ⤷ Accounts for MoE activation not just bloated total parameter counts ⤷ Auto-selects the optimal quantization for your exact setup ⤷ Ranks models across quality, speed, context length, and true memory fit ⤷ Works natively with Ollama, llama.cpp, and MLX ⤷ Shows a composite score plus a speed estimate before you download a single byte Not a benchmark site. Not a spec sheet. A hardware-aware ranker that tells you the truth about what will actually run on your machine. One scan. Instant ranking. Zero wasted downloads.

Charlie Hills

37,499 Aufrufe • vor 4 Monaten

Unpopular opinion: OpenClaw just got outplayed. Maxclaw dropped 48 hours ago and it might be the biggest threat OpenClaw has ever seen. Same framework. Same power. Zero setup headaches. • No terminal • No server • No API juggling • No local maintenance Click a button → full agent running 24/7 in the cloud. Connected mine to Telegram in under 60 seconds. Generated images. Generated video. Scheduled daily AI news at 4:30am. It just works. So why would anyone still use OpenClaw? Developers. Privacy purists. People who want full API control. Everyone else? Maxclaw is the shortcut. AI is moving from “chatbot” to “digital employee.” This is what that looks like. 🚀

Unpopular opinion: OpenClaw just got outplayed. Maxclaw dropped 48 hours ago and it might be the biggest threat OpenClaw has ever seen. Same framework. Same power. Zero setup headaches. • No terminal • No server • No API juggling • No local maintenance Click a button → full agent running 24/7 in the cloud. Connected mine to Telegram in under 60 seconds. Generated images. Generated video. Scheduled daily AI news at 4:30am. It just works. So why would anyone still use OpenClaw? Developers. Privacy purists. People who want full API control. Everyone else? Maxclaw is the shortcut. AI is moving from “chatbot” to “digital employee.” This is what that looks like. 🚀

Julian Goldie SEO

39,927 Aufrufe • vor 4 Monaten

Meet WebBrain: An Open-Source, Local-First AI Browser Agent That Reads Pages and Automates Tasks in Chrome and Firefox WebBrain lives inside your browser and can run entirely on your own local model — no cloud, no account, no data leaving your machine. Most "AI browser agents" are a chat box that pastes your page into someone else's server. That's not an agent that lives where you browse — and WebBrain draws a very clear line between the two. It's an open-source (MIT), local-first browser agent for Chrome and Firefox. It runs inside your existing authenticated session, on a model you pick — so with llama.cpp or Ollama, nothing leaves your machine. Here's what's actually interesting: → Two modes, cleanly separated. Ask reads the page (read-only, content scripts). Act clicks and types through the Chrome DevTools Protocol (chrome.debugger) — trusted input events that modern sites honor, reaching cross-origin iframes and shadow DOM. → UI-first by design. For anything that submits, sends, or buys, it drives the visible UI and refuses to hit REST/GraphQL endpoints directly. It starts read-only and asks before consequential actions. → Bring any model. llama.cpp, Ollama, LM Studio, vLLM — or OpenAI, Claude, Gemini, DeepSeek, Groq, OpenRouter. Recommended local: Qwen 3.6 35B (Qwen3.6-35B-A3B), which beat Gemma 4 on the project's screenshot benchmark. → Tuned for cost and privacy. Token-conscious screenshots, oldest-first context trimming, a dedicated vision model, 40+ tools (~20 in Compact mode). No telemetry. No accounts. Full analysis: GitHub Repo: Chrome Extension: Firefox Add-on: Portal:

Meet WebBrain: An Open-Source, Local-First AI Browser Agent That Reads Pages and Automates Tasks in Chrome and Firefox WebBrain lives inside your browser and can run entirely on your own local model — no cloud, no account, no data leaving your machine. Most "AI browser agents" are a chat box that pastes your page into someone else's server. That's not an agent that lives where you browse — and WebBrain draws a very clear line between the two. It's an open-source (MIT), local-first browser agent for Chrome and Firefox. It runs inside your existing authenticated session, on a model you pick — so with llama.cpp or Ollama, nothing leaves your machine. Here's what's actually interesting: → Two modes, cleanly separated. Ask reads the page (read-only, content scripts). Act clicks and types through the Chrome DevTools Protocol (chrome.debugger) — trusted input events that modern sites honor, reaching cross-origin iframes and shadow DOM. → UI-first by design. For anything that submits, sends, or buys, it drives the visible UI and refuses to hit REST/GraphQL endpoints directly. It starts read-only and asks before consequential actions. → Bring any model. llama.cpp, Ollama, LM Studio, vLLM — or OpenAI, Claude, Gemini, DeepSeek, Groq, OpenRouter. Recommended local: Qwen 3.6 35B (Qwen3.6-35B-A3B), which beat Gemma 4 on the project's screenshot benchmark. → Tuned for cost and privacy. Token-conscious screenshots, oldest-first context trimming, a dedicated vision model, 40+ tools (~20 in Compact mode). No telemetry. No accounts. Full analysis: GitHub Repo: Chrome Extension: Firefox Add-on: Portal:

Marktechpost AI

202,626 Aufrufe • vor 18 Tagen

this is what a 24gb VRAM builds in 2026. one prompt. ten files. 3,483 lines of code. zero handholding. i gave Qwen3.5-35B-A3B a single detailed spec describing the full game architecture and hit enter. enemy types, particle systems, procedural audio, powerups, boss fights, ship upgrades, parallax backgrounds, everything in one message. the model planned the file structure itself, wrote every module in dependency order, wired all the imports, and served the game on port 3001. it ran on first load. when it hit a bug in collision detection it read its own error output, found the issue, fixed it, and kept building. this is pure agent loop running on local hardware. what you're looking at is pixelated octopus aliens with tentacle animations, 4 layer parallax space background with planets at different depths, a full particle system handling explosions and ink splatter and engine trails and bullet impacts, procedural audio through Web Audio API with zero sound files loaded, unleash mode with combo multiplier, boss fights every 5 levels, ship upgrades that unlock as you progress. no libraries. no frameworks. vanilla JS and Canvas. 3B active parameters. single RTX 3090. llama.cpp with q8_0 KV cache at 262K context. Claude Code pointed at localhost:8080 through the native Anthropic endpoint. no API costs. 112 tok/s. a GPU you can buy used for $800. game is called Octopus Invaders and i actually like playing it.

this is what a 24gb VRAM builds in 2026. one prompt. ten files. 3,483 lines of code. zero handholding. i gave Qwen3.5-35B-A3B a single detailed spec describing the full game architecture and hit enter. enemy types, particle systems, procedural audio, powerups, boss fights, ship upgrades, parallax backgrounds, everything in one message. the model planned the file structure itself, wrote every module in dependency order, wired all the imports, and served the game on port 3001. it ran on first load. when it hit a bug in collision detection it read its own error output, found the issue, fixed it, and kept building. this is pure agent loop running on local hardware. what you're looking at is pixelated octopus aliens with tentacle animations, 4 layer parallax space background with planets at different depths, a full particle system handling explosions and ink splatter and engine trails and bullet impacts, procedural audio through Web Audio API with zero sound files loaded, unleash mode with combo multiplier, boss fights every 5 levels, ship upgrades that unlock as you progress. no libraries. no frameworks. vanilla JS and Canvas. 3B active parameters. single RTX 3090. llama.cpp with q8_0 KV cache at 262K context. Claude Code pointed at localhost:8080 through the native Anthropic endpoint. no API costs. 112 tok/s. a GPU you can buy used for $800. game is called Octopus Invaders and i actually like playing it.

Sudo su

153,735 Aufrufe • vor 4 Monaten

single RTX 3090. 24 GB VRAM. Qwen3.5-35B-A3B. 4-bit quant, 113 tokens per second at full 262K context harnessing Claude Code locally with no API, no subscription, no proxy. told it what it is. 30 Mamba2 layers, 10 attention, 256 experts, 8 active per token. said "build something that shows off what you can do." it visualized its own architecture. interactive. tokens flowing through layers. 256 experts lighting up on routing. served in the browser from the same GPU running inference. single prompt. then i said level up. 3D. Three.js. separate files. flythrough camera. clickable layers. it planned first, scaffolded 6 files, hit one API bug, fixed it itself, then optimized for smooth framerate. two iterations to a working 3D neural network explorer. llama.cpp just merged a native Anthropic endpoint. Claude Code points at localhost. the whole setup is two commands. no LiteLLM. no proxy config. the open source models coming out of china right now are genuinely changing what's possible on consumer hardware. respect to the Qwen team. this is acceleration.

single RTX 3090. 24 GB VRAM. Qwen3.5-35B-A3B. 4-bit quant, 113 tokens per second at full 262K context harnessing Claude Code locally with no API, no subscription, no proxy. told it what it is. 30 Mamba2 layers, 10 attention, 256 experts, 8 active per token. said "build something that shows off what you can do." it visualized its own architecture. interactive. tokens flowing through layers. 256 experts lighting up on routing. served in the browser from the same GPU running inference. single prompt. then i said level up. 3D. Three.js. separate files. flythrough camera. clickable layers. it planned first, scaffolded 6 files, hit one API bug, fixed it itself, then optimized for smooth framerate. two iterations to a working 3D neural network explorer. llama.cpp just merged a native Anthropic endpoint. Claude Code points at localhost. the whole setup is two commands. no LiteLLM. no proxy config. the open source models coming out of china right now are genuinely changing what's possible on consumer hardware. respect to the Qwen team. this is acceleration.

Sudo su

110,206 Aufrufe • vor 4 Monaten

A CHINESE LAB VISUALIZED 2,000 NEURONS AND 1,191,000 SYNAPSES IN 3D. HERMES AGENT HIT 140,000 STARS RUNNING THE SAME NETWORK LOCALLY FOR $0 hermes is the first agent that writes its own skills from experience. complete a task once and it saves the procedure as a markdown file for next time agents with 20+ self-created skills complete similar tasks 40% faster than fresh instances. less time and less tokens to get the same result qwen 3.6 35b outperforms last year's 120b models and runs on 20gb of memory. the intelligence that needed a data center now fits on your desk setup takes 30 minutes. install lm studio, pull qwen 3.6, install hermes, point it at localhost. zero api fees, zero data leaving your machine most people pay for cloud agents that forget everything between sessions. the ones running hermes locally in 2026 will look very far ahead in 2028 bookmark this and read the article below

A CHINESE LAB VISUALIZED 2,000 NEURONS AND 1,191,000 SYNAPSES IN 3D. HERMES AGENT HIT 140,000 STARS RUNNING THE SAME NETWORK LOCALLY FOR $0 hermes is the first agent that writes its own skills from experience. complete a task once and it saves the procedure as a markdown file for next time agents with 20+ self-created skills complete similar tasks 40% faster than fresh instances. less time and less tokens to get the same result qwen 3.6 35b outperforms last year's 120b models and runs on 20gb of memory. the intelligence that needed a data center now fits on your desk setup takes 30 minutes. install lm studio, pull qwen 3.6, install hermes, point it at localhost. zero api fees, zero data leaving your machine most people pay for cloud agents that forget everything between sessions. the ones running hermes locally in 2026 will look very far ahead in 2028 bookmark this and read the article below

starmex

13,420 Aufrufe • vor 1 Monat

GOOGLE JUST MADE EVERY CHATBOT FEEL SLOW Diffusion Gemma 26b doesn’t predict word by word, it generates 256 tokens in parallel using bi-directional attention, like stable diffusion but for language it’s MoE so only 3.8B params activate during inference, fits on a single RTX 4090 with 18GB VRAM and you can run it right now via llama.cpp, the same engine powering every tool in my local llm playbook the whole ecosystem exists for moments like this ↓

GOOGLE JUST MADE EVERY CHATBOT FEEL SLOW Diffusion Gemma 26b doesn’t predict word by word, it generates 256 tokens in parallel using bi-directional attention, like stable diffusion but for language it’s MoE so only 3.8B params activate during inference, fits on a single RTX 4090 with 18GB VRAM and you can run it right now via llama.cpp, the same engine powering every tool in my local llm playbook the whole ecosystem exists for moments like this ↓

leopardracer

27,140 Aufrufe • vor 1 Monat

Alibaba just dropped a free AI that codes like a $150/hr developer. And it runs on your laptop. Qwen 3 Coder Next has 80 billion parameters but only uses 3 billion at a time. So you get massive model power with tiny model speed. The real kicker? It handles 256,000 tokens of context. That's your entire codebase. Every file. All at once. People are building 2D games in seconds. Full IDE integrations that write, test, and fix code automatically. Custom business tools without hiring devs. It works with OpenClaw, Claude Code, and every major coding tool. No complicated setup. Just download and go. Completely open source. Zero API queues. You own it. Your competitors are using this right now.

Alibaba just dropped a free AI that codes like a $150/hr developer. And it runs on your laptop. Qwen 3 Coder Next has 80 billion parameters but only uses 3 billion at a time. So you get massive model power with tiny model speed. The real kicker? It handles 256,000 tokens of context. That's your entire codebase. Every file. All at once. People are building 2D games in seconds. Full IDE integrations that write, test, and fix code automatically. Custom business tools without hiring devs. It works with OpenClaw, Claude Code, and every major coding tool. No complicated setup. Just download and go. Completely open source. Zero API queues. You own it. Your competitors are using this right now.

Julian Goldie SEO

30,259 Aufrufe • vor 5 Monaten

A single RTX 4090 (24 GB VRAM) can run the updated gemma 4 31B (dense) model with a 190,000 context window at 33 tokens/second. The VRAM barrier is dying. Google quietly updated Gemma 4, and Unsloth immediately compiled the new quants. I built llama.cpp from source on Ubuntu 22 to benchmark it. Google's stealth update 2 days ago enabled uniform Flash Attention 4 on Hopper to boost prefill and patched the chat template to improve tool calling. The agentic reasoning gains on the benchmark charts are massive: TB2 (Agents): +4.5% (to 25.8%) Tau2 (Telecom): +10.1% (to 62.7%) Running on Ubuntu 22, CUDA 13.0 with a single NVIDIA GeForce RTX 4090. Here is the exact step by step benchmarking process with a massive 28k tokens prompt and the commands I used to squeeze out maximum context without killing my throughput: # 1. The Baseline (Unquantized KV Cache) I started with full GPU offload (-ngl 99) and pushed the context to 40k. llama.cpp flags: ./build/bin/llama-server -m gemma-4-31B-it-qat-UD-Q4_K_XL.gguf -ngl 99 -c 40000 -fa on --port 8080 -v VRAM: 23.8 GB (maxed out on card) Throughput: Prefill: 2198.81 t/s | Decode: 35.77 t/s (with 28k tokens prompt) # 2. The CPU Split Trap I tried stretching to 80k context by offloading layers to the CPU (-ngl 52). llama.cpp flags: ./build/bin/llama-server -m gemma-4-31B-it-qat-UD-Q4_K_XL.gguf -c 80000 -ngl 52 -fa on --port 8080 -v Throughput: Prefill: 1212.73 t/s | Decode: 5 t/s (with 28k tokens prompt) # 3. The KV Quantization Breakthrough Instead of spilling layers to the CPU, I kept the model fully on card (-ngl 99) but enabled 8-bit KV cache quantization to free up VRAM. flags: ./build/bin/llama-server -m gemma-4-31B-it-qat-UD-Q4_K_XL.gguf -c 100000 --cache-type-k q8_0 --cache-type-v q8_0 -ngl 99 --port 8080 -v VRAM: 23.9 GB Throughput: Prefill: 2139.68 t/s | Decode: 32 t/s (with 28k tokens prompt) Result: 100k tokens of context on a single GPU with practically zero speed loss (and minimal intelligence loss). # 4. The Limit Test (Q4 KV Cache) To find the absolute breaking point, I dropped the KV cache to 4 bit (q4_0) and set -c 190000. flags: ./build/bin/llama-server -m gemma-4-31B-it-qat-UD-Q4_K_XL.gguf -c 190000 --cache-type-k q4_0 --cache-type-v q4_0 -ngl 99 --port 8080 -v VRAM: 23.8 GB Throughput: Prefill: 2206.66 t/s | Decode: 33 t/s (with 28k tokens prompt) (Note: Pushing it to 220k required dropping to -ngl 58 again, which immediately penalized decode down to 17 t/s). # The Tradeoff: For Max Reasoning: Keep your KV cache unquantized (f16). You get pristine reasoning but hit a strict 40k context ceiling. For Massive Document Retrieval: If you need to feed the model giant codebases, use --cache-type-k q4_0. Getting 190k context at 33 tokens/second on a consumer desktop with a 31b dense model is a cheat code. If you’re rocking a single 3090 or 4090 and slept on Gemma 4 earlier, this update is your cue to dust off the terminal. Hugging Face links to the Unsloth QAT quants are in the replies below.

A single RTX 4090 (24 GB VRAM) can run the updated gemma 4 31B (dense) model with a 190,000 context window at 33 tokens/second. The VRAM barrier is dying. Google quietly updated Gemma 4, and Unsloth immediately compiled the new quants. I built llama.cpp from source on Ubuntu 22 to benchmark it. Google's stealth update 2 days ago enabled uniform Flash Attention 4 on Hopper to boost prefill and patched the chat template to improve tool calling. The agentic reasoning gains on the benchmark charts are massive: TB2 (Agents): +4.5% (to 25.8%) Tau2 (Telecom): +10.1% (to 62.7%) Running on Ubuntu 22, CUDA 13.0 with a single NVIDIA GeForce RTX 4090. Here is the exact step by step benchmarking process with a massive 28k tokens prompt and the commands I used to squeeze out maximum context without killing my throughput: # 1. The Baseline (Unquantized KV Cache) I started with full GPU offload (-ngl 99) and pushed the context to 40k. llama.cpp flags: ./build/bin/llama-server -m gemma-4-31B-it-qat-UD-Q4_K_XL.gguf -ngl 99 -c 40000 -fa on --port 8080 -v VRAM: 23.8 GB (maxed out on card) Throughput: Prefill: 2198.81 t/s | Decode: 35.77 t/s (with 28k tokens prompt) # 2. The CPU Split Trap I tried stretching to 80k context by offloading layers to the CPU (-ngl 52). llama.cpp flags: ./build/bin/llama-server -m gemma-4-31B-it-qat-UD-Q4_K_XL.gguf -c 80000 -ngl 52 -fa on --port 8080 -v Throughput: Prefill: 1212.73 t/s | Decode: 5 t/s (with 28k tokens prompt) # 3. The KV Quantization Breakthrough Instead of spilling layers to the CPU, I kept the model fully on card (-ngl 99) but enabled 8-bit KV cache quantization to free up VRAM. flags: ./build/bin/llama-server -m gemma-4-31B-it-qat-UD-Q4_K_XL.gguf -c 100000 --cache-type-k q8_0 --cache-type-v q8_0 -ngl 99 --port 8080 -v VRAM: 23.9 GB Throughput: Prefill: 2139.68 t/s | Decode: 32 t/s (with 28k tokens prompt) Result: 100k tokens of context on a single GPU with practically zero speed loss (and minimal intelligence loss). # 4. The Limit Test (Q4 KV Cache) To find the absolute breaking point, I dropped the KV cache to 4 bit (q4_0) and set -c 190000. flags: ./build/bin/llama-server -m gemma-4-31B-it-qat-UD-Q4_K_XL.gguf -c 190000 --cache-type-k q4_0 --cache-type-v q4_0 -ngl 99 --port 8080 -v VRAM: 23.8 GB Throughput: Prefill: 2206.66 t/s | Decode: 33 t/s (with 28k tokens prompt) (Note: Pushing it to 220k required dropping to -ngl 58 again, which immediately penalized decode down to 17 t/s). # The Tradeoff: For Max Reasoning: Keep your KV cache unquantized (f16). You get pristine reasoning but hit a strict 40k context ceiling. For Massive Document Retrieval: If you need to feed the model giant codebases, use --cache-type-k q4_0. Getting 190k context at 33 tokens/second on a consumer desktop with a 31b dense model is a cheat code. If you’re rocking a single 3090 or 4090 and slept on Gemma 4 earlier, this update is your cue to dust off the terminal. Hugging Face links to the Unsloth QAT quants are in the replies below.

Alok

73,053 Aufrufe • vor 2 Tagen

I freaked out when my WiFi router suddenly died. then realized my autonomous Hermes agent is running fully local, nothing stopped. Hermes Agent + Gemma 4 26B A4B QAT MoE, 100% local on my laptop, building my side projects while I scroll my phone zero API calls. zero cost. 100% private. fully offline. This might be the most satisfying thing I’ve watched in a while. last post: showed Hermes + local Gemma 4 26B pull off backtest a trading strategy. this time I asked it to develop something i'd use myself everyday: # A full unpacked extension with: - React side panel UI - Local llama.cpp backend (offline AI) - Live tab sync + status tracking - Auto context extraction via Readability.js Vision on Demand → captures viewport screenshots as compressed JPEGs Deterministic action system -> model outputs tokens -> directly controls page scrolling It planned everything first. Then started executing step by step. all i did was say 'ok'. only once. # What’s wild: - It reports back after every phase - Auto compresses context when nearing limits - Actualy, stays on track llama.cpp flags: -m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf -c 64000 --cache-type-k q8_0 --cache-type-v q8_0 --port 8080 # Performance on a single NVIDIA RTX 4060 (8GB VRAM) + 16 GB DDR4 RAM Gaming Laptop: - 300 tokens/sec prefill - 25+ tokens/sec decode More than usable for real dev workflows. This isn’t AI demo territory anymore. This is autonomous local software actually building things.

I freaked out when my WiFi router suddenly died. then realized my autonomous Hermes agent is running fully local, nothing stopped. Hermes Agent + Gemma 4 26B A4B QAT MoE, 100% local on my laptop, building my side projects while I scroll my phone zero API calls. zero cost. 100% private. fully offline. This might be the most satisfying thing I’ve watched in a while. last post: showed Hermes + local Gemma 4 26B pull off backtest a trading strategy. this time I asked it to develop something i'd use myself everyday: # A full unpacked extension with: - React side panel UI - Local llama.cpp backend (offline AI) - Live tab sync + status tracking - Auto context extraction via Readability.js Vision on Demand → captures viewport screenshots as compressed JPEGs Deterministic action system -> model outputs tokens -> directly controls page scrolling It planned everything first. Then started executing step by step. all i did was say 'ok'. only once. # What’s wild: - It reports back after every phase - Auto compresses context when nearing limits - Actualy, stays on track llama.cpp flags: -m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf -c 64000 --cache-type-k q8_0 --cache-type-v q8_0 --port 8080 # Performance on a single NVIDIA RTX 4060 (8GB VRAM) + 16 GB DDR4 RAM Gaming Laptop: - 300 tokens/sec prefill - 25+ tokens/sec decode More than usable for real dev workflows. This isn’t AI demo territory anymore. This is autonomous local software actually building things.

Alok

56,039 Aufrufe • vor 23 Tagen

Boris Cherny, the creator of Claude Code at Anthropic, just explained how he runs thousands of agents from his phone while he sleeps in this talk he breaks down exactly how he manages thousands of agents from his phone: - the 14% you lose to CLAUDE.md before typing a word - the architecture behind running a few thousand agents overnight - why the system matters more than the tool - the knowledge structure that makes everything findable, connected, and useful if you've been using Claude for months and every conversation still starts from scratch with zero memory and zero context, you don't have a system. you have a chat history instead of another show tonight, watch this make sure to bookmark it before it gets lost in your feed the guide is in the article below

Boris Cherny, the creator of Claude Code at Anthropic, just explained how he runs thousands of agents from his phone while he sleeps in this talk he breaks down exactly how he manages thousands of agents from his phone: - the 14% you lose to CLAUDE.md before typing a word - the architecture behind running a few thousand agents overnight - why the system matters more than the tool - the knowledge structure that makes everything findable, connected, and useful if you've been using Claude for months and every conversation still starts from scratch with zero memory and zero context, you don't have a system. you have a chat history instead of another show tonight, watch this make sure to bookmark it before it gets lost in your feed the guide is in the article below

Khairallah AL-Awady

86,247 Aufrufe • vor 1 Monat

HERMES AGENT NOW RUNS ON AN 8GB LAPTOP GPU JUST AS EASILY AS IT RUNS ON A 128GB MINI PC Nous Research shipped the official Hermes Agent Desktop App this week. Someone pointed it at a local llama server running on an RTX 4060 with 16GB system RAM. The integration took two minutes The model behind it: Gemma 4 26B MoE, QAT quantized, running on 8GB of VRAM. A 60k token prompt held a stable 20 tokens a second, flat, no slowdown as context grew. The flags were nothing exotic, just -cmoe -c 248000 on llama.cpp What that 8GB setup does out of the box: reads and patches its own code, runs it in a terminal, debugs errors, manages GitHub repos, spawns sub-agents for parallel work. Browses the web with vision to debug a UI. Schedules cron jobs in plain language. Connects to Notion, Google Workspace, Linear, and Obsidian to manage tasks on its own That's the same agent layer running on a Minisforum MS-S1 MAX with 128GB of unified memory, 96GB of it to the GPU, holding a 120B model at 56 tokens a second instead of a 26B model at 20. Same software, same tool execution, same zero API key. The only thing that changes between an $800 laptop and a $2,000 mini PC is how big a model you can afford to run underneath it The barrier to running a real autonomous agent locally didn't just drop. It dropped all the way down to hardware most people already own

HERMES AGENT NOW RUNS ON AN 8GB LAPTOP GPU JUST AS EASILY AS IT RUNS ON A 128GB MINI PC Nous Research shipped the official Hermes Agent Desktop App this week. Someone pointed it at a local llama server running on an RTX 4060 with 16GB system RAM. The integration took two minutes The model behind it: Gemma 4 26B MoE, QAT quantized, running on 8GB of VRAM. A 60k token prompt held a stable 20 tokens a second, flat, no slowdown as context grew. The flags were nothing exotic, just -cmoe -c 248000 on llama.cpp What that 8GB setup does out of the box: reads and patches its own code, runs it in a terminal, debugs errors, manages GitHub repos, spawns sub-agents for parallel work. Browses the web with vision to debug a UI. Schedules cron jobs in plain language. Connects to Notion, Google Workspace, Linear, and Obsidian to manage tasks on its own That's the same agent layer running on a Minisforum MS-S1 MAX with 128GB of unified memory, 96GB of it to the GPU, holding a 120B model at 56 tokens a second instead of a 26B model at 20. Same software, same tool execution, same zero API key. The only thing that changes between an $800 laptop and a $2,000 mini PC is how big a model you can afford to run underneath it The barrier to running a real autonomous agent locally didn't just drop. It dropped all the way down to hardware most people already own

NO1ennn

40,079 Aufrufe • vor 1 Monat

AN OPENCLAW AI AGENT JUST DESIGNED A 3D CHARACTER IN BLENDER AND SENT IT TO A 3D PRINTER. SHE DIDN'T TOUCH A SINGLE BUTTON. This is what running an OpenClaw AI agent on a Mac Mini actually looks like. She described what she wanted. The agent did everything else. Blender opened automatically. Character designed. Put in space. Animated with blinking eyes and stomping feet. Landing page built and live. Model sliced. File sent to 3D printer. Printer started moving. All from conversation. Zero code. Zero design skills. Zero manual work. Here's why this matters beyond the demo: Custom 3D printed figures sell for $30-$200 each on Etsy and at conventions. The people making serious money aren't artists. They're the ones who can produce custom designs fast. An AI agent that goes from description to printed figure autonomously is a business model not a toy. "I was just talking to the guy and he made it happen." Bookmark this. Full demo in the video below.

AN OPENCLAW AI AGENT JUST DESIGNED A 3D CHARACTER IN BLENDER AND SENT IT TO A 3D PRINTER. SHE DIDN'T TOUCH A SINGLE BUTTON. This is what running an OpenClaw AI agent on a Mac Mini actually looks like. She described what she wanted. The agent did everything else. Blender opened automatically. Character designed. Put in space. Animated with blinking eyes and stomping feet. Landing page built and live. Model sliced. File sent to 3D printer. Printer started moving. All from conversation. Zero code. Zero design skills. Zero manual work. Here's why this matters beyond the demo: Custom 3D printed figures sell for $30-$200 each on Etsy and at conventions. The people making serious money aren't artists. They're the ones who can produce custom designs fast. An AI agent that goes from description to printed figure autonomously is a business model not a toy. "I was just talking to the guy and he made it happen." Bookmark this. Full demo in the video below.

SCOTTY BEAM

17,209 Aufrufe • vor 1 Monat

first impressions of qwen 3.5 27B dense on a single RTX 3090. 35 tok/s. from 4K all the way to 300K+ context. no speed drop. hermes 4.3 started at 35 and degraded to 15 as context filled. qwen dense holds. MoE held 112 flat. 3x faster but only 3B of 35B active per token. architecture tradeoff. Q4_K_M on 16.7GB. native context 262K. pushed past training limit to 376K before VRAM ceiling on 24GB. tried q8 KV cache at 262K, speed collapsed to 11 tok/s. q4_0 KV is the sweet spot. flash attention mandatory. built in reasoning mode. the model thinks step by step before it answers. full chain of thought surviving Q4 quant. 1,799+ token thinking chains with self correction loops. on a single consumer GPU. gave it one prompt: "build a realtime particle galaxy simulation in one HTML file." 3,340 tokens. 95 seconds. one shot. ran on first load. full reasoning and coding in the video below. optimal config if you want to skip the hours of testing: llama-server -ngl 99 -c 262144 -fa on --cache-type-k q4_0 --cache-type-v q4_0 this is just the warmup. octopus invaders is next: 10 files, 3,400+ lines, zero steering. the prompt hermes quit at 22%. already more impressed than expected. full results coming soon.

first impressions of qwen 3.5 27B dense on a single RTX 3090. 35 tok/s. from 4K all the way to 300K+ context. no speed drop. hermes 4.3 started at 35 and degraded to 15 as context filled. qwen dense holds. MoE held 112 flat. 3x faster but only 3B of 35B active per token. architecture tradeoff. Q4_K_M on 16.7GB. native context 262K. pushed past training limit to 376K before VRAM ceiling on 24GB. tried q8 KV cache at 262K, speed collapsed to 11 tok/s. q4_0 KV is the sweet spot. flash attention mandatory. built in reasoning mode. the model thinks step by step before it answers. full chain of thought surviving Q4 quant. 1,799+ token thinking chains with self correction loops. on a single consumer GPU. gave it one prompt: "build a realtime particle galaxy simulation in one HTML file." 3,340 tokens. 95 seconds. one shot. ran on first load. full reasoning and coding in the video below. optimal config if you want to skip the hours of testing: llama-server -ngl 99 -c 262144 -fa on --cache-type-k q4_0 --cache-type-v q4_0 this is just the warmup. octopus invaders is next: 10 files, 3,400+ lines, zero steering. the prompt hermes quit at 22%. already more impressed than expected. full results coming soon.

Sudo su

120,788 Aufrufe • vor 4 Monaten