
Sudo su
@sudoingX • 30,358 subscribers
GPU/local LLM. more RAM and OSS... everywhere
Shorts
Videos

i think spacex browser game just fixed my fried focus, lol spent 20 min last night manually docking a crew dragon to the iss and my head went dead quiet for the first time in weeks. it's the real spacex sim, free, runs in the browser, no install. ease the translation in, null your rotation rate, hold the crosshairs dead center, soft capture. and that's the actual manual procedure the crew falls back on if autonomous docking ever drops. go dock something tonight.
Sudo su151,699 Aufrufe • vor 4 Tagen

ok this is wild. 10 year old gtx 1080 8gb pascal card running qwen3 8b locally at 18-20 tok/s via hermes agent and it's actually doing the thing. asked it to build a wireworld cellular automata simulator with 10 tests. autonomous run, no hand holding. expected it to fail on the tool calls. that's not what happened. write_file works. browser_navigate works. terminal commands work. file ops, package installs, version probes, environment setup. agent is firing tool calls cleanly and the model is reasoning about next steps at 18-20 tok/s. on hardware that pre dates "agentic" as a word. it even hit an npm install fail because node 12 is too old. didn't crash. didn't ask me. just started bootstrapping nvm on its own to fix the environment. 10 minutes in. 40% context used. 7.5gb of 8gb vram occupied. still going. i did not think this would work on this hardware. this is the most i've been wrong this month.
Sudo su91,555 Aufrufe • vor 8 Tagen

update: qwen 3.6 27b dense q4 just one shotted octopus invaders game on a single 3090. hermes agent drove the whole thing, ~41 tok/s gen 21gb vram at full 262k context, thinking mode on. one prompt in and the canonical multi-file space shooter benchmark out, the same exact prompt i ran on qwen 3.5 27b dense back in march on the same card. 3.5 needed one external scope bug fix before the game would even load on first play. 3.6 needed nothing. 11 of 11 files written, 2411 lines of code, zero steering interventions, zero external fixes, playable on first load. 16 minutes 41 seconds wall clock from prompt to playable. consumer tier king on a single 3090 is locked tonight, and the silicon underneath my desk did not change between march and now. the open source ecosystem just moved the floor. watch it ship itself, the full 16 minutes 41 seconds sped to 3 minutes 45, no human touched the keyboard between the first prompt and the final frame.
Sudo su121,971 Aufrufe • vor 24 Tagen

Qwopus on a single RTX 3090. Claude Opus 4.6 reasoning distilled into Qwen 3.5 27B dense, running through Claude's own coding agent (claude code). 29-35 tok/s with thinking mode on. the jinja bug that kills thinking on base Qwen doesn't carry over. harness and model matched. the base model would pause mid task on Claude Code. just stop generating. that's why i ran it through OpenCode, which handles stalled states automatically. this distilled version doesn't stall. it waits for tool outputs, reads them, selfcorrects when something breaks, and keeps going. i gave it a benchmark analysis task. went 9 minutes autonomous. wrote a README nobody asked for. zero steering. video is 5x speed but fully uncut. if you have a 3090, you can run this right now. free. no API. no subscription. opus structured reasoning on localhost. octopus invaders is next. same prompt that base qwen passed in 13 minutes and hermes 4.3 failed on 2x the hardware. i want to see if the distillation changes the outcome or just the style. more data soon.
Sudo su295,005 Aufrufe • vor 2 Monaten

first test results are in. qwen 3.6 27b dense just banged 10 out of 10 on a single rtx 3090 24gb tier at 40 tok/s. no quant tricks. no fused kernels. just q4_k_m straight cut on llama.cpp. i wrote a particle swarm benchmark this morning, fed it the prompt, and the model autonomously built a 500 particle boids flocking system. velocity driven hue, density based brightness, trail blend rendering, mouse attraction physics, click bursts, drag paint. then it used browser automation to test its own work, found the failing tests, iterated through the code, patched tests.js, and landed all 10 green on its own. i sat there hooked for 8 minutes playing. simple but mesmerizing. mouse trails build beautiful patterns, palette cycles with space, click sends particles flying, drag paints through the swarm. simplicity that hooks you. i'll open source this prompt and the build soon so anyone can reproduce it as their own benchmark. this is the first of 5 single file agent tests i wrote for this model. four more coming. octopus invaders flagship after as final. watch the full video below. see it autonomously build from one prompt. haven't slept well since this model dropped yesterday.
Sudo su132,123 Aufrufe • vor 1 Monat

nvidia's 3B mamba destroyed alibaba's 3B deltanet on the same RTX 3090. only 24 days between releases. same active parameters, same VRAM tier, completely different architectures. nemotron cascade 2: 187 tok/s. flat from 4K to 625K context. zero speed loss. flags: -ngl 99 -np 1. that's it. no context flags, no KV cache tricks. auto-allocates 625K. qwen 3.5 35B-A3B: 112 tok/s. flat from 4K to 262K context. zero speed loss. flags: -ngl 99 -np 1 -c 262144 --cache-type-k q8_0 --cache-type-v q8_0. needed KV cache quantization to fit 262K. both models held a flat line across every context level. both architectures are context-independent. but nvidia's mamba2 is 67% faster at generating tokens on the exact same hardware and needs fewer flags to get there. same node, same GPU, same everything. the only variable is the model. gold medal math olympiad winner running at 187 tokens per second on single RTX 3090 a card from 6 years ago. nvidia cooked.
Sudo su186,203 Aufrufe • vor 2 Monaten

this is what 12 gigs of VRAM built in 2026. a 9 billion parameter model running on a 5 year old RTX 3060 wrote a full space shooter from a single prompt. blank screen on first try. i came back with a bug list and the same model on the same card fixed every issue across 11 files without touching a single line myself. enemies still looked wrong so i pushed another iteration and now the game has pixel art octopi, particle effects, screen shake, projectile physics and a combo system. all running locally on a card that was designed to play fortnite. three iterations. zero cloud. zero API calls. every token generated on hardware sitting under my desk. the model reads its own code, finds what's broken, patches it, validates syntax and restarts the server. i just describe what's wrong and it handles the rest. people are paying monthly subscriptions to type into a browser tab and wait for a server farm to respond. meanwhile a GPU you can find used on ebay is running a full autonomous hermes agent framework with 31 tools, 128K context window and thinking mode generating at 29 tokens per second nonstop. the game still needs work. level upgrades don't trigger and boss fights need tuning. but the fact that i'm iterating on gameplay balance instead of debugging whether the code runs at all tells you where this is headed. every iteration the game gets better on the same hardware. same 12 gigs. same 9 billion parameters. same RTX 3060 from 5 years ago your GPU is not a gaming card anymore. it's a local AI lab that never sends your data anywhere.
Sudo su170,186 Aufrufe • vor 2 Monaten

okay the fuss around hermes agent is not just air. this thing has substance. installed it on a single RTX 3090 running Qwen 3.5 27B base (Q4_K_M, 262K context, 29-35 tok/s). fully local. my machine my data. first thing i did was tell it to discover itself. find its own model weights, check its own GPU, read its own server flags, and write its own identity document. it did all of it autonomously. nvidia-smi, process grep, file writes. clean execution. the TUI is genuinely premium. dark theme, ASCII art, color coded tool calls with execution times, real time streaming. you actually enjoy watching it work. 29 tools. 80 skills (that's what it reports on boot). file ops, terminal, browser automation, code execution, cron scheduling, subagent delegation. and it has persistent memory across sessions. setup took 5 minutes. one curl install, setup wizard, point to localhost:8080/v1, done. dropping qwopus for this test btw. distilled models compress reasoning and lose precision on real coding tasks. base model only from here. more experiments coming. octopus invaders (the same game that broke qwopus) will be built using hermes agent next. comparing flow and results against claude code on the same model. if you want to run local AI agents on real hardware this one deserves a serious look.
Sudo su162,022 Aufrufe • vor 2 Monaten

testing Qwen3.5-35B-A3B latest optimized version by UnslothAI on a single RTX 3090. one detailed prompt. zero handholding. watch a 3B model scaffold an entire multifile game project autonomously. the setup: > model: Qwen3.5-35B-A3B (80B total, only 3B active per token) > quant: UD-Q4_K_XL by Unsloth (MXFP4 layers removed in latest update) > speed: 112 tok/s generation, ~130 tok/s prefill > context: 262K tokens > flags: -ngl 99 -c 262144 -np 1 --cache-type-k q8_0 --cache-type-v q8_0 > engine: llama.cpp > agent: Claude Code talk to localhost:8080 (llama.cpp now has native Anthropic API endpoint. no LiteLLM needed) q8_0 KV cache cuts VRAM usage in half vs f16 at 262K. -np 1 is default but worth noting. parallel slots multiply KV cache and at 262K that's an instant OOM. the prompt was more detailed than this but you get the idea: build a space shooter with parallax backgrounds, particle systems, procedural audio, 4 enemy types, boss fights, power-up system, and ship upgrades. 8 JavaScript modules. no libraries. game's called Octopus Invaders. gameplay footage dropping next.
Sudo su166,905 Aufrufe • vor 3 Monaten

this is what a 24gb VRAM builds in 2026. one prompt. ten files. 3,483 lines of code. zero handholding. i gave Qwen3.5-35B-A3B a single detailed spec describing the full game architecture and hit enter. enemy types, particle systems, procedural audio, powerups, boss fights, ship upgrades, parallax backgrounds, everything in one message. the model planned the file structure itself, wrote every module in dependency order, wired all the imports, and served the game on port 3001. it ran on first load. when it hit a bug in collision detection it read its own error output, found the issue, fixed it, and kept building. this is pure agent loop running on local hardware. what you're looking at is pixelated octopus aliens with tentacle animations, 4 layer parallax space background with planets at different depths, a full particle system handling explosions and ink splatter and engine trails and bullet impacts, procedural audio through Web Audio API with zero sound files loaded, unleash mode with combo multiplier, boss fights every 5 levels, ship upgrades that unlock as you progress. no libraries. no frameworks. vanilla JS and Canvas. 3B active parameters. single RTX 3090. llama.cpp with q8_0 KV cache at 262K context. Claude Code pointed at localhost:8080 through the native Anthropic endpoint. no API costs. 112 tok/s. a GPU you can buy used for $800. game is called Octopus Invaders and i actually like playing it.
Sudo su153,078 Aufrufe • vor 3 Monaten

this is a laptop running a 31b parameter model at 99% gpu autonomously through hermes agent, 15 tok/s sustained, 22.8 of 24gb vram gone, 94 watts at 50c. no api keys. no rate limits. no "your prompts are being used for training". no monthly subscription. no anthropic telling me what i can and cant ask. no openai logging my work. no outages when aws goes down. just google deepmind's open weights, open source llama.cpp, nous research's hermes agent, a rog scar 18 on my desk, and 95 watts of sustained compute while it builds stuff on its own. the laptop is roaring. results incoming.
Sudo su65,567 Aufrufe • vor 1 Monat

first impressions of qwen 3.5 27B dense on a single RTX 3090. 35 tok/s. from 4K all the way to 300K+ context. no speed drop. hermes 4.3 started at 35 and degraded to 15 as context filled. qwen dense holds. MoE held 112 flat. 3x faster but only 3B of 35B active per token. architecture tradeoff. Q4_K_M on 16.7GB. native context 262K. pushed past training limit to 376K before VRAM ceiling on 24GB. tried q8 KV cache at 262K, speed collapsed to 11 tok/s. q4_0 KV is the sweet spot. flash attention mandatory. built in reasoning mode. the model thinks step by step before it answers. full chain of thought surviving Q4 quant. 1,799+ token thinking chains with self correction loops. on a single consumer GPU. gave it one prompt: "build a realtime particle galaxy simulation in one HTML file." 3,340 tokens. 95 seconds. one shot. ran on first load. full reasoning and coding in the video below. optimal config if you want to skip the hours of testing: llama-server -ngl 99 -c 262144 -fa on --cache-type-k q4_0 --cache-type-v q4_0 this is just the warmup. octopus invaders is next: 10 files, 3,400+ lines, zero steering. the prompt hermes quit at 22%. already more impressed than expected. full results coming soon.
Sudo su119,993 Aufrufe • vor 3 Monaten

been playing with hermes agent paired with qwen 3.5 dense 27B on my single 3090 since last night. there is something about this harness that caught me and i think i know what it is. i've now run five qwen configs on consumer hardware: 35B MoE (3B active) -- 112 tok/s flat across 262K context, 1x 3090 27B dense -- 35 tok/s, zero degradation across the same range, 1x 3090 qwopus 27B (opus distilled) -- 35.7 tok/s, same architecture, different brain 80B coder -- 46 tok/s on 2x 3090s, oneshotted a 564 line particle sim 80B coder -- 1.3 tok/s on 1x 3090, bleeding through RAM because it didn't fit but it still ran with same benchmarks. same prompts. same quant where possible. every config is documented. i know these models. and hermes agent is the first harness that feels like it respects that work. tool calls show inline with execution time. nvidia-smi 0.2s. write_file 0.7s. you see exactly what the agent is doing and how long each step takes. no mystery. no black box. no tool call failures so far and i've been pushing it. most agent frameworks feel like you're watching a spinner and hoping. hermes shows the work. that transparency changes how you trust the output. once you use it you see the UX decisions are not accidental. Teknium 🪽 and the nous team built this like engineers who actually use their own tools. 80 skills. 29 tools. persistent memory. context compression. runs clean on a single consumer GPU.
Sudo su100,109 Aufrufe • vor 2 Monaten

single RTX 3090. 24 GB VRAM. Qwen3.5-35B-A3B. 4-bit quant, 113 tokens per second at full 262K context harnessing Claude Code locally with no API, no subscription, no proxy. told it what it is. 30 Mamba2 layers, 10 attention, 256 experts, 8 active per token. said "build something that shows off what you can do." it visualized its own architecture. interactive. tokens flowing through layers. 256 experts lighting up on routing. served in the browser from the same GPU running inference. single prompt. then i said level up. 3D. Three.js. separate files. flythrough camera. clickable layers. it planned first, scaffolded 6 files, hit one API bug, fixed it itself, then optimized for smooth framerate. two iterations to a working 3D neural network explorer. llama.cpp just merged a native Anthropic endpoint. Claude Code points at localhost. the whole setup is two commands. no LiteLLM. no proxy config. the open source models coming out of china right now are genuinely changing what's possible on consumer hardware. respect to the Qwen team. this is acceleration.
Sudo su109,512 Aufrufe • vor 3 Monaten

look anon, those of you who kept saying local AI is not there yet, who said open source can't compete, who said you need cloud APIs to get anything serious done, look at this gameplay for one minute. every pixel on this screen was written by one model, in one shot, on a single rtx 3090 with 24gb of vram. the model is qwen 3.6 27b dense q4. the harness is hermes agent. the hardware is a single consumer card you can buy used for 900 dollars. the prompt is open source on github. every claim verifiable, on your own desk. if your local AI take is from 2024, update it. the consumer tier is shipping work that was supposed to need 8 gpus and an api key. open source moved the floor while the rest of the field was busy explaining why it cannot. 24gb tier owners are eating ramen with half boiled egg and double chocolate.
Sudo su29,528 Aufrufe • vor 23 Tagen

the tiebreaker is done. qwen 3.5 27B dense. single RTX 3090. one prompt. zero steering. zero human edits. 1,827 lines across 10 files. 13 minutes. full thinking mode. runs on first load. hermes 4.3 got the same prompt with 2x 3090s and 5x the context it needed. wrote 1,249 lines, left empty files, needed 3 interventions, game was broken on load. same architecture class. same quant. hermes got double the hardware. completely different result. dense wasn't the problem. hermes was. but here's what got me. this model thinks at 27 tok/s. every single token carries 27 billion parameters of reasoning. MoE hit 112 tok/s but only 3B active per token. the dense model is slower and it doesn't matter. watch 13 minutes of autonomous coding on a consumer GPU with zero intervention and tell me speed is what matters. a year ago this wasn't possible. now it runs on hardware you can buy used for $900. no API. no subscription. no cloud. just a 3090 doing what data centers did 18 months ago. full unedited session in the video. every token, every file, every thinking chain. 16 minutes. hit play.
Sudo su91,135 Aufrufe • vor 3 Monaten

i pointed hermes agent at nvidia's nemotron cascade 2 30B-A3B on a single RTX 3090 24GB. IQ4_XS quant by bartowski, 187 tok/s, 625K context. had it discover its own hardware, create an identity file, then build a full GPU marketplace UI from a single prompt. it one shotted it. first attempt no iteration. qwen 3.5 35B-A3B on the same hardware same 3090 24GB took an iteration to recover from a blank screen on the same type of build. 24 days between these two models releasing. same active parameters, completely different architectures and cascade 2 through hermes agent just keeps going. this model goes on and on. feast your eyes. more iterations and tests dropping soon. nvidia really cooked. no special flags needed. nvidia optimized this mamba MoE so well it just runs. flash attention auto enabled, context auto allocated. the model does the work not the config. but i compiled llama.cpp from source and i'm not sure how it performs on other engines. if you ran nemotron on any hardware drop your numbers below. RTX, AMD, Mac, whatever. model, quant, tok/s, engine. i want to see if it holds everywhere or just on llama.cpp.
Sudo su70,579 Aufrufe • vor 2 Monaten

i gave a 9B model an agentic OS and told it to build Octopus Invaders from scratch. Qwen 3.5 9B running through Hermes Agent with full tool access on a single 3060 12GB. browser, file system, code execution, terminal. it reads the prompt, plans the architecture, writes every file, and serves the game. one prompt, zero steering, 11 minutes. i've been going back and forth with this model for days now. text, code, reasoning. it handles small chats, batching, bash scripting. at 9B it's fast enough to feel interactive and genuinely useful for local workflows where you'd normally reach for an API. but this is the real test. can it architect and build a full game autonomously without a single correction? it's coding right now. results coming next.
Sudo su64,482 Aufrufe • vor 2 Monaten

i been running Qwen3.5-35B-A3B UD-Q4_K_XL through Claude Code since llama.cpp merged the Anthropic endpoint. configured it in minutes. everything was great. projects grew from single scripts to multifile systems with 8 modules and 3,000+ lines. then the chains started breaking. 3 to 5 minutes of pure autonomy and suddenly it stops. tool call fails. reprompt. it recovers. 2 minutes later it stops again. the model is fine. the harness is the bottleneck. saw a comment suggesting OpenCode. installed it. pointed it at the same localhost endpoint running the same model on the same GPU. the game is different. instead of stopping on a bad tool call it just keeps going. on wrong read it adjusts. if file not found it retries. the flow is unbroken. i watched it plan a refactor across 8 files, read every module, and start building without a single pause. in Claude Code that same task would have stopped 4 times. the tradeoff is sometimes it loops. same tool call repeated because the model loses track of what it already read. but here is the thing. i choose loops over pauses. a loop you can interrupt and redirect. a broken chain stops the flow and you have to reprompt to get it moving again. someone is solving this at the core level and i have a feeling it is the open source community. the fact that i can run this level of autonomous coding intelligence on a single consumer GPU with 24gb VRAM at 112 tokens per second. respect to the chinese labs. respect to the open source builders making this possible.
Sudo su66,824 Aufrufe • vor 3 Monaten

hear this anon you don't need a $4,699 box to get started local AI. use what you already have first. test your workload. this is what a $250 GPU did today. iteration 3 of octopus invaders is here. 4 phases. 6 prompts. zero handwritten code. the same 9B on the same 3060 fixed its own enemy spawning, patched a dual start conflict, added level progression, resized every bullet, and when the browser cached old files it figured that out on its own and added version parameters to force reload. 3,200+ lines across 13 files. every line by qwen 3.5 9B Q4 at 35-50 tok/s on 12 gigs through hermes agent. understand what your load actually needs before you build. don't get trapped by influencers selling you boxes next to a plant. test on what you have. then decide. this 3060 impressed me in ways i did not expect and its autonomy is what kept me going. now its time to move to new experiments on other nodes and other models for all of us. if you are running this setup the exact stack, flags, and open source code, exact prompts i used are in the replies. if you run into issues let me know. seeing students and builders discover hermes from my posts and start running local is why i do this. full autonomous build at 8x speed in the video. gameplay at the end. watch it.
Sudo su51,665 Aufrufe • vor 2 Monaten