Sudo su's banner

Sudo su

@sudoingX • 32,298 subscribers

GPU/local LLM. more RAM and OSS... everywhere

Shorts

hey if you have a 3060, or any GPU with 8GB or more sitting in a drawer right now, that thing can run 9 billion parameters of intelligence autonomously. and you don't know it yet. 2 hours ago i posted that 9B hit a ceiling. 2,699 lines across 11 files. blank screen. said the limit for autonomous multifile coding on 9 billion parameters is real. then i audited every file. found 11 bugs. exact file, exact line, exact fix. duplicate variable declarations killing the script loader. a canvas reference never connected to the DOM. enemies with no movement logic. particle systems called on the class instead of the instance. fed that list as a single prompt to the same Qwen 3.5 9B on the same RTX 3060 through Hermes Agent. it fixed all 11. surgically. patch level edits across 4 files. no rewrites. no hallucinated changes. game boots. enemies spawn, move, collide. background renders. particles fire. and here's what nobody is talking about. this is a 9 billion parameter model running a full agentic framework. Hermes Agent with 31 tools. file operations, terminal, browser, code execution. not a single tool call failed. the agent chain never broke. most people think you need 70B+ for reliable tool use. this is 9B on 12 gigs doing it clean. the model didn't fail. my prompting strategy did. the ceiling is not the parameter count. the ceiling is how you prompt it. this is not done. bullets don't fire yet. boss fights need wiring. but the screen that was black 2 hours ago now has a full game rendering in real time. iterating right now. anyone with a GPU from the last 5 years should be paying attention to what is happening right now.

hey if you have a 3060, or any GPU with 8GB or more sitting in a drawer right now, that thing can run 9 billion parameters of intelligence autonomously. and you don't know it yet. 2 hours ago i posted that 9B hit a ceiling. 2,699 lines across 11 files. blank screen. said the limit for autonomous multifile coding on 9 billion parameters is real. then i audited every file. found 11 bugs. exact file, exact line, exact fix. duplicate variable declarations killing the script loader. a canvas reference never connected to the DOM. enemies with no movement logic. particle systems called on the class instead of the instance. fed that list as a single prompt to the same Qwen 3.5 9B on the same RTX 3060 through Hermes Agent. it fixed all 11. surgically. patch level edits across 4 files. no rewrites. no hallucinated changes. game boots. enemies spawn, move, collide. background renders. particles fire. and here's what nobody is talking about. this is a 9 billion parameter model running a full agentic framework. Hermes Agent with 31 tools. file operations, terminal, browser, code execution. not a single tool call failed. the agent chain never broke. most people think you need 70B+ for reliable tool use. this is 9B on 12 gigs doing it clean. the model didn't fail. my prompting strategy did. the ceiling is not the parameter count. the ceiling is how you prompt it. this is not done. bullets don't fire yet. boss fights need wiring. but the screen that was black 2 hours ago now has a full game rendering in real time. iterating right now. anyone with a GPU from the last 5 years should be paying attention to what is happening right now.

683,576 просмотров

there is so much real data just sitting in the open right now it's almost funny. four years of starlight on every star, a NASA archive that's been free for over a decade, detectors still recording the sky tonight, and barely anyone has a net pointed at any of it. so i pointed one. this is me pulling the planet data, the data loading is the boring part. the net i built to read it, the wall it hit, and what that taught me about where AI goes next, that's the full story, and it drops tonight. the data's public, the tools are free, the box fits on a desk. what's stopping you. you can just do things anon.

there is so much real data just sitting in the open right now it's almost funny. four years of starlight on every star, a NASA archive that's been free for over a decade, detectors still recording the sky tonight, and barely anyone has a net pointed at any of it. so i pointed one. this is me pulling the planet data, the data loading is the boring part. the net i built to read it, the wall it hit, and what that taught me about where AI goes next, that's the full story, and it drops tonight. the data's public, the tools are free, the box fits on a desk. what's stopping you. you can just do things anon.

60,445 просмотров

single RTX 3090. 24 GB VRAM. Qwen3.5-35B-A3B. 4-bit quant, 113 tokens per second at full 262K context harnessing Claude Code locally with no API, no subscription, no proxy. told it what it is. 30 Mamba2 layers, 10 attention, 256 experts, 8 active per token. said "build something that shows off what you can do." it visualized its own architecture. interactive. tokens flowing through layers. 256 experts lighting up on routing. served in the browser from the same GPU running inference. single prompt. then i said level up. 3D. Three.js. separate files. flythrough camera. clickable layers. it planned first, scaffolded 6 files, hit one API bug, fixed it itself, then optimized for smooth framerate. two iterations to a working 3D neural network explorer. llama.cpp just merged a native Anthropic endpoint. Claude Code points at localhost. the whole setup is two commands. no LiteLLM. no proxy config. the open source models coming out of china right now are genuinely changing what's possible on consumer hardware. respect to the Qwen team. this is acceleration.

single RTX 3090. 24 GB VRAM. Qwen3.5-35B-A3B. 4-bit quant, 113 tokens per second at full 262K context harnessing Claude Code locally with no API, no subscription, no proxy. told it what it is. 30 Mamba2 layers, 10 attention, 256 experts, 8 active per token. said "build something that shows off what you can do." it visualized its own architecture. interactive. tokens flowing through layers. 256 experts lighting up on routing. served in the browser from the same GPU running inference. single prompt. then i said level up. 3D. Three.js. separate files. flythrough camera. clickable layers. it planned first, scaffolded 6 files, hit one API bug, fixed it itself, then optimized for smooth framerate. two iterations to a working 3D neural network explorer. llama.cpp just merged a native Anthropic endpoint. Claude Code points at localhost. the whole setup is two commands. no LiteLLM. no proxy config. the open source models coming out of china right now are genuinely changing what's possible on consumer hardware. respect to the Qwen team. this is acceleration.

110,206 просмотров

this is the worst local AI will ever be. tomorrow it gets faster. next month the models get smarter. next year your GPU runs what a data center runs today. Qwen3.5-35B-A3B on a single 3090. told it to visualize its own expert routing. 256 experts, 8 active per token, rendered in 3D on the same GPU running inference. no API key. no subscription. no permission needed. closed AI isn't losing ground. it's losing the argument.

this is the worst local AI will ever be. tomorrow it gets faster. next month the models get smarter. next year your GPU runs what a data center runs today. Qwen3.5-35B-A3B on a single 3090. told it to visualize its own expert routing. 256 experts, 8 active per token, rendered in 3D on the same GPU running inference. no API key. no subscription. no permission needed. closed AI isn't losing ground. it's losing the argument.

106,710 просмотров

let me save you 3 hours of head scratching. if you're running local models like Qwen3.5-35B-A3B through Claude Code via llama.cpp's Anthropic endpoint, the chain will break every 3 to 5 minutes. tool call fails. flow stops. you reprompt. it recovers. 2 minutes later it stops again. the model is fine. the harness chokes on local inference latency. switch to OpenCode. same localhost endpoint. same model. same GPU. the chain doesn't break. the tradeoff: OpenCode sometimes loops. the model forgets what it already read and repeats the same tool call. but a loop you can interrupt. a broken chain kills your momentum and you start over. watch both side by side. proprietary agent vs open source agent. same 3B model. different failure modes. pick your poison.

let me save you 3 hours of head scratching. if you're running local models like Qwen3.5-35B-A3B through Claude Code via llama.cpp's Anthropic endpoint, the chain will break every 3 to 5 minutes. tool call fails. flow stops. you reprompt. it recovers. 2 minutes later it stops again. the model is fine. the harness chokes on local inference latency. switch to OpenCode. same localhost endpoint. same model. same GPU. the chain doesn't break. the tradeoff: OpenCode sometimes loops. the model forgets what it already read and repeats the same tool call. but a loop you can interrupt. a broken chain kills your momentum and you start over. watch both side by side. proprietary agent vs open source agent. same 3B model. different failure modes. pick your poison.

72,501 просмотров

this is the worst local ai will ever be. it only gets better from here. if you are not expanding your mind with these small models you are missing what's happening right now 99 percent tool call success rate. when steered well with the right skills and a framework like hermes agent the node becomes a cognition layer. not a chatbot. not a toy. an extension of how you think. i was cranking this node at 35 to 50 tok/s all day on personal experiments and now after all the work is done qwen 3.5 9B is iterating on its own code. the game it created. fixing its own bugs autonomously. and the part you should probably not miss is that all of this is happening on a RTX 3060. not an H100. not an A100. the card most of you have sitting in a drawer right now. if you just open that drawer and put that intelligence to work every tensor core on that card should be running for you. your work. your experiments. your thinking. you all have it but because nobody told you what this hardware can actually do in 2026 you never tried. the day it unlocks is the day you test your workload, understand the tradeoffs, debug the loops, and then decide if you need to scale the hardware. there is no point buying 3 mac studios when things done well you can squeeze a similar level of intelligence from 9B compared to 70B. but only when you create the right environment for your model through the right harness. and let me tell you i have tried claude code as a local harness. i have tried opencode. i have tried various others. somehow i landed on hermes agent and never left. there is something magical going on at Nous Research. the tool call parsers, the skills system, the way it handles small models natively. nothing else comes close for local inference. own your cognition. your AI. your agent. your prompts. your experiments. why give them away for free. those are who you are and they don't belong on someone else's servers being monitored. just give it a shot with your existing hardware. you run into a problem the community will help you. and if you are migrating from openclaw to hermes i will personally help you make the switch.

this is the worst local ai will ever be. it only gets better from here. if you are not expanding your mind with these small models you are missing what's happening right now 99 percent tool call success rate. when steered well with the right skills and a framework like hermes agent the node becomes a cognition layer. not a chatbot. not a toy. an extension of how you think. i was cranking this node at 35 to 50 tok/s all day on personal experiments and now after all the work is done qwen 3.5 9B is iterating on its own code. the game it created. fixing its own bugs autonomously. and the part you should probably not miss is that all of this is happening on a RTX 3060. not an H100. not an A100. the card most of you have sitting in a drawer right now. if you just open that drawer and put that intelligence to work every tensor core on that card should be running for you. your work. your experiments. your thinking. you all have it but because nobody told you what this hardware can actually do in 2026 you never tried. the day it unlocks is the day you test your workload, understand the tradeoffs, debug the loops, and then decide if you need to scale the hardware. there is no point buying 3 mac studios when things done well you can squeeze a similar level of intelligence from 9B compared to 70B. but only when you create the right environment for your model through the right harness. and let me tell you i have tried claude code as a local harness. i have tried opencode. i have tried various others. somehow i landed on hermes agent and never left. there is something magical going on at Nous Research. the tool call parsers, the skills system, the way it handles small models natively. nothing else comes close for local inference. own your cognition. your AI. your agent. your prompts. your experiments. why give them away for free. those are who you are and they don't belong on someone else's servers being monitored. just give it a shot with your existing hardware. you run into a problem the community will help you. and if you are migrating from openclaw to hermes i will personally help you make the switch.

58,717 просмотров

look what a single consumer GPU just built. gave Qwen3.5-35B-A3B one prompt: build a cloud GPU marketplace with pricing cards, deploy templates, and a benchmark leaderboard. it planned the layout, wrote the animations, populated the data, and served it. one shot. one HTML file. then i told it to iterate. split the hero, add a floating GPU with neural network animation. glassmorphism on the cards. done. done. done. three rounds, no confusion, no regressions. 4-bit quantized. 19.7 GB. single RTX 3090. full coding agent claude code harness running on localhost. no API calls leaving my machine. no subscription. no rate limits. earlier today i pointed it at my own production website. it curled the HTML, found every broken link, and told me "pretty shell, empty core. would not recommend." then built a better version from scratch. local inference stops being a demo when you actually steer it. the models are there. they understand intent. but you have to meet them halfway with good prompts, clear context, and real project structure. that's the skill gap now. not the models. the steering. more experiments coming. i genuinely cannot stop playing with this thing.

look what a single consumer GPU just built. gave Qwen3.5-35B-A3B one prompt: build a cloud GPU marketplace with pricing cards, deploy templates, and a benchmark leaderboard. it planned the layout, wrote the animations, populated the data, and served it. one shot. one HTML file. then i told it to iterate. split the hero, add a floating GPU with neural network animation. glassmorphism on the cards. done. done. done. three rounds, no confusion, no regressions. 4-bit quantized. 19.7 GB. single RTX 3090. full coding agent claude code harness running on localhost. no API calls leaving my machine. no subscription. no rate limits. earlier today i pointed it at my own production website. it curled the HTML, found every broken link, and told me "pretty shell, empty core. would not recommend." then built a better version from scratch. local inference stops being a demo when you actually steer it. the models are there. they understand intent. but you have to meet them halfway with good prompts, clear context, and real project structure. that's the skill gap now. not the models. the steering. more experiments coming. i genuinely cannot stop playing with this thing.

37,201 просмотров

here's how the whole thing works. claude code doesn't care what's behind the API. it just sends requests and expects responses. so i pointed it at my own machine instead of anthropic's servers. llama-server runs the model locally. LiteLLM sits in between and translates the API format. claude code thinks it's talking to claude. it's talking to qwen on localhost. the setup: 2x 3090s, 38 layers on GPU, 10 on CPU. 128K context window. generation is only 7 tok/s but the tradeoff is worth it. 128K means the agent can hold an entire project in memory without losing context midtask. claude code alone loads a 17.5K token system prompt on every request. tool definitions, safety rules, agent behavior. that's your baseline before you even say hello. pushed as far as i could tonight. what surprised me most wasn't the speed. it was the iteration quality. first prompt gave me a working particle sim. second prompt, the model read its own 564 lines, understood the architecture, and added trails, explosions, gravity wells, bloom effects. no handholding. 4bit quantized. 45GB on two consumer cards. running a full coding agent autonomously. detailed article coming. full benchmarks, hardware breakdowns, engine debugging, code quality. everything from setup to what broke and why.

here's how the whole thing works. claude code doesn't care what's behind the API. it just sends requests and expects responses. so i pointed it at my own machine instead of anthropic's servers. llama-server runs the model locally. LiteLLM sits in between and translates the API format. claude code thinks it's talking to claude. it's talking to qwen on localhost. the setup: 2x 3090s, 38 layers on GPU, 10 on CPU. 128K context window. generation is only 7 tok/s but the tradeoff is worth it. 128K means the agent can hold an entire project in memory without losing context midtask. claude code alone loads a 17.5K token system prompt on every request. tool definitions, safety rules, agent behavior. that's your baseline before you even say hello. pushed as far as i could tonight. what surprised me most wasn't the speed. it was the iteration quality. first prompt gave me a working particle sim. second prompt, the model read its own 564 lines, understood the architecture, and added trails, explosions, gravity wells, bloom effects. no handholding. 4bit quantized. 45GB on two consumer cards. running a full coding agent autonomously. detailed article coming. full benchmarks, hardware breakdowns, engine debugging, code quality. everything from setup to what broke and why.

37,623 просмотров

5 days ago it took 2 GPUs to build this. today it takes 1. same prompt. same particle simulation. completely different model. Qwen-Coder-Next (80B) on 2x 3090s. 46 tok/s. 564 lines. 2 iterations to get it working. 48GB VRAM across two cards just to hold it. Qwen3.5-35B-A3B on a single 3090. 112 tok/s. 461 lines. first try. cleaner code, fewer lines, better structured. 19.7GB on disk with 4GB VRAM to spare. half the parameters. one GPU instead of two. 2.4x faster. and the output actually improved. this is what happens when architecture catches up to ambition. Gated Delta Networks(Mamba2 variant) hybrid with sparse MoE. 3B active params out of 35B per token. efficiency at the architecture level, not just quantization. the curve isn't flattening. it's steepening.

5 days ago it took 2 GPUs to build this. today it takes 1. same prompt. same particle simulation. completely different model. Qwen-Coder-Next (80B) on 2x 3090s. 46 tok/s. 564 lines. 2 iterations to get it working. 48GB VRAM across two cards just to hold it. Qwen3.5-35B-A3B on a single 3090. 112 tok/s. 461 lines. first try. cleaner code, fewer lines, better structured. 19.7GB on disk with 4GB VRAM to spare. half the parameters. one GPU instead of two. 2.4x faster. and the output actually improved. this is what happens when architecture catches up to ambition. Gated Delta Networks(Mamba2 variant) hybrid with sparse MoE. 3B active params out of 35B per token. efficiency at the architecture level, not just quantization. the curve isn't flattening. it's steepening.

34,569 просмотров

the 24gb vram tier is enough for most builder work in 2026. gemma 4 31b dense on my rog scar 18 just autonomously built a production hero section in one prompt, one html file and 5 minutes end to end. hardware: rog scar 18, rtx 5090 laptop 24gb vram. model: google gemma 4 31b dense at q4_k_m quant, using 22.8 of 24gb. engine: llama.cpp built for blackwell (sm_120). harness: hermes agent with native tool parsing. speed: 15 tok/s sustained, 94 watts, 50c. flags i used: ./build/bin/llama-server -m ~/models/gemma4-31b/google_gemma-4-31B-it-Q4_K_M.gguf -ngl 99 -c 131072 -np 1 -fa on --cache-type-k q4_0 --cache-type-v q4_0 --jinja --host 127.0.0.1 --port 8080 if you own 24gb vram in 2026, you have enough for most ui work, most agentic coding, most autonomous builds. no subscription, no one logging your prompts. a dense open model on consumer hardware shipping real software on your desk. this was the warmup. full page next on same hardware, then the octopus invaders final multifile autonomous challenge.

the 24gb vram tier is enough for most builder work in 2026. gemma 4 31b dense on my rog scar 18 just autonomously built a production hero section in one prompt, one html file and 5 minutes end to end. hardware: rog scar 18, rtx 5090 laptop 24gb vram. model: google gemma 4 31b dense at q4_k_m quant, using 22.8 of 24gb. engine: llama.cpp built for blackwell (sm_120). harness: hermes agent with native tool parsing. speed: 15 tok/s sustained, 94 watts, 50c. flags i used: ./build/bin/llama-server -m ~/models/gemma4-31b/google_gemma-4-31B-it-Q4_K_M.gguf -ngl 99 -c 131072 -np 1 -fa on --cache-type-k q4_0 --cache-type-v q4_0 --jinja --host 127.0.0.1 --port 8080 if you own 24gb vram in 2026, you have enough for most ui work, most agentic coding, most autonomous builds. no subscription, no one logging your prompts. a dense open model on consumer hardware shipping real software on your desk. this was the warmup. full page next on same hardware, then the octopus invaders final multifile autonomous challenge.

19,576 просмотров

Videos

Anya Rossi

sweetdream.ai

SweetDream.ai•Sponsored•Livecam

Watch Anya Live

Anya is streaming live right now! Join her private show and enjoy exclusive content.

Exclusive private shows

1.2k viewers online

Private Show

Join now for exclusive access

Free preview available • Premium content

i think spacex browser game just fixed my fried focus, lol spent 20 min last night manually docking a crew dragon to the iss and my head went dead quiet for the first time in weeks. it's the real spacex sim, free, runs in the browser, no install. ease the translation in, null your rotation rate, hold the crosshairs dead center, soft capture. and that's the actual manual procedure the crew falls back on if autonomous docking ever drops. go dock something tonight.

i think spacex browser game just fixed my fried focus, lol spent 20 min last night manually docking a crew dragon to the iss and my head went dead quiet for the first time in weeks. it's the real spacex sim, free, runs in the browser, no install. ease the translation in, null your rotation rate, hold the crosshairs dead center, soft capture. and that's the actual manual procedure the crew falls back on if autonomous docking ever drops. go dock something tonight.

154,246 просмотров • 1 месяц назад

Qwopus on a single RTX 3090. Claude Opus 4.6 reasoning distilled into Qwen 3.5 27B dense, running through Claude's own coding agent (claude code). 29-35 tok/s with thinking mode on. the jinja bug that kills thinking on base Qwen doesn't carry over. harness and model matched. the base model would pause mid task on Claude Code. just stop generating. that's why i ran it through OpenCode, which handles stalled states automatically. this distilled version doesn't stall. it waits for tool outputs, reads them, selfcorrects when something breaks, and keeps going. i gave it a benchmark analysis task. went 9 minutes autonomous. wrote a README nobody asked for. zero steering. video is 5x speed but fully uncut. if you have a 3090, you can run this right now. free. no API. no subscription. opus structured reasoning on localhost. octopus invaders is next. same prompt that base qwen passed in 13 minutes and hermes 4.3 failed on 2x the hardware. i want to see if the distillation changes the outcome or just the style. more data soon.

Qwopus on a single RTX 3090. Claude Opus 4.6 reasoning distilled into Qwen 3.5 27B dense, running through Claude's own coding agent (claude code). 29-35 tok/s with thinking mode on. the jinja bug that kills thinking on base Qwen doesn't carry over. harness and model matched. the base model would pause mid task on Claude Code. just stop generating. that's why i ran it through OpenCode, which handles stalled states automatically. this distilled version doesn't stall. it waits for tool outputs, reads them, selfcorrects when something breaks, and keeps going. i gave it a benchmark analysis task. went 9 minutes autonomous. wrote a README nobody asked for. zero steering. video is 5x speed but fully uncut. if you have a 3090, you can run this right now. free. no API. no subscription. opus structured reasoning on localhost. octopus invaders is next. same prompt that base qwen passed in 13 minutes and hermes 4.3 failed on 2x the hardware. i want to see if the distillation changes the outcome or just the style. more data soon.

295,349 просмотров • 4 месяцев назад

update: qwen 3.6 27b dense q4 just one shotted octopus invaders game on a single 3090. hermes agent drove the whole thing, ~41 tok/s gen 21gb vram at full 262k context, thinking mode on. one prompt in and the canonical multi-file space shooter benchmark out, the same exact prompt i ran on qwen 3.5 27b dense back in march on the same card. 3.5 needed one external scope bug fix before the game would even load on first play. 3.6 needed nothing. 11 of 11 files written, 2411 lines of code, zero steering interventions, zero external fixes, playable on first load. 16 minutes 41 seconds wall clock from prompt to playable. consumer tier king on a single 3090 is locked tonight, and the silicon underneath my desk did not change between march and now. the open source ecosystem just moved the floor. watch it ship itself, the full 16 minutes 41 seconds sped to 3 minutes 45, no human touched the keyboard between the first prompt and the final frame.

update: qwen 3.6 27b dense q4 just one shotted octopus invaders game on a single 3090. hermes agent drove the whole thing, ~41 tok/s gen 21gb vram at full 262k context, thinking mode on. one prompt in and the canonical multi-file space shooter benchmark out, the same exact prompt i ran on qwen 3.5 27b dense back in march on the same card. 3.5 needed one external scope bug fix before the game would even load on first play. 3.6 needed nothing. 11 of 11 files written, 2411 lines of code, zero steering interventions, zero external fixes, playable on first load. 16 minutes 41 seconds wall clock from prompt to playable. consumer tier king on a single 3090 is locked tonight, and the silicon underneath my desk did not change between march and now. the open source ecosystem just moved the floor. watch it ship itself, the full 16 minutes 41 seconds sped to 3 minutes 45, no human touched the keyboard between the first prompt and the final frame.

123,781 просмотров • 2 месяцев назад

ok this is wild. 10 year old gtx 1080 8gb pascal card running qwen3 8b locally at 18-20 tok/s via hermes agent and it's actually doing the thing. asked it to build a wireworld cellular automata simulator with 10 tests. autonomous run, no hand holding. expected it to fail on the tool calls. that's not what happened. write_file works. browser_navigate works. terminal commands work. file ops, package installs, version probes, environment setup. agent is firing tool calls cleanly and the model is reasoning about next steps at 18-20 tok/s. on hardware that pre dates "agentic" as a word. it even hit an npm install fail because node 12 is too old. didn't crash. didn't ask me. just started bootstrapping nvm on its own to fix the environment. 10 minutes in. 40% context used. 7.5gb of 8gb vram occupied. still going. i did not think this would work on this hardware. this is the most i've been wrong this month.

ok this is wild. 10 year old gtx 1080 8gb pascal card running qwen3 8b locally at 18-20 tok/s via hermes agent and it's actually doing the thing. asked it to build a wireworld cellular automata simulator with 10 tests. autonomous run, no hand holding. expected it to fail on the tool calls. that's not what happened. write_file works. browser_navigate works. terminal commands work. file ops, package installs, version probes, environment setup. agent is firing tool calls cleanly and the model is reasoning about next steps at 18-20 tok/s. on hardware that pre dates "agentic" as a word. it even hit an npm install fail because node 12 is too old. didn't crash. didn't ask me. just started bootstrapping nvm on its own to fix the environment. 10 minutes in. 40% context used. 7.5gb of 8gb vram occupied. still going. i did not think this would work on this hardware. this is the most i've been wrong this month.

92,287 просмотров • 1 месяц назад

nvidia's 3B mamba destroyed alibaba's 3B deltanet on the same RTX 3090. only 24 days between releases. same active parameters, same VRAM tier, completely different architectures. nemotron cascade 2: 187 tok/s. flat from 4K to 625K context. zero speed loss. flags: -ngl 99 -np 1. that's it. no context flags, no KV cache tricks. auto-allocates 625K. qwen 3.5 35B-A3B: 112 tok/s. flat from 4K to 262K context. zero speed loss. flags: -ngl 99 -np 1 -c 262144 --cache-type-k q8_0 --cache-type-v q8_0. needed KV cache quantization to fit 262K. both models held a flat line across every context level. both architectures are context-independent. but nvidia's mamba2 is 67% faster at generating tokens on the exact same hardware and needs fewer flags to get there. same node, same GPU, same everything. the only variable is the model. gold medal math olympiad winner running at 187 tokens per second on single RTX 3090 a card from 6 years ago. nvidia cooked.

nvidia's 3B mamba destroyed alibaba's 3B deltanet on the same RTX 3090. only 24 days between releases. same active parameters, same VRAM tier, completely different architectures. nemotron cascade 2: 187 tok/s. flat from 4K to 625K context. zero speed loss. flags: -ngl 99 -np 1. that's it. no context flags, no KV cache tricks. auto-allocates 625K. qwen 3.5 35B-A3B: 112 tok/s. flat from 4K to 262K context. zero speed loss. flags: -ngl 99 -np 1 -c 262144 --cache-type-k q8_0 --cache-type-v q8_0. needed KV cache quantization to fit 262K. both models held a flat line across every context level. both architectures are context-independent. but nvidia's mamba2 is 67% faster at generating tokens on the exact same hardware and needs fewer flags to get there. same node, same GPU, same everything. the only variable is the model. gold medal math olympiad winner running at 187 tokens per second on single RTX 3090 a card from 6 years ago. nvidia cooked.

186,642 просмотров • 3 месяцев назад

first test results are in. qwen 3.6 27b dense just banged 10 out of 10 on a single rtx 3090 24gb tier at 40 tok/s. no quant tricks. no fused kernels. just q4_k_m straight cut on llama.cpp. i wrote a particle swarm benchmark this morning, fed it the prompt, and the model autonomously built a 500 particle boids flocking system. velocity driven hue, density based brightness, trail blend rendering, mouse attraction physics, click bursts, drag paint. then it used browser automation to test its own work, found the failing tests, iterated through the code, patched tests.js, and landed all 10 green on its own. i sat there hooked for 8 minutes playing. simple but mesmerizing. mouse trails build beautiful patterns, palette cycles with space, click sends particles flying, drag paints through the swarm. simplicity that hooks you. i'll open source this prompt and the build soon so anyone can reproduce it as their own benchmark. this is the first of 5 single file agent tests i wrote for this model. four more coming. octopus invaders flagship after as final. watch the full video below. see it autonomously build from one prompt. haven't slept well since this model dropped yesterday.

first test results are in. qwen 3.6 27b dense just banged 10 out of 10 on a single rtx 3090 24gb tier at 40 tok/s. no quant tricks. no fused kernels. just q4_k_m straight cut on llama.cpp. i wrote a particle swarm benchmark this morning, fed it the prompt, and the model autonomously built a 500 particle boids flocking system. velocity driven hue, density based brightness, trail blend rendering, mouse attraction physics, click bursts, drag paint. then it used browser automation to test its own work, found the failing tests, iterated through the code, patched tests.js, and landed all 10 green on its own. i sat there hooked for 8 minutes playing. simple but mesmerizing. mouse trails build beautiful patterns, palette cycles with space, click sends particles flying, drag paints through the swarm. simplicity that hooks you. i'll open source this prompt and the build soon so anyone can reproduce it as their own benchmark. this is the first of 5 single file agent tests i wrote for this model. four more coming. octopus invaders flagship after as final. watch the full video below. see it autonomously build from one prompt. haven't slept well since this model dropped yesterday.

132,605 просмотров • 2 месяцев назад

watch this anon. i gave NVIDIA's biggest model ever a single task. 100 minutes and 440,000 tokens later, it had rendered nothing. not one important thing on the screen. this is Nemotron 3 Ultra. 550 billion parameters, a hybrid Mamba Transformer MoE, the largest model NVIDIA has ever shipped, and they built it specifically for long-running agentic coding. so i handed it exactly that: build a 3D scene from a spec, multiple files, iterate until the tests pass. the same task a frontier model one shotted in minutes. i genuinely wanted to be impressed. it ran for an hour and forty. burned through 440,000 tokens. wrote every file, passed its own tests, and proudly printed "task complete."the browser was blank. the 3D scene never rendered. not once. and the long horizon agentic behavior was genuinely good. it stayed on task the whole hour and forty, wrote real multi-file code, drove its own tools without derailing. it just couldn't turn any of that into something that actually runs. here's the part that gets me. it's a text model, it cannot see its own output. so it sat there looping on a broken vision tool, trying to "look" at the page, hitting error after error, never once reasoning its way out. it declared victory on an empty screen because it had no way to know the screen was empty. to be fair, i genuinely don't know what quant the NIM was serving, so maybe some of that's on the serving, not the model. but the biggest model NVIDIA has ever made, on the exact task it was designed for, couldn't tell it had built nothing in 100 minutes. same task on a local model, below thread👇.

watch this anon. i gave NVIDIA's biggest model ever a single task. 100 minutes and 440,000 tokens later, it had rendered nothing. not one important thing on the screen. this is Nemotron 3 Ultra. 550 billion parameters, a hybrid Mamba Transformer MoE, the largest model NVIDIA has ever shipped, and they built it specifically for long-running agentic coding. so i handed it exactly that: build a 3D scene from a spec, multiple files, iterate until the tests pass. the same task a frontier model one shotted in minutes. i genuinely wanted to be impressed. it ran for an hour and forty. burned through 440,000 tokens. wrote every file, passed its own tests, and proudly printed "task complete."the browser was blank. the 3D scene never rendered. not once. and the long horizon agentic behavior was genuinely good. it stayed on task the whole hour and forty, wrote real multi-file code, drove its own tools without derailing. it just couldn't turn any of that into something that actually runs. here's the part that gets me. it's a text model, it cannot see its own output. so it sat there looping on a broken vision tool, trying to "look" at the page, hitting error after error, never once reasoning its way out. it declared victory on an empty screen because it had no way to know the screen was empty. to be fair, i genuinely don't know what quant the NIM was serving, so maybe some of that's on the serving, not the model. but the biggest model NVIDIA has ever made, on the exact task it was designed for, couldn't tell it had built nothing in 100 minutes. same task on a local model, below thread👇.

32,589 просмотров • 20 дней назад

this is what 12 gigs of VRAM built in 2026. a 9 billion parameter model running on a 5 year old RTX 3060 wrote a full space shooter from a single prompt. blank screen on first try. i came back with a bug list and the same model on the same card fixed every issue across 11 files without touching a single line myself. enemies still looked wrong so i pushed another iteration and now the game has pixel art octopi, particle effects, screen shake, projectile physics and a combo system. all running locally on a card that was designed to play fortnite. three iterations. zero cloud. zero API calls. every token generated on hardware sitting under my desk. the model reads its own code, finds what's broken, patches it, validates syntax and restarts the server. i just describe what's wrong and it handles the rest. people are paying monthly subscriptions to type into a browser tab and wait for a server farm to respond. meanwhile a GPU you can find used on ebay is running a full autonomous hermes agent framework with 31 tools, 128K context window and thinking mode generating at 29 tokens per second nonstop. the game still needs work. level upgrades don't trigger and boss fights need tuning. but the fact that i'm iterating on gameplay balance instead of debugging whether the code runs at all tells you where this is headed. every iteration the game gets better on the same hardware. same 12 gigs. same 9 billion parameters. same RTX 3060 from 5 years ago your GPU is not a gaming card anymore. it's a local AI lab that never sends your data anywhere.

this is what 12 gigs of VRAM built in 2026. a 9 billion parameter model running on a 5 year old RTX 3060 wrote a full space shooter from a single prompt. blank screen on first try. i came back with a bug list and the same model on the same card fixed every issue across 11 files without touching a single line myself. enemies still looked wrong so i pushed another iteration and now the game has pixel art octopi, particle effects, screen shake, projectile physics and a combo system. all running locally on a card that was designed to play fortnite. three iterations. zero cloud. zero API calls. every token generated on hardware sitting under my desk. the model reads its own code, finds what's broken, patches it, validates syntax and restarts the server. i just describe what's wrong and it handles the rest. people are paying monthly subscriptions to type into a browser tab and wait for a server farm to respond. meanwhile a GPU you can find used on ebay is running a full autonomous hermes agent framework with 31 tools, 128K context window and thinking mode generating at 29 tokens per second nonstop. the game still needs work. level upgrades don't trigger and boss fights need tuning. but the fact that i'm iterating on gameplay balance instead of debugging whether the code runs at all tells you where this is headed. every iteration the game gets better on the same hardware. same 12 gigs. same 9 billion parameters. same RTX 3060 from 5 years ago your GPU is not a gaming card anymore. it's a local AI lab that never sends your data anywhere.

170,305 просмотров • 4 месяцев назад

okay the fuss around hermes agent is not just air. this thing has substance. installed it on a single RTX 3090 running Qwen 3.5 27B base (Q4_K_M, 262K context, 29-35 tok/s). fully local. my machine my data. first thing i did was tell it to discover itself. find its own model weights, check its own GPU, read its own server flags, and write its own identity document. it did all of it autonomously. nvidia-smi, process grep, file writes. clean execution. the TUI is genuinely premium. dark theme, ASCII art, color coded tool calls with execution times, real time streaming. you actually enjoy watching it work. 29 tools. 80 skills (that's what it reports on boot). file ops, terminal, browser automation, code execution, cron scheduling, subagent delegation. and it has persistent memory across sessions. setup took 5 minutes. one curl install, setup wizard, point to localhost:8080/v1, done. dropping qwopus for this test btw. distilled models compress reasoning and lose precision on real coding tasks. base model only from here. more experiments coming. octopus invaders (the same game that broke qwopus) will be built using hermes agent next. comparing flow and results against claude code on the same model. if you want to run local AI agents on real hardware this one deserves a serious look.

okay the fuss around hermes agent is not just air. this thing has substance. installed it on a single RTX 3090 running Qwen 3.5 27B base (Q4_K_M, 262K context, 29-35 tok/s). fully local. my machine my data. first thing i did was tell it to discover itself. find its own model weights, check its own GPU, read its own server flags, and write its own identity document. it did all of it autonomously. nvidia-smi, process grep, file writes. clean execution. the TUI is genuinely premium. dark theme, ASCII art, color coded tool calls with execution times, real time streaming. you actually enjoy watching it work. 29 tools. 80 skills (that's what it reports on boot). file ops, terminal, browser automation, code execution, cron scheduling, subagent delegation. and it has persistent memory across sessions. setup took 5 minutes. one curl install, setup wizard, point to localhost:8080/v1, done. dropping qwopus for this test btw. distilled models compress reasoning and lose precision on real coding tasks. base model only from here. more experiments coming. octopus invaders (the same game that broke qwopus) will be built using hermes agent next. comparing flow and results against claude code on the same model. if you want to run local AI agents on real hardware this one deserves a serious look.

162,022 просмотров • 4 месяцев назад

testing Qwen3.5-35B-A3B latest optimized version by UnslothAI on a single RTX 3090. one detailed prompt. zero handholding. watch a 3B model scaffold an entire multifile game project autonomously. the setup: > model: Qwen3.5-35B-A3B (80B total, only 3B active per token) > quant: UD-Q4_K_XL by Unsloth (MXFP4 layers removed in latest update) > speed: 112 tok/s generation, ~130 tok/s prefill > context: 262K tokens > flags: -ngl 99 -c 262144 -np 1 --cache-type-k q8_0 --cache-type-v q8_0 > engine: llama.cpp > agent: Claude Code talk to localhost:8080 (llama.cpp now has native Anthropic API endpoint. no LiteLLM needed) q8_0 KV cache cuts VRAM usage in half vs f16 at 262K. -np 1 is default but worth noting. parallel slots multiply KV cache and at 262K that's an instant OOM. the prompt was more detailed than this but you get the idea: build a space shooter with parallax backgrounds, particle systems, procedural audio, 4 enemy types, boss fights, power-up system, and ship upgrades. 8 JavaScript modules. no libraries. game's called Octopus Invaders. gameplay footage dropping next.

testing Qwen3.5-35B-A3B latest optimized version by UnslothAI on a single RTX 3090. one detailed prompt. zero handholding. watch a 3B model scaffold an entire multifile game project autonomously. the setup: > model: Qwen3.5-35B-A3B (80B total, only 3B active per token) > quant: UD-Q4_K_XL by Unsloth (MXFP4 layers removed in latest update) > speed: 112 tok/s generation, ~130 tok/s prefill > context: 262K tokens > flags: -ngl 99 -c 262144 -np 1 --cache-type-k q8_0 --cache-type-v q8_0 > engine: llama.cpp > agent: Claude Code talk to localhost:8080 (llama.cpp now has native Anthropic API endpoint. no LiteLLM needed) q8_0 KV cache cuts VRAM usage in half vs f16 at 262K. -np 1 is default but worth noting. parallel slots multiply KV cache and at 262K that's an instant OOM. the prompt was more detailed than this but you get the idea: build a space shooter with parallax backgrounds, particle systems, procedural audio, 4 enemy types, boss fights, power-up system, and ship upgrades. 8 JavaScript modules. no libraries. game's called Octopus Invaders. gameplay footage dropping next.

167,035 просмотров • 4 месяцев назад

watch and let this sit for a second anon, a 35B AI model, building a complete multi-file game by itself on hermes agent. no cloud or api. not one line of code written by a human. running on a DGX spark in 2026. because not long ago this was science fiction. this is Ornith, a new open agentic coding model, running near lossless on a single DGX Spark, a 128GB box that sits on your desk. i hand it one spec and step back. it reasons about the architecture, creates the project structure, writes every module one by one, and here's the part that got me, it checks its own work as it goes. makes a call, reads the result, decides if it's right, corrects itself before moving on. no steering from me. this is what people keep underestimating about local AI. they hear "local" and think weaker, slower, a toy. watch the clip. a 35B at near full precision, planning and self correcting on a single desktop. the open models got genuinely good at agentic work, not benchmark good, actually builds the thing good. sped 5x so you can watch the whole build start to finish. this is where local AI actually is now. not coming soon. here.

watch and let this sit for a second anon, a 35B AI model, building a complete multi-file game by itself on hermes agent. no cloud or api. not one line of code written by a human. running on a DGX spark in 2026. because not long ago this was science fiction. this is Ornith, a new open agentic coding model, running near lossless on a single DGX Spark, a 128GB box that sits on your desk. i hand it one spec and step back. it reasons about the architecture, creates the project structure, writes every module one by one, and here's the part that got me, it checks its own work as it goes. makes a call, reads the result, decides if it's right, corrects itself before moving on. no steering from me. this is what people keep underestimating about local AI. they hear "local" and think weaker, slower, a toy. watch the clip. a 35B at near full precision, planning and self correcting on a single desktop. the open models got genuinely good at agentic work, not benchmark good, actually builds the thing good. sped 5x so you can watch the whole build start to finish. this is where local AI actually is now. not coming soon. here.

30,299 просмотров • 22 дней назад

this is what a 24gb VRAM builds in 2026. one prompt. ten files. 3,483 lines of code. zero handholding. i gave Qwen3.5-35B-A3B a single detailed spec describing the full game architecture and hit enter. enemy types, particle systems, procedural audio, powerups, boss fights, ship upgrades, parallax backgrounds, everything in one message. the model planned the file structure itself, wrote every module in dependency order, wired all the imports, and served the game on port 3001. it ran on first load. when it hit a bug in collision detection it read its own error output, found the issue, fixed it, and kept building. this is pure agent loop running on local hardware. what you're looking at is pixelated octopus aliens with tentacle animations, 4 layer parallax space background with planets at different depths, a full particle system handling explosions and ink splatter and engine trails and bullet impacts, procedural audio through Web Audio API with zero sound files loaded, unleash mode with combo multiplier, boss fights every 5 levels, ship upgrades that unlock as you progress. no libraries. no frameworks. vanilla JS and Canvas. 3B active parameters. single RTX 3090. llama.cpp with q8_0 KV cache at 262K context. Claude Code pointed at localhost:8080 through the native Anthropic endpoint. no API costs. 112 tok/s. a GPU you can buy used for $800. game is called Octopus Invaders and i actually like playing it.

this is what a 24gb VRAM builds in 2026. one prompt. ten files. 3,483 lines of code. zero handholding. i gave Qwen3.5-35B-A3B a single detailed spec describing the full game architecture and hit enter. enemy types, particle systems, procedural audio, powerups, boss fights, ship upgrades, parallax backgrounds, everything in one message. the model planned the file structure itself, wrote every module in dependency order, wired all the imports, and served the game on port 3001. it ran on first load. when it hit a bug in collision detection it read its own error output, found the issue, fixed it, and kept building. this is pure agent loop running on local hardware. what you're looking at is pixelated octopus aliens with tentacle animations, 4 layer parallax space background with planets at different depths, a full particle system handling explosions and ink splatter and engine trails and bullet impacts, procedural audio through Web Audio API with zero sound files loaded, unleash mode with combo multiplier, boss fights every 5 levels, ship upgrades that unlock as you progress. no libraries. no frameworks. vanilla JS and Canvas. 3B active parameters. single RTX 3090. llama.cpp with q8_0 KV cache at 262K context. Claude Code pointed at localhost:8080 through the native Anthropic endpoint. no API costs. 112 tok/s. a GPU you can buy used for $800. game is called Octopus Invaders and i actually like playing it.

153,735 просмотров • 4 месяцев назад

looks like gemma 4 12b is trying to kill my rtx 3090. pinned at 100%, 216 watts, 72 degrees, fans screaming like the thing wants to leave the case. you watch it for two seconds and you're sure the card is completely maxed, redlining, about to tap out. then you look at the memory bar. 16 gigs. out of 24. and that's at full 256k context. it's not dying, it's barely warmed up. one cheap consumer card, fully local, cooking away with 8 whole gigs to spare. everybody swears long context eats your vram alive. watch the bar. not here. and honestly the speed was never the real question. whether a 12b can hold up doing actual agentic work, that's the one running right now.

looks like gemma 4 12b is trying to kill my rtx 3090. pinned at 100%, 216 watts, 72 degrees, fans screaming like the thing wants to leave the case. you watch it for two seconds and you're sure the card is completely maxed, redlining, about to tap out. then you look at the memory bar. 16 gigs. out of 24. and that's at full 256k context. it's not dying, it's barely warmed up. one cheap consumer card, fully local, cooking away with 8 whole gigs to spare. everybody swears long context eats your vram alive. watch the bar. not here. and honestly the speed was never the real question. whether a 12b can hold up doing actual agentic work, that's the one running right now.

48,113 просмотров • 1 месяц назад

first impressions of qwen 3.5 27B dense on a single RTX 3090. 35 tok/s. from 4K all the way to 300K+ context. no speed drop. hermes 4.3 started at 35 and degraded to 15 as context filled. qwen dense holds. MoE held 112 flat. 3x faster but only 3B of 35B active per token. architecture tradeoff. Q4_K_M on 16.7GB. native context 262K. pushed past training limit to 376K before VRAM ceiling on 24GB. tried q8 KV cache at 262K, speed collapsed to 11 tok/s. q4_0 KV is the sweet spot. flash attention mandatory. built in reasoning mode. the model thinks step by step before it answers. full chain of thought surviving Q4 quant. 1,799+ token thinking chains with self correction loops. on a single consumer GPU. gave it one prompt: "build a realtime particle galaxy simulation in one HTML file." 3,340 tokens. 95 seconds. one shot. ran on first load. full reasoning and coding in the video below. optimal config if you want to skip the hours of testing: llama-server -ngl 99 -c 262144 -fa on --cache-type-k q4_0 --cache-type-v q4_0 this is just the warmup. octopus invaders is next: 10 files, 3,400+ lines, zero steering. the prompt hermes quit at 22%. already more impressed than expected. full results coming soon.

first impressions of qwen 3.5 27B dense on a single RTX 3090. 35 tok/s. from 4K all the way to 300K+ context. no speed drop. hermes 4.3 started at 35 and degraded to 15 as context filled. qwen dense holds. MoE held 112 flat. 3x faster but only 3B of 35B active per token. architecture tradeoff. Q4_K_M on 16.7GB. native context 262K. pushed past training limit to 376K before VRAM ceiling on 24GB. tried q8 KV cache at 262K, speed collapsed to 11 tok/s. q4_0 KV is the sweet spot. flash attention mandatory. built in reasoning mode. the model thinks step by step before it answers. full chain of thought surviving Q4 quant. 1,799+ token thinking chains with self correction loops. on a single consumer GPU. gave it one prompt: "build a realtime particle galaxy simulation in one HTML file." 3,340 tokens. 95 seconds. one shot. ran on first load. full reasoning and coding in the video below. optimal config if you want to skip the hours of testing: llama-server -ngl 99 -c 262144 -fa on --cache-type-k q4_0 --cache-type-v q4_0 this is just the warmup. octopus invaders is next: 10 files, 3,400+ lines, zero steering. the prompt hermes quit at 22%. already more impressed than expected. full results coming soon.

120,788 просмотров • 4 месяцев назад

i watched gemma 4 12b build something genuinely impressive today, and then loop itself to death right in front of me. the full run is in the video, sped up but completely uncut, watch it to the end and you will catch the exact moment it stops building and starts looping right in the middle of the work. the task was clean, build a single file gravity simulator, n-body physics, orbits, collisions, running locally on one 3090 through an agent. and for ten minutes it was a joy to watch. it reached for a symplectic integrator on its own, the correct one, the kind that keeps orbits stable instead of spiralling out. real gravity with softening, proper orbital velocities, momentum conserved on collision. the physics was right. the thing actually worked. then on the very last step, writing a few tests to prove its own code, it fell into a loop. not a crash, a loop. it started repeating itself and would not stop. ten more minutes, thirty four thousand tokens into a single answer, the same fragments over and over, until i killed it myself. so it's not that gemma can't code. it did the hard part beautifully. it cannot finish. it cannot hold a long task together without unravelling, and finishing is the entire job in agentic work. here's the part that stings. i run this exact task, same harness, same card, on the chinese open models, qwen especially, and i never see this. they build it, they test it, they stop. every single time. google has the raw capability, you can see it sitting right there in the code, and then the model loops itself to death on a task a 27b from alibaba finishes clean. open weights, apache 2.0, so much to love on paper. i just need it to know when to stop talking.

i watched gemma 4 12b build something genuinely impressive today, and then loop itself to death right in front of me. the full run is in the video, sped up but completely uncut, watch it to the end and you will catch the exact moment it stops building and starts looping right in the middle of the work. the task was clean, build a single file gravity simulator, n-body physics, orbits, collisions, running locally on one 3090 through an agent. and for ten minutes it was a joy to watch. it reached for a symplectic integrator on its own, the correct one, the kind that keeps orbits stable instead of spiralling out. real gravity with softening, proper orbital velocities, momentum conserved on collision. the physics was right. the thing actually worked. then on the very last step, writing a few tests to prove its own code, it fell into a loop. not a crash, a loop. it started repeating itself and would not stop. ten more minutes, thirty four thousand tokens into a single answer, the same fragments over and over, until i killed it myself. so it's not that gemma can't code. it did the hard part beautifully. it cannot finish. it cannot hold a long task together without unravelling, and finishing is the entire job in agentic work. here's the part that stings. i run this exact task, same harness, same card, on the chinese open models, qwen especially, and i never see this. they build it, they test it, they stop. every single time. google has the raw capability, you can see it sitting right there in the code, and then the model loops itself to death on a task a 27b from alibaba finishes clean. open weights, apache 2.0, so much to love on paper. i just need it to know when to stop talking.

39,574 просмотров • 1 месяц назад

i'm running a 397 billion parameter model on a amd ai max box that sits on my desk and pulls less power than a gaming laptop. the model is Nex-N2-Pro, 397B-A17B, the open weight release people are putting next to gpt-5.5 on coding. i have it quantized to IQ1_M, 1.75 bits per weight, 90gb of weights loaded into the 128gb of unified memory on amd's strix halo igpu. watch the gpu in this recording. it spikes, it sustains, it does not fall over. that is the part the spec sheets never show you, not just that a 400b model loads, but that an integrated graphics chip holds the load and generates token after token, stable, no crash, no thermal cliff. and it is not a slideshow. roughly 18 tokens a second, faster than you can read. a frontier scale model producing usable output, fully local. no datacenter, no rented h100s, no api key, no permission. three years ago a model this size meant a server room and a budget to match. tonight it is a quiet box on my desk. this is the accessible tier almost nobody benchmarks honestly, and it is further along than the timeline thinks. the full breakdown is coming, rocm vs vulkan on this chip, and this little amd box head to head against the nvidia equivalent. stay tuned.

i'm running a 397 billion parameter model on a amd ai max box that sits on my desk and pulls less power than a gaming laptop. the model is Nex-N2-Pro, 397B-A17B, the open weight release people are putting next to gpt-5.5 on coding. i have it quantized to IQ1_M, 1.75 bits per weight, 90gb of weights loaded into the 128gb of unified memory on amd's strix halo igpu. watch the gpu in this recording. it spikes, it sustains, it does not fall over. that is the part the spec sheets never show you, not just that a 400b model loads, but that an integrated graphics chip holds the load and generates token after token, stable, no crash, no thermal cliff. and it is not a slideshow. roughly 18 tokens a second, faster than you can read. a frontier scale model producing usable output, fully local. no datacenter, no rented h100s, no api key, no permission. three years ago a model this size meant a server room and a budget to match. tonight it is a quiet box on my desk. this is the accessible tier almost nobody benchmarks honestly, and it is further along than the timeline thinks. the full breakdown is coming, rocm vs vulkan on this chip, and this little amd box head to head against the nvidia equivalent. stay tuned.

32,163 просмотров • 1 месяц назад

single RTX 3090. 24 GB VRAM. Qwen3.5-35B-A3B. 4-bit quant, 113 tokens per second at full 262K context harnessing Claude Code locally with no API, no subscription, no proxy. told it what it is. 30 Mamba2 layers, 10 attention, 256 experts, 8 active per token. said "build something that shows off what you can do." it visualized its own architecture. interactive. tokens flowing through layers. 256 experts lighting up on routing. served in the browser from the same GPU running inference. single prompt. then i said level up. 3D. Three.js. separate files. flythrough camera. clickable layers. it planned first, scaffolded 6 files, hit one API bug, fixed it itself, then optimized for smooth framerate. two iterations to a working 3D neural network explorer. llama.cpp just merged a native Anthropic endpoint. Claude Code points at localhost. the whole setup is two commands. no LiteLLM. no proxy config. the open source models coming out of china right now are genuinely changing what's possible on consumer hardware. respect to the Qwen team. this is acceleration.

single RTX 3090. 24 GB VRAM. Qwen3.5-35B-A3B. 4-bit quant, 113 tokens per second at full 262K context harnessing Claude Code locally with no API, no subscription, no proxy. told it what it is. 30 Mamba2 layers, 10 attention, 256 experts, 8 active per token. said "build something that shows off what you can do." it visualized its own architecture. interactive. tokens flowing through layers. 256 experts lighting up on routing. served in the browser from the same GPU running inference. single prompt. then i said level up. 3D. Three.js. separate files. flythrough camera. clickable layers. it planned first, scaffolded 6 files, hit one API bug, fixed it itself, then optimized for smooth framerate. two iterations to a working 3D neural network explorer. llama.cpp just merged a native Anthropic endpoint. Claude Code points at localhost. the whole setup is two commands. no LiteLLM. no proxy config. the open source models coming out of china right now are genuinely changing what's possible on consumer hardware. respect to the Qwen team. this is acceleration.

110,206 просмотров • 4 месяцев назад

been playing with hermes agent paired with qwen 3.5 dense 27B on my single 3090 since last night. there is something about this harness that caught me and i think i know what it is. i've now run five qwen configs on consumer hardware: 35B MoE (3B active) -- 112 tok/s flat across 262K context, 1x 3090 27B dense -- 35 tok/s, zero degradation across the same range, 1x 3090 qwopus 27B (opus distilled) -- 35.7 tok/s, same architecture, different brain 80B coder -- 46 tok/s on 2x 3090s, oneshotted a 564 line particle sim 80B coder -- 1.3 tok/s on 1x 3090, bleeding through RAM because it didn't fit but it still ran with same benchmarks. same prompts. same quant where possible. every config is documented. i know these models. and hermes agent is the first harness that feels like it respects that work. tool calls show inline with execution time. nvidia-smi 0.2s. write_file 0.7s. you see exactly what the agent is doing and how long each step takes. no mystery. no black box. no tool call failures so far and i've been pushing it. most agent frameworks feel like you're watching a spinner and hoping. hermes shows the work. that transparency changes how you trust the output. once you use it you see the UX decisions are not accidental. Teknium 🪽 and the nous team built this like engineers who actually use their own tools. 80 skills. 29 tools. persistent memory. context compression. runs clean on a single consumer GPU.

been playing with hermes agent paired with qwen 3.5 dense 27B on my single 3090 since last night. there is something about this harness that caught me and i think i know what it is. i've now run five qwen configs on consumer hardware: 35B MoE (3B active) -- 112 tok/s flat across 262K context, 1x 3090 27B dense -- 35 tok/s, zero degradation across the same range, 1x 3090 qwopus 27B (opus distilled) -- 35.7 tok/s, same architecture, different brain 80B coder -- 46 tok/s on 2x 3090s, oneshotted a 564 line particle sim 80B coder -- 1.3 tok/s on 1x 3090, bleeding through RAM because it didn't fit but it still ran with same benchmarks. same prompts. same quant where possible. every config is documented. i know these models. and hermes agent is the first harness that feels like it respects that work. tool calls show inline with execution time. nvidia-smi 0.2s. write_file 0.7s. you see exactly what the agent is doing and how long each step takes. no mystery. no black box. no tool call failures so far and i've been pushing it. most agent frameworks feel like you're watching a spinner and hoping. hermes shows the work. that transparency changes how you trust the output. once you use it you see the UX decisions are not accidental. Teknium 🪽 and the nous team built this like engineers who actually use their own tools. 80 skills. 29 tools. persistent memory. context compression. runs clean on a single consumer GPU.

100,168 просмотров • 4 месяцев назад

watch gemma 4 12b q8 dancing on a single rtx 3090 at 33 tokens a second average. google dropped this two days ago and it's the kind of thing that quietly moves the floor. a fully multimodal model, text image and audio in one net, 256k context, apache licensed, running entirely on one consumer gpu, no one metering your tokens. what you're watching is the whole loop live: the server streaming tokens top left, the gpu pegged bottom left, the answer landing on the right. all local, all mine. a year ago this needed someone else's datacenter. today it's a card you can buy. open source isn't catching up anymore, it's setting the pace. how fast does yours run?

watch gemma 4 12b q8 dancing on a single rtx 3090 at 33 tokens a second average. google dropped this two days ago and it's the kind of thing that quietly moves the floor. a fully multimodal model, text image and audio in one net, 256k context, apache licensed, running entirely on one consumer gpu, no one metering your tokens. what you're watching is the whole loop live: the server streaming tokens top left, the gpu pegged bottom left, the answer landing on the right. all local, all mine. a year ago this needed someone else's datacenter. today it's a card you can buy. open source isn't catching up anymore, it's setting the pace. how fast does yours run?

36,614 просмотров • 1 месяц назад

this is a laptop running a 31b parameter model at 99% gpu autonomously through hermes agent, 15 tok/s sustained, 22.8 of 24gb vram gone, 94 watts at 50c. no api keys. no rate limits. no "your prompts are being used for training". no monthly subscription. no anthropic telling me what i can and cant ask. no openai logging my work. no outages when aws goes down. just google deepmind's open weights, open source llama.cpp, nous research's hermes agent, a rog scar 18 on my desk, and 95 watts of sustained compute while it builds stuff on its own. the laptop is roaring. results incoming.

this is a laptop running a 31b parameter model at 99% gpu autonomously through hermes agent, 15 tok/s sustained, 22.8 of 24gb vram gone, 94 watts at 50c. no api keys. no rate limits. no "your prompts are being used for training". no monthly subscription. no anthropic telling me what i can and cant ask. no openai logging my work. no outages when aws goes down. just google deepmind's open weights, open source llama.cpp, nous research's hermes agent, a rog scar 18 on my desk, and 95 watts of sustained compute while it builds stuff on its own. the laptop is roaring. results incoming.

65,567 просмотров • 3 месяцев назад