
A $351 MINI PC IS RUNNING 26-BILLION-PARAMETER AI MODELS AT 20 TOKENS/SEC, WITH HERMES AGENT ON TOP OF IT

This is the Minisforum UM790 Pro: AMD Ryzen 9 7940HS, Radeon 780M iGPU, 48GB DDR5. The BIOS reports that the GPU has 4GB of VRAM.

Here's the part people get wrong. The 780M has no dedicated VRAM at all; it borrows from system RAM. Vulkan ignores the BIOS number and reads the full 48GB pool directly.

That's the whole trick: 21+ GB allocated to model weights on a "4GB" GPU. Because the models are MoE, only about 4 billion of the parameters activate per token, a fraction of the reads a dense model needs. That's why 20 tok/s works here.

Gemma 4 26B MoE holds 19.5 tok/s with 196K context. Qwen3.5-35B-A3B holds 20.8. Nemotron Cascade 2 clears 24.8. A dense 31B, by contrast, drops to 4 tok/s: it reads the entire model every token, no way around it.

On top: Hermes Agent, with full agentic workflows (terminal, file ops, web, 40+ tools) against local models only. No API keys.

The wall between you and a usable local agent used to be a GPU you couldn't afford. Now it's a BIOS setting most people never check.

Bookmark this and try it yourself.

slash1s

8,689 subscribers

35,742 views • 15 days ago • via X (Twitter)
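The post above argues that decode speed is bound by how many weight bytes must be read per token. A rough back-of-envelope sketch of that argument follows. The bandwidth and bytes-per-weight values are hypothetical placeholders, not measurements from the post, and the model ignores KV-cache reads, compute, and runtime overhead.

```python
def est_tokens_per_sec(active_params_b: float, bytes_per_weight: float,
                       bandwidth_gb_s: float) -> float:
    """Ceiling on decode speed if every active weight is read once per token."""
    gb_read_per_token = active_params_b * bytes_per_weight  # billions of params * bytes = GB
    return bandwidth_gb_s / gb_read_per_token

# Hypothetical inputs (assumptions, not from the post):
BYTES_PER_WEIGHT = 0.56   # roughly a 4-bit-class quant
BANDWIDTH_GB_S = 80.0     # usable system-RAM bandwidth

moe = est_tokens_per_sec(4.0, BYTES_PER_WEIGHT, BANDWIDTH_GB_S)     # 26B MoE, 4B active
dense = est_tokens_per_sec(31.0, BYTES_PER_WEIGHT, BANDWIDTH_GB_S)  # dense 31B

print(f"MoE (4B active): {moe:.1f} tok/s ceiling")
print(f"Dense 31B:       {dense:.1f} tok/s ceiling")
print(f"Ratio of ceilings: {moe / dense:.2f}x")
```

Whatever bandwidth is assumed, the ratio of the two ceilings depends only on the active parameter counts (31 / 4 = 7.75). The post's measured figures (19.5 vs 4 tok/s) show a smaller gap, which is unsurprising for a model this crude.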


Related Videos

HERMES AGENT NOW RUNS ON AN 8GB LAPTOP GPU JUST AS EASILY AS IT RUNS ON A 128GB MINI PC

Nous Research shipped the official Hermes Agent Desktop App this week. Someone pointed it at a local llama server running on an RTX 4060 with 16GB system RAM. The integration took two minutes.

The model behind it: Gemma 4 26B MoE, QAT quantized, running on 8GB of VRAM. A 60k token prompt held a stable 20 tokens a second, flat, no slowdown as context grew. The flags were nothing exotic, just -cmoe -c 248000 on llama.cpp.

What that 8GB setup does out of the box: reads and patches its own code, runs it in a terminal, debugs errors, manages GitHub repos, spawns sub-agents for parallel work. Browses the web with vision to debug a UI. Schedules cron jobs in plain language. Connects to Notion, Google Workspace, Linear, and Obsidian to manage tasks on its own.

That's the same agent layer running on a Minisforum MS-S1 MAX with 128GB of unified memory, 96GB of it to the GPU, holding a 120B model at 56 tokens a second instead of a 26B model at 20. Same software, same tool execution, same zero API key. The only thing that changes between an $800 laptop and a $2,000 mini PC is how big a model you can afford to run underneath it.

The barrier to running a real autonomous agent locally didn't just drop. It dropped all the way down to hardware most people already own.


NO1ennn

40,079 views • 14 days ago

i pointed hermes agent at nvidia's nemotron cascade 2 30B-A3B on a single RTX 3090 24GB. IQ4_XS quant by bartowski, 187 tok/s, 625K context. had it discover its own hardware, create an identity file, then build a full GPU marketplace UI from a single prompt. it one shotted it. first attempt no iteration. qwen 3.5 35B-A3B on the same hardware same 3090 24GB took an iteration to recover from a blank screen on the same type of build. 24 days between these two models releasing. same active parameters, completely different architectures and cascade 2 through hermes agent just keeps going. this model goes on and on. feast your eyes. more iterations and tests dropping soon. nvidia really cooked. no special flags needed. nvidia optimized this mamba MoE so well it just runs. flash attention auto enabled, context auto allocated. the model does the work not the config. but i compiled llama.cpp from source and i'm not sure how it performs on other engines. if you ran nemotron on any hardware drop your numbers below. RTX, AMD, Mac, whatever. model, quant, tok/s, engine. i want to see if it holds everywhere or just on llama.cpp.


Sudo su

70,791 views • 3 months ago

a new 8GB VRAM GPU dense Local LLM leader was born yesterday

runs on: RTX 4060 / RTX 3070 / RTX 2080. any 8GB card

Qwen 3.5 9B (dense) was the go to for 6-8GB VRAM builds. Gemma 4 12B QAT (dense) just changed that.

same llama.cpp + cuda 13.2. i7 12700H. 16GB RAM. same -ngl 99 flags. same 48k context.

unsloth gemma-4-12b-it-Q4_K_M.gguf
→ 15 tok/sec @ 48k ctx

unsloth gemma-4-12B-it-qat-UD-Q4_K_XL.gguf
→ 32 tok/sec @ 48k ctx
→ 26 tok/sec @ 64k ctx

64k context is a big deal. Hermes 3 agent requires 64k minimum to run. you're now getting full hermes compatible context on a budget consumer GPU at 26 tok/sec locally. 2.1x faster on identical hardware.

and here's the part that breaks your brain: the QAT-UD-Q4_K_XL is actually SMALLER than the Q4_K_M, despite the "XL".

why? QAT = Quantization Aware Training. Google didn't train the model first and compress it later; they trained it to be quantized from day one. the weights already know how to survive low precision. that's why you get more quality per byte.

llamacpp flags: -m gemma-4-12B-it-qat-UD-Q4_K_XL.gguf -cnv -ngl 99 -c 48000 -v

fits in 8GB VRAM clean. no API. no cloud. no subscription. and this isn't even the MTP variant yet.

Gemma-4-E2B QAT runs on 3GB RAM, E4B on 5GB, 12B on 7GB, 26-A4B on 15GB and 31B on 18GB. I have benchmarked the 26b and 31b qat as well on a single RTX 4090, check out the comments for details.

If you have a 6GB or 8GB VRAM GPU, post your numbers. more benchmarks and configs coming soon.


Alok

259,993 views • 29 days ago
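The per-variant memory figures quoted at the end of the post above lend themselves to a quick lookup. The sketch below picks the largest Gemma 4 QAT variant whose quoted footprint fits a given budget. The `headroom_gb` parameter is a hypothetical addition, and the figures do not account for context length or KV cache.

```python
# Memory figures quoted in the post (GB needed per Gemma 4 QAT variant).
QAT_MEMORY_GB = {"E2B": 3, "E4B": 5, "12B": 7, "26-A4B": 15, "31B": 18}

def largest_fit(available_gb: float, headroom_gb: float = 0.0):
    """Return the variant with the largest quoted footprint that still fits, or None."""
    budget = available_gb - headroom_gb
    fitting = [(need, name) for name, need in QAT_MEMORY_GB.items() if need <= budget]
    return max(fitting)[1] if fitting else None

for memory_gb in (4, 8, 16, 24):
    print(f"{memory_gb} GB -> {largest_fit(memory_gb)}")
```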

Qwen3.5-35B with only 3B active parameters

This MoE model runs FASTER than most 7B dense models. Tested on 3 generations of NVIDIA:
- 5090: 137 tok/s
- 4090: 112 tok/s
- 3090: 78 tok/s

The surprise? The 4090 <> 5090 gap is only 22%. With a 3B active MoE, even old GPUs fly.


stevibe

69,501 views • 3 months ago
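The generation-to-generation gaps in the post above can be checked with a few lines of arithmetic. This is only a sketch over the three quoted speeds; the helper name `pct_faster` is made up for the example.

```python
# Decode speeds quoted in the post (tok/s) for Qwen3.5-35B with 3B active parameters.
TOK_S = {"3090": 78, "4090": 112, "5090": 137}

def pct_faster(new: float, old: float) -> float:
    """Percentage by which `new` exceeds `old`."""
    return (new / old - 1.0) * 100.0

print(f"4090 vs 3090: +{pct_faster(TOK_S['4090'], TOK_S['3090']):.0f}%")
print(f"5090 vs 4090: +{pct_faster(TOK_S['5090'], TOK_S['4090']):.0f}%")
```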

nemotron 3 omni q8 on dgx spark 128gb vram cranking via hermes agent at 56 tok/s. first night of real local agentic on this box and local hits harder than i thought it would. q8 (near lossless quant, perplexity loss <1% vs fp16) running 256k context on 33 gb of unified memory, 90+ gb still free. multimodal omni weights included. hermes agent driving from telegram, talking to it from bed. speed: 56 tok/s generation, 1,300 tok/s prefill. for context, qwen 3.6 27b at q4 (heavy quant) on 3090 = 40 tok/s. nemotron at higher precision quant on spark beats qwen at lower precision quant on 3090. moe 3.5b active params architecture earns its keep. what i tested tonight: agentic tool calling works clean. ask it to check disks, it autonomously runs df -h through hermes agent. ask it to set up telegram gateway, it invokes the hermes-agent skill, walks through the prompts, completes the flow. overthinks a bit before tool calls (reasoning model trait) but lands the right move every time. researches api docs, internalizes, tests, documents. completes tasks. current models on dgx spark: 9 gguf files, 305 gb total, mix of qwen 3.6 27b dense (5 quants), nemotron omni (4 quants), deepseek v4-flash 158b q4 (the 112gb flagship test). more data coming this week as i benchmark each.


Sudo su

30,228 views • 2 months ago

I ran a 35-billion parameter AI agent on a $600 Mac mini.

Specs: M4 Mac-Mini, 16GB RAM

The model doesn't fit in RAM. It pages from the SSD at 30 tokens/second.

On NVIDIA, the same paging gives you 1.6 tok/s. Apple Silicon gives you 30. That's 18.6x faster.

No cloud. No API keys. $0/month.

Here's what it can do 🧵


thestreamingdev()

727,150 views • 3 months ago

Anyone with 8GB or 12GB VRAM setups needs to understand that "-ncmoe" is the key flag to boost performance on llama.cpp

Here are my results for Qwen3.6 35B A3B, with 64k q8_0 context on an 8GB RTX 3070Ti:

⚪️ no flag → 8.7 tok/s
RAM: 13.6GB & VRAM: 7.8GB

🔴 -ncmoe 35 → 27.5 tok/s
RAM: 12.1GB & VRAM: 4.3GB

🟢 -ncmoe 30 → 32.5 tok/s
RAM: 12GB & VRAM: 5.6GB

🔵 -ncmoe 25 → 40.9 tok/s
RAM: 12GB & VRAM: 6.9GB

Please note that the RAM and VRAM usage you see is the total usage of a Windows PC with the model running. My friend's setup: 8GB VRAM and 16GB RAM. You can boost performance by switching to Linux, just something to keep in mind.

Basically, this flag keeps the MoE experts in the first X layers on your CPU + RAM, instead of eating all your VRAM straight away. This is a smart hybrid offload approach that lets you run bigger models without OOM while keeping the rest on your GPU for speed.

As we can see in the data, there's a sweet spot. When we lower it from 35 to 25, speed bumps +50% because there are more layers on your GPU (look at the VRAM usage). The key here is to play around with the number and fit as much as possible in your VRAM; the goal is to keep 800MB to 1GB of headroom to avoid stress.


left curve dev

165,327 views • 1 month ago
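The tuning advice in the post above (fit as much as possible in VRAM while keeping some headroom) can be sketched as a small selection over the quoted measurements. The data is taken from the post. The `Run` type and `best_run` helper are hypothetical names for illustration, and 8.0 GB is assumed as the card's total VRAM.

```python
from typing import NamedTuple, Optional

class Run(NamedTuple):
    ncmoe: Optional[int]  # value passed to -ncmoe; None means the flag was not used
    tok_s: float
    vram_gb: float        # total VRAM in use, as reported in the post

# Measurements quoted in the post (8GB RTX 3070 Ti, Qwen3.6 35B A3B, 64k q8_0 context).
RUNS = [
    Run(None, 8.7, 7.8),
    Run(35, 27.5, 4.3),
    Run(30, 32.5, 5.6),
    Run(25, 40.9, 6.9),
]

def best_run(runs, vram_total_gb: float, min_headroom_gb: float):
    """Fastest run that still leaves at least `min_headroom_gb` of VRAM free."""
    ok = [r for r in runs if vram_total_gb - r.vram_gb >= min_headroom_gb]
    return max(ok, key=lambda r: r.tok_s) if ok else None

pick = best_run(RUNS, vram_total_gb=8.0, min_headroom_gb=0.8)
print(pick)
```

With an 800MB headroom requirement, the selection lands on the -ncmoe 25 run, matching the post's sweet spot. On other hardware the sweep would have to be re-measured, since the VRAM figures include everything else running on the machine.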

no prompt engineering, no agentic harness, no tool calls. just me being lazy in llama.cpp's web ui and gemma 4 31b dense taking the task seriously. i typed "create gpu marketplace cards with hardware specs and prices per hour" and the model went and coded this ui, one shot, navy bg, glassmorphism cards, neon accent buttons, realistic pricing tiers per architecture. it even wrote a "why this looks premium" explanation under the code. for context this is a q4 quant of google's 31b dense thinking model, running on a rtx 5090 mobile 24gb in the rog scar 18 at around 15 tok/s sustained, same vram tier as a 3090 or 4090 desktop, so whatever you see here translates directly to your card at home. the whole interaction was me not trying and the model reasoning harder than the prompt deserved. that tells me more about where local ai is at in april 2026 than any leaderboard score. next test drops gemma 4 into hermes agent, autonomous tool calling, multi step reasoning, real agentic loop instead of a chat window. let's see what the same model does when it gets the right environment. more experiments coming anon. octopus invaders queued. same hardware, different tasks, all published here on x and all translatable to your 24gb card. for now the video below shows it coding live, gpu going brrr.


Sudo su

28,939 views • 2 months ago

The RTX 3090 is a 5-year-old GPU and it still runs a 27B model at 20 tok/s

I tested Qwen3.5:27b across 3 generations of NVIDIA:
5090 → ~60 tok/s
4090 → ~40 tok/s
3090 → ~20 tok/s

Perfectly linear scaling. Double the generation, double the speed.


stevibe

142,379 views • 4 months ago

the tiebreaker is done. qwen 3.5 27B dense. single RTX 3090. one prompt. zero steering. zero human edits. 1,827 lines across 10 files. 13 minutes. full thinking mode. runs on first load. hermes 4.3 got the same prompt with 2x 3090s and 5x the context it needed. wrote 1,249 lines, left empty files, needed 3 interventions, game was broken on load. same architecture class. same quant. hermes got double the hardware. completely different result. dense wasn't the problem. hermes was. but here's what got me. this model thinks at 27 tok/s. every single token carries 27 billion parameters of reasoning. MoE hit 112 tok/s but only 3B active per token. the dense model is slower and it doesn't matter. watch 13 minutes of autonomous coding on a consumer GPU with zero intervention and tell me speed is what matters. a year ago this wasn't possible. now it runs on hardware you can buy used for $900. no API. no subscription. no cloud. just a 3090 doing what data centers did 18 months ago. full unedited session in the video. every token, every file, every thinking chain. 16 minutes. hit play.


Sudo su

91,135 views • 4 months ago

i'm running a 397 billion parameter model on an amd ai max box that sits on my desk and pulls less power than a gaming laptop. the model is Nex-N2-Pro, 397B-A17B, the open weight release people are putting next to gpt-5.5 on coding. i have it quantized to IQ1_M, 1.75 bits per weight, 90gb of weights loaded into the 128gb of unified memory on amd's strix halo igpu. watch the gpu in this recording. it spikes, it sustains, it does not fall over. that is the part the spec sheets never show you, not just that a 400b model loads, but that an integrated graphics chip holds the load and generates token after token, stable, no crash, no thermal cliff. and it is not a slideshow. roughly 18 tokens a second, faster than you can read. a frontier scale model producing usable output, fully local. no datacenter, no rented h100s, no api key, no permission. three years ago a model this size meant a server room and a budget to match. tonight it is a quiet box on my desk. this is the accessible tier almost nobody benchmarks honestly, and it is further along than the timeline thinks. the full breakdown is coming, rocm vs vulkan on this chip, and this little amd box head to head against the nvidia equivalent. stay tuned.


Sudo su

31,780 views • 19 days ago
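The weight footprint mentioned in the post above follows from simple arithmetic: parameters times bits per weight, divided by eight. The sketch below is a nominal estimate only; it assumes every weight is stored at exactly the stated bit width and ignores metadata.

```python
def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in decimal GB: params * bits / 8."""
    return params_billions * bits_per_weight / 8.0

size = weights_gb(397, 1.75)
print(f"397B at 1.75 bits per weight ~ {size:.1f} GB")
```

The nominal result is a little under the 90gb the post reports, which is in the same range. Quantized files commonly keep some tensors at higher precision, and this estimate does not model that.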

small local model that falls apart in bloated agents like openclaw just runs like a wild horse in hermes agent. and that's not even my line, someone else called it that, i've just been quietly pointing people at this harness for months because it held up on everything i threw at it, 3b models all the way to one trillion params. watch this happen on my own machine. i pointed hermes agent at a local http endpoint, gemma 4 12b on my 3090 llama.cpp server, and it auto-detected the model and started working immediately. no config wrestling, no broken tool calls, no babysitting the output format, i typed in a url and it just went. the whole clip is exactly that, start to finish, no errors, no retries, butter smooth. and the tool calling, the one thing that quietly breaks most local setups, works here like it's nothing. it's not the model that's flaky, it's the harness around it. hermes agent is the first agent i've run that actually gets that right. one url, one local model on one card, and it runs like a wild horse.


Sudo su

27,339 views • 29 days ago
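The post above does not say how Hermes Agent detects the model behind a URL. One common way for a client to discover what a local OpenAI-compatible server is serving is to query its `/v1/models` route, which the llama.cpp server exposes. The sketch below illustrates that general approach under that assumption; the model id in the sample is a made-up placeholder.

```python
import json
from urllib.request import urlopen

def model_ids(payload: dict) -> list:
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [m["id"] for m in payload.get("data", []) if "id" in m]

def list_models(base_url: str, timeout: float = 5.0) -> list:
    """Query GET {base_url}/v1/models and return the advertised model ids."""
    with urlopen(f"{base_url.rstrip('/')}/v1/models", timeout=timeout) as resp:
        return model_ids(json.load(resp))

# Offline example with a hand-written response of the expected shape.
sample = {"object": "list", "data": [{"id": "gemma-4-12b", "object": "model"}]}
print(model_ids(sample))
```

Against a running server, `list_models("http://localhost:8080")` would perform the actual request; the host and port here are placeholders.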

New Google Gemma 4 12B claims near-26B performance - we tested both!

We ran both models locally on one RTX 4090 and gave each the same task: write a self-contained HTML5 canvas animation with real physics in one file without libraries. Three scenes: a Galton board, two blocks colliding off a wall, and a chaotic triple pendulum.

Outputs:
- Gemma 4 26B-A4B: 15 GB VRAM usage, 6.9k tokens, 138 tok/s
- Gemma 4 12B: 9 GB VRAM usage, 8.9k tokens, 80 tok/s

Same Gemma 4 family, but the 26B-A4B won every scene and ran ~1.7x faster on just 4B active params. The 12B stayed very close though, on almost half the VRAM, which makes it the ideal model for a 16 GB laptop.


atomic.chat

151,491 views • 1 month ago

this is a laptop running a 31b parameter model at 99% gpu autonomously through hermes agent, 15 tok/s sustained, 22.8 of 24gb vram gone, 94 watts at 50c. no api keys. no rate limits. no "your prompts are being used for training". no monthly subscription. no anthropic telling me what i can and cant ask. no openai logging my work. no outages when aws goes down. just google deepmind's open weights, open source llama.cpp, nous research's hermes agent, a rog scar 18 on my desk, and 95 watts of sustained compute while it builds stuff on its own. the laptop is roaring. results incoming.


Sudo su

65,567 views • 2 months ago

six months ago this wasn't happening on 8gb vram.

running unsloth's Q4_K_XL quant of gemma 4 26b-a4b-it-qat, a sparse MoE model with only 4b active params, on a single rtx 4060 laptop gpu, 8gb vram, 20+ tok/s decode. no cloud, no api, no offload hacks. just a gaming laptop on battery.

what makes it fit: google's QAT (quantization aware training), plus MTP (multi token prediction) support in the latest llama.cpp builds. that combo is the single biggest unlock for local inference on low vram.

rtx 3060, rtx 3070, gtx 1070, gtx 1080, rtx 4050, rtx 4060, rtx 5050, rtx 5060: any 6-8gb consumer gpu, old or new, runs this model.

world cup season, so i told it to build a soccer themed flappy bird clone. one shot, zero iteration, fully playable. six months ago an 8gb model could barely clone vanilla flappy bird. now it's shipping a themed game from a sparse MoE model running locally on a laptop battery.

inference benchmarks:
- decode throughput: 30 tok/s
- context: 64k. this is the real unlock. 64k ctx is what makes a hermes agent loop viable locally on this model, not just single-turn chat.

llama.cpp flags: -m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf -c 64000 -cmoe --port 8080

game's deployed on my own site, built and shipped end to end with open source llm, zero closed source api dependency in the pipeline. link in the description. gguf weights on huggingface, link in the comments. pull it down, run it on whatever 8gb card is sitting in your rig. try the game and tell me your score and what you want in v2.

local llms on consumer gpus stopped being a meme.


Alok

59,908 views • 13 days ago
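The flags quoted in the post above can be assembled into a launch command programmatically. The sketch below assumes the binary is `llama-server`, since the post names only "llama.cpp flags" but passes `--port`. The helper name `llama_server_argv` is hypothetical.

```python
import shlex

def llama_server_argv(model_path: str, ctx: int, port: int, cpu_moe: bool = True) -> list:
    """Build an argument list from the flags quoted in the post."""
    argv = ["llama-server", "-m", model_path, "-c", str(ctx), "--port", str(port)]
    if cpu_moe:
        argv.append("-cmoe")  # keep the MoE expert weights on the CPU side
    return argv

argv = llama_server_argv("gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf", ctx=64000, port=8080)
print(shlex.join(argv))
# To actually launch the server: subprocess.run(argv, check=True)
```

Building the command as a list rather than a single string avoids shell-quoting problems when the model path contains spaces.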

i built a full game on a single GPU with a 3B model and this is the worst local AI will ever be. this was supposed to be a benchmark test. load the model, measure tokens per second, write it up, move on. instead i spent 20 minutes playing Octopus Invaders because the game is genuinely fun and i couldn't stop. a model with 3B active parameters built this from a single prompt. it debugged its own collision system when bullets were phasing through enemies. read the error, found the fix, kept building. this is not a frontier API. this is a quantized open source model running on hardware you can buy used for $800-$1200. no cloud. no subscription. no API costs. just a mass produced consumer GPU doing things that would have been absurd 12 months ago. and here's the part that should keep you up at night: every month the models get smaller and smarter. the quants get tighter. the context windows get longer. the tooling gets cleaner. what 3B active parameters does today on 24gb, a 1B model will do on 8gb within a year. you are looking at the floor. not the ceiling.

Sudo su

36,251 views • 4 months ago

got gemma 4 31B with MTP running on my DGX Spark. Hermes Agent did most of the legwork.

baseline vs MTP on GB10:
• c=1: 3.65 → 6.37 tok/s (1.74x)
• c=4: 14.34 → 23.59 tok/s (1.65x)
• c=8: 14.37 → 24.18 tok/s (1.68x)

google says "up to 2x". we're not quite there but it's real, not vapor.

stack: DGX Spark / GB10 + gemma-4-31b-it + gemma-4-31b-it-assistant (MTP drafter) + vLLM built from PR 41745

MTP is basically a lightweight draft model that predicts multiple tokens while the big model verifies them all at once. smaller model does the busywork, bigger model just says yes/no. simple idea, weird to implement.

next: tune the draft block size and see if we can push past 2x. also want to try it with Hermes Agent feeding prompts end to end.

p.s: this was all done from telegram.

cc Google DeepMind, NVIDIA AI Developer
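
The draft-then-verify idea described in the post can be sketched as a toy greedy loop. This is not vLLM's or Gemma's actual MTP implementation: `target_next` and `draft_next` are made-up stand-ins for the big model and the drafter, and the verification loop here runs position by position, where a real engine would score the whole block in one batched forward pass.

```python
from typing import List, Tuple

VOCAB = 11  # toy vocabulary size (assumption for the demo)


def target_next(ctx: List[int]) -> int:
    """Stand-in for the big model: a deterministic toy rule."""
    return (ctx[-1] * 3 + len(ctx)) % VOCAB


def draft_next(ctx: List[int]) -> int:
    """Stand-in for the drafter: matches the target except at every 4th position."""
    guess = target_next(ctx)
    return guess if len(ctx) % 4 else (guess + 1) % VOCAB


def greedy_decode(prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out


def speculative_decode(prompt: List[int], n_new: int, block: int = 3) -> Tuple[List[int], int]:
    """Greedy draft-and-verify. Returns (tokens, number of verification passes)."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < n_new:
        # 1) The drafter proposes `block` tokens autoregressively.
        proposal: List[int] = []
        for _ in range(block):
            proposal.append(draft_next(out + proposal))
        # 2) The target checks each proposed position (counted as one pass).
        passes += 1
        accepted: List[int] = []
        for tok in proposal:
            want = target_next(out + accepted)
            if tok != want:
                accepted.append(want)  # fix the first wrong guess, drop the rest
                break
            accepted.append(tok)
        else:
            # Every guess passed, so the target's own next token comes for free.
            accepted.append(target_next(out + accepted))
        out.extend(accepted)
    return out[: len(prompt) + n_new], passes
```

Because every kept token is the one the target would have picked anyway, the output matches plain greedy decoding. The gain comes only from needing fewer target passes, which is why the speedup depends on how often the drafter guesses right and on the draft block size.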

Joey

22,855 views • 2 months ago

I'm running Llama 4 Maverick at 620 t/s! I'm living in the future!

Honestly, a large language model running this fast is something straight out of a sci-fi movie. Speeds like this will enable a whole new world of applications that aren't possible today. For reference, GPT-4o, which is probably the most popular OpenAI model, runs between 60 and 110 t/s.

The secret here: I'm not running Meta's Llama 4 Maverick on a GPU. I'm using the SambaNova Cloud (my sponsor) and their custom SN40L chips. They are optimized from the ground up for running AI workflows. Right now, SambaNova Cloud runs DeepSeek, Qwen, Whisper, and the entire family of Llama models on these chips.

You can check the speed of each of these models using SambaNova Cloud's Playground (see the attached video). It's completely free, and that's how I'm measuring their speeds.

For example, I also tried DeepSeek R1 (the latest version from May) and, oh boy! DeepSeek R1 is a huge 671B parameter model. It's probably the best open reasoning model in the world, and it runs at 140 tokens per second!

Inference time on an SN40L is night and day compared with what you'll get from a GPU. Here is why this is big: if you are running an agentic workflow that uses multiple models simultaneously on a GPU, it will need to swap models in and out of memory (because not every model fits). A single SN40L chip can simultaneously hold over 100 models (trillions of parameters) in memory.

If you are using open models, try the SambaCloud API to see what lightning speed looks like. Here is how:
1. Create a free account at:
2. Check the QuickStart guide:

If you try the playground, check the speed you're getting with Llama 4 and DeepSeek, and post the results below. I've seen much higher numbers than I posted here, so I'm curious to see whether geography affects the speed.

Santiago

34,148 views • 1 year ago

THE MS-A2 IS THE MINISFORUM BOX HERMES AGENT WAS NEVER MEANT TO RUN ON, AND THAT'S BY DESIGN

The MS-S1 MAX earns its spot running Hermes Agent because of one thing: 128GB of unified memory, up to 96GB of it handed straight to the GPU. That's the only reason a 120B model fits and runs locally for $0 a month.

The MS-A2 solves a different problem. Ryzen 9 9955HX, 16 cores, 32 threads, up to 96GB of regular DDR5-5600, no unified pool.

Three M.2 PCIe 4.0 slots, one U.2, two 22110. Dual 10Gbps SFP+ LAN plus 2.5G. WiFi 6E. Bluetooth 5.3. A slide-out motherboard for fast upgrades. A real PCIe x16 slot that actually takes a low-profile GPU.

That last part is where the two machines split for good. The MS-S1 MAX's PCIe slot won't take a GPU at all; every bit of GPU power has to come from the unified chip itself. The MS-A2 trades that unified memory trick for raw expandability instead.

One box runs a local AI agent. The other runs a home lab that needs storage, networking, and room to grow. Minisforum built both on purpose, not as the same product wearing two names.

NO1ennn

25,119 views • 15 days ago

Everyone's comparing the DGX Spark to a 5090 and calling it slow. I think that's the wrong comparison.

I ran Qwen3.6 35B-A3B FP8 with the full 262K context window enabled (~96GB RAM), something gaming GPUs can't really do.

Results:
🟢 No context: 51.3 tok/s, TTFT 110ms
🟣 200K prefill: 34.6 tok/s, TTFT 85s (~2,341 tok/s prefill)

Prefill is way faster than a Mac. And 35 tok/s deep into 200K context, on a model this strong, is genuinely usable. The Spark plays a different game.
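
The 200K figures in the post are consistent with each other, which a quick calculation shows. The sketch below assumes "200K" means exactly 200,000 prompt tokens and treats time to first token as pure prefill time. The function names and the 1,000-token reply length are illustrative, not from the post.

```python
def prefill_seconds(prompt_tokens: int, prefill_tok_per_s: float) -> float:
    """Approximate time to first token: prompt length divided by prefill rate."""
    return prompt_tokens / prefill_tok_per_s


def response_seconds(prompt_tokens: int, prefill_tok_per_s: float,
                     new_tokens: int, decode_tok_per_s: float) -> float:
    """Prefill time plus the time to decode `new_tokens` at the decode rate."""
    return prefill_seconds(prompt_tokens, prefill_tok_per_s) + new_tokens / decode_tok_per_s


# Figures quoted in the post: ~2,341 tok/s prefill, 34.6 tok/s decode at 200K.
ttft = prefill_seconds(200_000, 2341.0)
total = response_seconds(200_000, 2341.0, new_tokens=1_000, decode_tok_per_s=34.6)
print(f"time to first token: {ttft:.1f} s")
print(f"1,000-token reply after a 200K prompt: {total:.1f} s")
```

The estimate lands within about half a second of the reported 85s TTFT. At this context length the wait is dominated by prefill, not decode speed.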

stevibe

33,301 views • 2 months ago