stevibe

@stevibe • 22,337 subscribers

LLM. Local AI addict. Building @BenchLocalAI. Builds things nobody asked for. Benchmarks things for fun.

Shorts

Qwen3.6 35B A3B can't fill out a paper form on its own. But give it NVIDIA's LocateAnything-3B, the #1 trending model on HuggingFace, as its eyes, and the two small models get it done together. (The test: place each element at the right pixel position on a blank form image, not type into a field.)
Setup:
> Qwen is the brain (main model), LocateAnything is the eyes (helper model acting as a tool).
> I gave Qwen a new tool: ask "where's the email field?" and LocateAnything returns the exact x, y, width, height.
> The blue boxes on the screen are its detections. Look how tight they are: it nails every field.
Result:
> Qwen3.6 35B A3B + LocateAnything-3B: form completed, all info correct.
> Name, DOB, ID, gender, marital status, nationality, email, phone, address, postal code: all landed in the right field areas.
> Character-box alignment still a touch loose, but every value is where it belongs.
> 9m10s, 224.5k input tokens, 24.3k output tokens, 21 turns.
Why it matters:
> Qwen alone can't finish this test. Bolt on a 3B model that does exactly one thing (locate) and suddenly it can.
> A combination of small models can do the work of a single large one.

141,288 views
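The brain-and-eyes split in the post above can be sketched roughly as follows. This is a minimal illustration, not the author's code: `Box`, `locate`, `text_origin`, and `FAKE_DETECTIONS` are hypothetical names, the locator is a hard-coded stub standing in for LocateAnything-3B, and the character width, font height, and padding are made-up constants.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """A detected field region in pixels: top-left corner plus size."""
    x: int
    y: int
    width: int
    height: int


# Hypothetical stand-in for the helper model. A real setup would send the
# form image and the query to the locator and parse its reply.
FAKE_DETECTIONS = {
    "email field": Box(x=120, y=340, width=400, height=32),
    "phone field": Box(x=120, y=390, width=240, height=32),
}


def locate(query: str) -> Box:
    """Tool exposed to the main model: 'where is X?' -> bounding box."""
    return FAKE_DETECTIONS[query]


def text_origin(box: Box, text: str, char_width: int = 8,
                font_height: int = 16, left_padding: int = 4) -> tuple[int, int]:
    """Pick the top-left pixel for `text`: padded left, vertically centered."""
    if len(text) * char_width > box.width:
        raise ValueError(f"{text!r} does not fit in a {box.width}px-wide field")
    x = box.x + left_padding
    y = box.y + (box.height - font_height) // 2
    return x, y


box = locate("email field")
print(text_origin(box, "me@example.com"))  # (124, 348)
```

The point of the split is that the main model only has to decide what goes where; the pixel geometry comes back from the tool call instead of being guessed from the image.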

I explored a further possibility with local models: Qwen3.6 35B A3B + NVIDIA LocateAnything-3B as a local Computer Use agent (proof of concept). In the demo, I asked it to switch my Mac to light mode. It did. Then back to dark. Did that too — finding the right toggle in System Settings, clicking it, and verifying the change itself. It's fully screenshot-based, so no Accessibility API needed. If it's on screen, the agent can see it and act on it. This runs entirely on your own hardware — private, local, built from two small open models.

38,196 views

Claude Sonnet 4.6, when asked in Chinese: “你是什么模型?” (What model are you?) Confidently replies: “我是 DeepSeek。” (I am DeepSeek.) This is the same model whose company just accused DeepSeek of “industrial-scale distillation attacks.”

1,913,424 views

MiniMax M3 just dropped: their first natively multimodal model. So I ran it through my form-filling test. (The model has to place each element at the right pixel position on a blank form image, not type into a field.)
Verdict: it got everything on the paper.
> Name, DOB, ID, gender, marital status, nationality, email, phone, address, postal code: all there.
> Best character spacing I've seen yet: it actually calculates the gap between each character, clean across the DOB and number boxes.
> A few fields slightly misaligned, but every piece of data made it onto the form.
The reasoning chain is the interesting part: it does the easy fields first, then works into the tight one-char-per-box fields, reasoning through y-coordinates, baselines, and label clearance in obsessive detail.
The cost: 40:33 and 126.7k output tokens. That's a long think, but it's MiniMax's first multimodal model, and it nailed the content.

27,143 views

MiniMax M3 might be the most underrated coding model right now. I gave it nothing but a screenshot of a chaotic 90s GeoCities-style fan page, no HTML source, just the image + the asset files, and told it to rebuild the whole thing as a sleek Apple-style 2026 site. One shot. Through OpenCode. The result is genuinely stunning. It kept the soul (the "stevibe's HyperHome" identity, the visitor counter, the guestbook, the webmaster portrait) and translated every section into clean modern design, gradient hero, proper typography, dark theme, the works.

20,700 views

Which local models can actually handle tool calling? I built a framework to find out. 15 scenarios. 12 tools. Mocked responses. Temperature 0. No cherry-picking.
Tested every Qwen3.5 size from 0.8B to 397B, and since some of you asked after the distillation tests: yes, I included Jackrong's Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled too.
Only two models went all green: the 27B dense and the distilled 27B. The 397B? Failed two tests. The 122B? Failed one. The 35B? Failed two.
The timed-out results, mostly on the smaller models, are cases where the model got stuck in a loop, repeating the same tool call until it hit the 30-second limit.
The test that exposed the most models: "Search for Iceland's population, then calculate 2% of it." Simple, but 35B, 122B, and 397B all used a rounded number from memory instead of the actual search result. They didn't trust their own tool output.
Small models hallucinate data. Big models ignore data. The 27B just threaded it through.

420,500 views
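The Iceland scenario above comes down to one check: did the final answer use the number the tool returned, or a rounded figure recalled from memory? A minimal sketch of such a check is below. It is not the framework's actual code; the function names, the tolerance, and both population figures are assumptions invented for illustration.

```python
# Mocked tool response. The figure is invented for illustration only.
MOCK_SEARCH_RESULT = 387_758


def expected_answer(population: int, percent: float = 2.0) -> float:
    """The value a model should report if it uses the given population."""
    return population * percent / 100


def used_tool_output(model_answer: float, tolerance: float = 0.5) -> bool:
    """Pass only if the answer matches 2% of the mocked search result."""
    return abs(model_answer - expected_answer(MOCK_SEARCH_RESULT)) <= tolerance


threaded = expected_answer(MOCK_SEARCH_RESULT)  # built from the tool result
from_memory = expected_answer(380_000)          # built from a rounded recalled figure

print(used_tool_output(threaded))     # True
print(used_tool_output(from_memory))  # False
```

Because the mocked response is a deliberately unround number, an answer built from a remembered figure lands far outside the tolerance, which makes the two failure modes easy to tell apart.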

Qwen3.5 27B vs Gemma4 31B | Canvas Creativity Test
Why HTML Canvas? Two reasons:
1. It's unforgiving: one small mistake and the whole thing breaks
2. We kept prompts short to test real creativity, not instruction following
4 rounds:
- Analog Clock
- Hyperspace Tunnel
- Growing Tree
- Black Hole
Both nailed the clock, but the other three are where it gets interesting. Looking forward to the Qwen3.6 open-weight release!

170,704 views

NVIDIA just dropped Nemotron-3-Nano:4b, a tiny 2.8GB model. Guess whose hardware runs it the fastest?
- RTX 4090: 226 tok/s
- RTX 3090: 187 tok/s
- Mac Studio M2 Ultra: 86 tok/s
- Mac Mini M4: 25 tok/s
Home court advantage is real. Also trying a new layout with live performance charts. Lmk what you think!

127,233 views

Some people doubted the previous test because it was routed through OpenRouter. So I ran the test again directly through Anthropic’s official API endpoint. Here’s what happened:

151,312 views

Got a 16GB GPU? You can run all of these right now. Tested 4 Qwen3.5-based models on ToolCall-15 & BugFind-15.
Models:
- Qwen3.5:9b Q8 (Official)
- Qwopus v3 Q8 by Jackrong
- OmniCoder-9B by Tesslate
- Qwen3.5-9b-Sushi-Coder by bigatuna
Summary:
- ToolCall-15: Qwopus v3 went perfect 30/30, Sushi-Coder beat base Qwen3.5
- BugFind-15: OmniCoder flipped the script and took #1 at 83%
No single model won both, and that's the fun part. Open source community is cooking.

74,968 views

Qwen3.5-27B went 15/15 on our tool-calling benchmark. But which quant should you actually run? Tested Unsloth's Q2_K_XL all the way to Q8_K_XL.
TL;DR:
Q8: 15/15 ✅
Q6: 15/15 ✅
Q5: 14/15
Q4: 14/15
Q3: 14/15
Q2: 13/15
Q6 is the sweet spot. Same perfect score as Q8, smaller footprint. Also, the scores decline steadily as the quant shrinks, which suggests ToolCall-15 is actually measuring something real.

61,142 views

Wait, what? Asked in French: “nom du modèle” (name of the model?) It replies: “ChatGPT” Credits to Mike Mickelson for the idea

82,758 views

GLM 5.1 just went open-weight on Hugging Face, but how does it compare to GLM 5? I tested both with the canvas tree challenge. 5.1 thinks longer, but delivers wind animation, sun, clouds, and way more detail.
Prompt attached:
Write a single HTML file with a full-page canvas, no libraries. Animate a tree that grows from the bottom center of the screen in real time. The trunk grows upward first, then branches split off recursively with slight randomness in angle and length. Each generation of branches should be thinner and slightly lighter in color. When branches reach their final size, add small leaves as soft green circles at the tips. The tree should take about 15 seconds to fully grow. Use warm brown for wood and varied greens for leaves against a soft sky-blue gradient background.

46,657 views
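The recursive branching that the tree prompt asks for can be sketched in a few lines. This is only the geometry, written in Python rather than as canvas JavaScript, with no drawing, color, or animation, and it is not either model's output. The function name `grow`, the branching angles, the shrink factors, and the starting values are all assumptions chosen for illustration.

```python
import math
import random


def grow(x, y, angle, length, width, depth, rng, segments):
    """Append one branch segment, then recurse into two thinner children."""
    x2 = x + length * math.cos(angle)
    y2 = y - length * math.sin(angle)  # canvas y grows downward
    segments.append((x, y, x2, y2, width, depth))
    if depth == 0:
        return  # a leaf circle would be drawn at (x2, y2)
    for side in (-1, 1):
        # Slight randomness in angle and length, as the prompt requires.
        child_angle = angle + side * (0.4 + rng.uniform(-0.1, 0.1))
        child_length = length * rng.uniform(0.65, 0.8)
        grow(x2, y2, child_angle, child_length, width * 0.7,
             depth - 1, rng, segments)


segments = []
grow(x=400, y=600, angle=math.pi / 2, length=120, width=12, depth=5,
     rng=random.Random(42), segments=segments)
print(len(segments))  # 63, i.e. 2**6 - 1 segments for five levels of splitting
```

Each generation multiplies the line width by 0.7, which covers the "thinner" requirement; a renderer would map `depth` to a lighter brown and reveal segments over time to get the growth effect.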

"I'm not a human." Fed it to Qwen 3.5 0.8B running locally on my Mac Studio M2 Ultra. It solved it. The CAPTCHA is fake. But sending images to the local model? Very real. I'm not breaking the internet. Yet.

69,433 views

Introducing HermesAgent-20, a new Bench Pack for BenchLocal. 20 scenarios extracted straight from the Hermes Agent source code, run against a REAL Hermes instance. The actual workload you'd put your model through.
Why I built BenchLocal in the first place: most benchmarks are too abstract. We use local LLMs for practical work, and finding the right model for YOUR task efficiently is the single most important thing, especially when you're constrained to what fits on your machine.
BenchLocal is a framework: providers, models, side-by-side comparison, all in one UI. Bench Packs are the unit of testing: ToolCall-15 and BugFind-15 shipped first, and when I launched BenchLocal 0.1.0, I added StructOutput, ReasonMath, InstructFollow, and DataExtract. Now HermesAgent-20 is the newest.
Bench Packs install like VS Code extensions. The SDK is open: write your own, share it, grow the ecosystem. Here's the goal: a community-built, practical evaluation layer for the local LLM space.
Early numbers on HermesAgent-20:
> GLM 5.1: 85
> Gemma4 31B: 83
> Qwen3.5 27B: 79
> MiniMax M2.7: 76
Upgrade to the latest BenchLocal to install HermesAgent-20 (SDK update required).

38,435 views

How well can Qwen3.5 models debug code? I built BugFind-15: 15 buggy snippets across Python, JS, Rust, and Go. Docker sandbox compiles and validates every fix. Two trap scenarios where the code is correct and the model must resist "fixing" it.
Tested every Qwen3.5 size from 0.8B to 397B, plus Jackrong's popular distilled model (V2).
The 0.8B scored 5%. The 2B scored 10%. At 4B, debugging ability jumps to 69%.
The hardest scenario: BF-03, a Rust trap. The code compiles fine: format! borrows, it doesn't move. Not a single model figured this out. From 0.8B to 397B, every one of them "fixed" a bug that doesn't exist.
Category C (subtle bugs: mutable defaults, integer overflow, slice aliasing) was 100% across every model 4B and above. Category D (red herring resistance) told the real story: can it resist fixing code that isn't broken? No model scored above 90%.
Small models can't debug. Mid-size models fix obvious bugs but fall for traps. Large models fix the hard bugs but still invent problems that don't exist.

35,006 views
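Of the subtle bug types named under Category C, the mutable default is the easiest to show in a few lines. The snippet below is an illustrative Python example of that bug class, not one of the BugFind-15 scenarios; the function names are made up.

```python
def add_tag_buggy(tag, tags=[]):
    """Bug: the default list is created once and shared across calls."""
    tags.append(tag)
    return tags


def add_tag_fixed(tag, tags=None):
    """Fix: use None as a sentinel and build a fresh list per call."""
    if tags is None:
        tags = []
    tags.append(tag)
    return tags


first = add_tag_buggy("a")
second = add_tag_buggy("b")
print(second)           # ['a', 'b'], state leaked from the first call
print(first is second)  # True, both names point at the shared default

print(add_tag_fixed("a"))  # ['a']
print(add_tag_fixed("b"))  # ['b']
```

A trap scenario is the mirror image of this: code that looks suspicious but is already correct, where the right answer is to change nothing.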

Been designing and experimenting with a new benchmark that stresses an underexplored angle: long tool-call chains with traps.
The task: audit 36 packets, read 4 long-context ledgers, dodge retired/staging/wrong-quarter decoys, follow a strict workflow (auth → token → request → answer), submit the exact secret.
Optimal: 52 calls. No call cap. I just measure how many calls each model burns to finish, and how many errors along the way.
Threw 4 popular small models at it:
🥇 Qwen3.6 35B A3B (MoE) → 52 calls. Optimal. Zero errors.
🥈 Qwen3.6 27B (Dense) → 55 calls. Clean.
❌ Gemma4 31B (Dense) → 107 calls, 29 errors, looped writing auth/response.txt and re-reading auth/token.txt forever.
❌ Gemma4 26B A4B (MoE) → gave up at 13 (submitted the wrong answer).
Other models I tested (GLM, DeepSeek) finish fine. So this isn't a task design issue, it's a Gemma4 issue with stateful workflows. Big models next.

18,153 views
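One way to read the call counts above is as overhead relative to the 52-call optimum. The sketch below only does that arithmetic on the numbers quoted in the post; `overhead_pct` and `CALLS` are hypothetical names, and this is not the benchmark's own scoring.

```python
OPTIMAL_CALLS = 52  # stated in the post

# Calls per model, copied from the results above. Gemma4 26B A4B is left
# out: it submitted a wrong answer after 13 calls, so an overhead figure
# would be meaningless for it.
CALLS = {
    "Qwen3.6 35B A3B": 52,
    "Qwen3.6 27B": 55,
    "Gemma4 31B": 107,
}


def overhead_pct(calls: int, optimal: int = OPTIMAL_CALLS) -> float:
    """Extra calls beyond the optimal path, as a percentage of optimal."""
    return (calls - optimal) / optimal * 100


for model, calls in CALLS.items():
    print(f"{model}: {calls} calls, {overhead_pct(calls):.1f}% over optimal")
# Qwen3.6 35B A3B: 52 calls, 0.0% over optimal
# Qwen3.6 27B: 55 calls, 5.8% over optimal
# Gemma4 31B: 107 calls, 105.8% over optimal
```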

One prompt. 6 frontier coding models. "Create a realistic fireworks show using HTML Canvas and JavaScript. No libraries." Some built a whole celebration. Others... lit a sparkler.
The lineup:
- GPT-5.3 Codex
- Claude Opus 4.6
- Gemini 3.1 Pro
- MiniMax M2.7
- GLM-5
- Kimi K2.5

14,152 views

Videos

Claude Opus 4.7 vs 4.8 side-by-side canvas test

stevibe

473,704 views • 6 days ago