stevibe

@stevibe • 27,285 subscribers

LLM. Local AI addict. Building @BenchLocalAI. Builds things nobody asked for. Benchmarks things for fun.

Shorts

Tested Kimi's new K3 against GPT-5.6-SOL on a tricky front-end prompt with an image reference: "Single HTML file, canvas animation: 360° rotating iPhone that disassembles into an exploded view mid-rotation, pauses 2 seconds, reassembles. Simulated 3D perspective, Apple-style aesthetic, no external libraries." Neither output is perfect, but K3's version has a cleaner exploded view. Impressive.

53,857 views

Claude Sonnet 4.6, when asked in Chinese: “你是什么模型？” (What model are you?) Confidently replies: “我是 DeepSeek。” (I am DeepSeek) This is the same model whose company just accused DeepSeek of “industrial-scale distillation attacks”

1,928,269 views

Which local models can actually handle tool calling? I built a framework to find out. 15 scenarios. 12 tools. Mocked responses. Temperature 0. No cherry-picking.
Tested every Qwen3.5 size from 0.8B to 397B, and since some of you asked after the distillation tests: yes, I included Jackrong's Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled too.
Only two models went all green: the 27B dense and the distilled 27B. The 397B? Failed two tests. The 122B? Failed one. The 35B? Failed two.
The timed-out results, mostly on the smaller models, are cases where the model got stuck in a loop, repeating the same tool call until it hit the 30-second limit.
The test that exposed the most models: "Search for Iceland's population, then calculate 2% of it." Simple, but 35B, 122B, and 397B all used a rounded number from memory instead of the actual search result. They didn't trust their own tool output.
Small models hallucinate data. Big models ignore data. The 27B just threaded it through.

428,772 views
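
One way to check the "threaded it through" behavior in the Iceland test is to look at what the model actually passed to the calculator. The sketch below is only an illustration: the transcript format, the tool names, and the mocked population figure are all assumptions, not taken from the actual framework.

```python
# Hypothetical transcript format and mocked value; not the actual framework.
MOCKED_POPULATION = 372520  # made-up figure returned by the mocked search tool


def used_tool_output(tool_calls, expected_value):
    """Return True if any calculator call contains the exact mocked figure."""
    for call in tool_calls:
        if call["name"] != "calculator":
            continue
        # Strip thousands separators so "372,520" and "372520" both match.
        expression = call["arguments"]["expression"].replace(",", "")
        if str(expected_value) in expression:
            return True
    return False


threaded = [
    {"name": "web_search", "arguments": {"query": "Iceland population"}},
    {"name": "calculator", "arguments": {"expression": "372520 * 0.02"}},
]
rounded = [
    {"name": "web_search", "arguments": {"query": "Iceland population"}},
    {"name": "calculator", "arguments": {"expression": "380000 * 0.02"}},
]

print(used_tool_output(threaded, MOCKED_POPULATION))  # True
print(used_tool_output(rounded, MOCKED_POPULATION))   # False
```

A plain substring match is the simplest possible check. A stricter version would parse the numbers out of the expression so that a longer number containing the same digits does not count as a match.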

Qwen3.6 35B A3B can't fill out a paper form on its own. But give it NVIDIA's LocateAnything-3B, the #1 trending model on Hugging Face, as its eyes, and the two small models get it done together. (The test: place each element at the right pixel position on a blank form image, not type into a field.)
Setup:
> Qwen is the brain (main model), LocateAnything is the eyes (helper model acting as a tool).
> I gave Qwen a new tool: ask "where's the email field?" and LocateAnything returns the exact x, y, width, height.
> The blue boxes on the screen are its detections. Look how tight they are: it nails every field.
Result:
> Qwen3.6 35B A3B + LocateAnything-3B: form completed, all info correct.
> Name, DOB, ID, gender, marital status, nationality, email, phone, address, postal code: all landed in the right field areas.
> Character-box alignment still a touch loose, but every value is where it belongs.
> 9m10s, 224.5k input, 24.3k output, 21 turns.
Why it matters:
> Qwen alone can't finish this test. Bolt on a 3B model that does exactly one thing, locate, and suddenly it can.
> A combination of small models can do the work of a single large one.

148,631 views
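
To make the "eyes as a tool" idea concrete, here is a rough sketch of how a box returned as x, y, width, height could be turned into drawing positions. The `Box` type, both helper functions, the padding value, and the sample coordinates are hypothetical; the post does not show the real tool schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Bounding box in pixels, in the x, y, width, height form the post mentions."""
    x: int
    y: int
    width: int
    height: int


def text_anchor(box: Box, padding: int = 4) -> tuple[int, int]:
    """Left-aligned, vertically centered point inside the detected field."""
    return (box.x + padding, box.y + box.height // 2)


def char_centers(box: Box, n_chars: int) -> list[tuple[int, int]]:
    """Centers of n equal-width cells, for one-character-per-box fields."""
    cell = box.width / n_chars
    cy = box.y + box.height // 2
    return [(round(box.x + cell * (i + 0.5)), cy) for i in range(n_chars)]


email_box = Box(x=120, y=340, width=400, height=30)  # hypothetical detection
print(text_anchor(email_box))                  # (124, 355)
print(char_centers(Box(100, 50, 80, 20), 4))   # [(110, 60), (130, 60), (150, 60), (170, 60)]
```

Splitting the box into equal cells is one simple way to approach the loose character-box alignment noted in the result, assuming the detected box covers the whole row of character cells.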

Qwen3.5 27B vs Gemma4 31B | Canvas Creativity Test
Why HTML Canvas? Two reasons:
1. It's unforgiving: one small mistake and the whole thing breaks
2. We kept prompts short to test real creativity, not instruction following
4 rounds:
- Analog Clock
- Hyperspace Tunnel
- Growing Tree
- Black Hole
Both nailed the clock, but the other three are where it gets interesting.
Looking forward to the Qwen3.6 open-weight release!

170,768 views

Some people doubted the previous test because it was routed through OpenRouter. So I ran the test again directly through Anthropic’s official API endpoint. Here’s what happened:

154,891 views

NVIDIA just dropped Nemotron-3-Nano:4b, a tiny 2.8GB model. Guess whose hardware runs it the fastest?
- RTX 4090: 226 tok/s
- RTX 3090: 187 tok/s
- Mac Studio M2 Ultra: 86 tok/s
- Mac Mini M4: 25 tok/s
Home court advantage is real.
Also trying a new layout with live performance charts. Lmk what you think!

127,448 views

I explored a further possibility with local models: Qwen3.6 35B A3B + NVIDIA LocateAnything-3B as a local Computer Use agent (proof of concept). In the demo, I asked it to switch my Mac to light mode. It did. Then back to dark. Did that too — finding the right toggle in System Settings, clicking it, and verifying the change itself. It's fully screenshot-based, so no Accessibility API needed. If it's on screen, the agent can see it and act on it. This runs entirely on your own hardware — private, local, built from two small open models.

43,979 views

Got a 16GB GPU? You can run all of these right now. Tested 4 Qwen3.5-based models on ToolCall-15 & BugFind-15:
Models:
- Qwen3.5:9b Q8 (Official)
- Qwopus v3 Q8 by Jackrong
- OmniCoder-9B by Tesslate
- Qwen3.5-9b-Sushi-Coder by bigatuna
Summary:
- ToolCall-15: Qwopus v3 went perfect 30/30, Sushi-Coder beat base Qwen3.5
- BugFind-15: OmniCoder flipped the script and took #1 at 83%
No single model won both; that's the fun part. The open source community is cooking.

75,125 views

Wait, what? Asked in French: “nom du modèle” (name of the model?) It replies: “ChatGPT” Credits to Mike Mickelson for the idea

82,758 views

Qwen3.5-27B went 15/15 on our tool-calling benchmark. But which quant should you actually run? Tested Unsloth's Q2_K_XL all the way to Q8_K_XL.
TL;DR:
Q8: 15/15 ✅
Q6: 15/15 ✅
Q5: 14/15
Q4: 14/15
Q3: 14/15
Q2: 13/15
Q6 is the sweet spot: same perfect score as Q8, smaller footprint.
Also, the scores drop steadily as the quants get smaller, so it seems like ToolCall-15 is actually measuring something real.

61,266 views

"I'm not a human." Fed it to Qwen 3.5 0.8B running locally on my Mac Studio M2 Ultra. It solved it. The CAPTCHA is fake. But sending images to the local model? Very real. I'm not breaking the internet. Yet.

69,433 views

MiniMax M3 just dropped — their first natively multimodal model. So I ran it through my form-filling test. (The model has to place each element at the right pixel position on a blank form image, not type into a field.) Verdict: it got everything on the paper. > Name, DOB, ID, gender, marital status, nationality, email, phone, address, postal code, all there. > Best character spacing I've seen yet: it actually calculates the gap between each character, clean across the DOB and number boxes > A few fields slightly misaligned, but every piece of data made it onto the form The reasoning chain is the interesting part: it does the easy fields first, then works into the tight one-char-per-box fields, reasoning through y-coordinates, baselines, and label clearance in obsessive detail. The cost: 40:33 and 126.7k output tokens. That's a long think — but it's MiniMax's first multimodal model, and it nailed the content.

27,383 views

GLM 5.1 just went open-weight on Hugging Face, but how does it compare to GLM 5? I have tested both with the canvas tree challenge. 5.1 thinks longer, but delivers wind animation, sun, clouds, and way more detail. Prompt attached: Write a single HTML file with a full-page canvas, no libraries. Animate a tree that grows from the bottom center of the screen in real time. The trunk grows upward first, then branches split off recursively with slight randomness in angle and length. Each generation of branches should be thinner and slightly lighter in color. When branches reach their final size, add small leaves as soft green circles at the tips. The tree should take about 15 seconds to fully grow. Use warm brown for wood and varied greens for leaves against a soft sky-blue gradient background.

46,657 views

Introducing HermesAgent-20, a new Bench Pack for BenchLocal. 20 scenarios extracted straight from the Hermes Agent source code, run against a REAL Hermes instance. The actual workload you'd put your model through.
Why I built BenchLocal in the first place: most benchmarks are too abstract. We use local LLMs for practical work, and finding the right model for YOUR task efficiently is the single most important thing, especially when you're constrained to what fits on your machine.
BenchLocal is a framework: providers, models, side-by-side comparison, all in one UI. Bench Packs are the unit of testing: ToolCall-15 and BugFind-15 shipped first, and when I launched BenchLocal 0.1.0, I added StructOutput, ReasonMath, InstructFollow, and DataExtract. Now HermesAgent-20 is the newest.
Bench Packs install like VS Code extensions. The SDK is open: write your own, share it, grow the ecosystem. Here's the goal: a community-built, practical evaluation layer for the local LLM space.
Early numbers on HermesAgent-20:
> GLM 5.1: 85
> Gemma4 31B: 83
> Qwen3.5 27B: 79
> MiniMax M2.7: 76
Upgrade to the latest BenchLocal to install HermesAgent-20 (SDK update required).

38,631 views

MiniMax M3 might be the most underrated coding model right now. I gave it nothing but a screenshot of a chaotic 90s GeoCities-style fan page, no HTML source, just the image + the asset files, and told it to rebuild the whole thing as a sleek Apple-style 2026 site. One shot. Through OpenCode. The result is genuinely stunning. It kept the soul (the "stevibe's HyperHome" identity, the visitor counter, the guestbook, the webmaster portrait) and translated every section into clean modern design, gradient hero, proper typography, dark theme, the works.

21,160 views

How well can Qwen3.5 models debug code? I built BugFind-15 — 15 buggy snippets across Python, JS, Rust, and Go. Docker sandbox compiles and validates every fix. Two trap scenarios where the code is correct and the model must resist "fixing" it. Tested every Qwen3.5 size from 0.8B to 397B, plus Jackrong's popular distilled model (V2). The 0.8B scored 5%. The 2B scored 10%. At 4B, debugging ability jumps to 69%. The hardest scenario: BF-03, a Rust trap. The code compiles fine — format! borrows, it doesn't move. Not a single model figured this out. From 0.8B to 397B, every one of them "fixed" a bug that doesn't exist. Category C (subtle bugs — mutable defaults, integer overflow, slice aliasing) was 100% across every model 4B and above. Category D (red herring resistance) told the real story — can it resist fixing code that isn't broken? No model scored above 90%. Small models can't debug. Mid-size models fix obvious bugs but fall for traps. Large models fix the hard bugs but still invent problems that don't exist.

35,006 views
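
For anyone unfamiliar with the "mutable defaults" bug named under Category C, here is the classic Python form of it. This is a generic illustration and not one of the actual BugFind-15 snippets; the function names are made up.

```python
def append_buggy(item, bucket=[]):
    # The default list is created once, when the function is defined,
    # so every call that omits `bucket` shares the same list.
    bucket.append(item)
    return bucket


def append_fixed(item, bucket=None):
    # Use None as a sentinel and build a fresh list on each call.
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket


first = append_buggy("a")
second = append_buggy("b")
print(second)                                  # ['a', 'b']
print(first is second)                         # True
print(append_fixed("a"), append_fixed("b"))    # ['a'] ['b']
```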

Been designing and experimenting with a new benchmark that stresses an underexplored angle: long tool-call chains with traps.
The task: audit 36 packets, read 4 long-context ledgers, dodge retired/staging/wrong-quarter decoys, follow a strict workflow (auth → token → request → answer), submit the exact secret.
Optimal: 52 calls. No call cap. I just measure how many calls each model burns to finish, and how many errors it makes along the way.
Threw 4 popular small models at it:
🥇 Qwen3.6 35B A3B (MoE) → 52 calls. Optimal. Zero errors.
🥈 Qwen3.6 27B (Dense) → 55 calls. Clean.
❌ Gemma4 31B (Dense) → 107 calls, 29 errors, looped writing auth/response.txt and re-reading auth/token.txt forever.
❌ Gemma4 26B A4B (MoE) → gave up at 13 (submitted the wrong answer).
Other models I tested (GLM, DeepSeek) finish fine. So this isn't a task design issue; it's a Gemma4 issue with stateful workflows.
Big models next.

18,438 views
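
The "calls burned versus optimal" measure is easy to express as a ratio. A small sketch using the call counts reported above; the `overhead` function and its definition are my own framing, not necessarily how the benchmark scores runs. The model that gave up at 13 calls is left out because it never finished.

```python
OPTIMAL_CALLS = 52  # optimal path length stated in the post

calls_used = {
    "Qwen3.6 35B A3B": 52,
    "Qwen3.6 27B": 55,
    "Gemma4 31B": 107,
}


def overhead(calls: int, optimal: int = OPTIMAL_CALLS) -> float:
    """Extra calls as a fraction of the optimal path (0.0 means optimal)."""
    return (calls - optimal) / optimal


for model, calls in calls_used.items():
    print(f"{model}: {calls} calls, {overhead(calls):.0%} over optimal")
```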

One prompt. 6 frontier coding models. "Create a realistic fireworks show using HTML Canvas and JavaScript. No libraries." Some built a whole celebration. Others... lit a sparkler. The lineup: - GPT-5.3 Codex - Claude Opus 4.6 - Gemini 3.1 Pro - MiniMax M2.7 - GLM-5 - Kimi K2.5

14,152 views

Videos

This looks like a toy. It's actually the meanest little vision eval I've built. The task: look at an emoji image, then repaint it on a 16×16 grid, one pixel at a time. Just the model, a tiny canvas, and up to 2000 brushstrokes. What I didn't expect was the personalities. > some models REGRET a stroke and go back to repaint it > some get stuck looping the same little patch over and over, like they're trying to animate it > some are calm little surgeons and just nail it first try And the task is genuinely mean: it has to see the image, crush it down to 256 cells, then decide what's actually load-bearing: > the tears on 😂 but still keep the smile > the horn on 🦄 > the antenna on 🤖 and keep the soul of it with almost no resolution to spare. 5 models. 7 emojis. Best of 5 runs each. Side by side. Who's your winner?

283,753 views • 25 days ago

GLM 5.1 vs GLM 5.2 6 advanced HTML canvas challenges: 💧 Ink diffusing in water ⚔️ Energy-blade duel 📱 Slide to unlock 🅿️ 360° parking assist 🔥 Burning letter to ash 🏠 Build-a-house sequence Pure canvas, zero libraries.

346,152 views • 1 month ago

your physics textbook is not boring anymore Hooke's Law with live text reflow around an actual bouncing simulation. 60fps. zero layout thrashing. Cheng Lou what have you unleashed

1,253,133 views • 3 months ago

Claude Opus 4.7 vs 4.8 side-by-side canvas test

475,858 views • 1 month ago

You know that "But, wait..." moment in every LLM thinking trace? I made it visible. I asked 8 models the same tricky probability question and rendered their reasoning as trees. Every time a model rejects its own idea and pivots, every "But...", every "Wait, actually...", a new branch grows. Same question. Completely different minds.

86,285 views • 15 days ago
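
A rough sketch of the branching idea: cut a thinking trace into sentences and start a new branch whenever a sentence opens with a pivot word. The marker list, the sentence splitter, and the sample trace are all assumptions, and the result is a flat list of branches, not the full tree rendering shown in the video.

```python
import re

# Assumed pivot markers: a sentence that opens with "But" or "Wait".
PIVOT = re.compile(r"^(but\b|wait\b)", re.IGNORECASE)


def split_into_branches(trace: str) -> list[list[str]]:
    """Group sentences into branches; a pivot sentence starts a new branch."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", trace) if s.strip()]
    branches: list[list[str]] = [[]]
    for sentence in sentences:
        if PIVOT.match(sentence) and branches[-1]:
            branches.append([])
        branches[-1].append(sentence)
    return branches


# Made-up trace for illustration only.
trace = (
    "Assume the draws are independent. So the answer is 1/4. "
    "But wait, they are not. Condition on the first draw. "
    "Wait, actually the order matters."
)
branches = split_into_branches(trace)
print(len(branches))  # 3
for branch in branches:
    print(branch)
```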

Opus 4.7 first-hour impressions Ran the canvas tree growth test twice. 4.6: nailed the animation both times 4.7: static tree, no growth animation — twice 4.7's thinking is noticeably shorter and faster though (trimmed some 4.6 thinking in the clip for pacing). Not the upgrade direction I expected on this one.

487,981 views • 3 months ago

Qwen3.6 35B-A3B dropped yesterday, so I ran it on 4 GPUs to see how it performs:
🟣 RTX 3090: 49.78 tok/s, TTFT 852ms
🟡 RTX 4090: 118.93 tok/s, TTFT 686ms
🟢 RTX 5090: 160.37 tok/s, TTFT 409ms
🔵 DGX Spark: 59.98 tok/s, TTFT 228ms
I went with ollama as the backend because honestly, it's the easiest way for most people to get started. One command, model pulled, done.
I used Q4_K_M (24GB) across all four cards. The reason is the 3090 and 4090 don't support NVFP4 (only the 5090 and DGX Spark could use it). Keeping the same quant everywhere felt like the fairest way to compare.
And yes, you can absolutely squeeze more performance out of every card with vLLM, SGLang, or TensorRT-LLM. But that's not what this test is about. This is just the out-of-the-box experience for folks who own a GPU and want to try the new model tonight.

398,902 views • 3 months ago
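
For readers who want to reproduce numbers like these, here is one common way to derive TTFT and tok/s from a streamed response. The definitions are assumptions (TTFT as the time from request to first token, decode speed as tokens after the first divided by the time between first and last token); the post does not say exactly how its figures were measured, and the timestamps below are synthetic.

```python
def stream_metrics(request_time: float, token_times: list[float]) -> dict[str, float]:
    """Compute TTFT (ms) and decode speed (tok/s) from token arrival times in seconds."""
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to measure decode speed")
    ttft_ms = (token_times[0] - request_time) * 1000
    decode_seconds = token_times[-1] - token_times[0]
    tokens_per_second = (len(token_times) - 1) / decode_seconds
    return {"ttft_ms": ttft_ms, "tok_per_s": tokens_per_second}


# Synthetic timestamps: first token after 0.5 s, then one token every 10 ms.
times = [0.5 + 0.01 * i for i in range(101)]
metrics = stream_metrics(0.0, times)
print(metrics)
```

Keeping prefill (TTFT) and decode (tok/s) separate matters for results like the DGX Spark line above, where the lowest TTFT in the table does not come with the highest generation speed.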

"Its (Sonnet 5) performance is close to Opus 4.8, at lower prices." So I ran 4 canvas tests through both.
> Opus 4.8, 4/4 actually animating.
> Sonnet 5, 2/4 came back as static images.
And "lower price"? On the paper shredder task, Sonnet 5 spent $0.36 for a static image. Opus 4.8 spent $0.18 and it actually animated.
The 4 tests:
> Win 98 drag-to-BSOD
> Self-typing keyboard + CRT
> Letter burning
> Paper shredder

60,293 views • 19 days ago

I gave two MoE models the same vibe coding challenge Qwen3.6 35B A3B (31.8GB) vs Gemma4 26B A4B (23.3GB) Stack: > Unsloth Q6_K_XL > llama.cpp > Model-card recommended sampling for each 4 prompts, side-by-side. Which one do you think wins?

261,895 views • 3 months ago

I gave Kimi K2.6 and K2.7 Code the exact same prompt to animate a letter burning to ash. Pure HTML canvas, zero libraries.

89,828 views • 1 month ago

MiniMax M2.7 is 230B params. Can you actually run it at home? I tested Unsloth's UD-IQ3_XXS (80GB) on 4 different rigs:
🟠 4x RTX 4090 (96GB): 71.52 tok/s, TTFT 1045ms
🟢 4x RTX 5090 (128GB): 120.54 tok/s, TTFT 725ms
🟡 1x RTX PRO 6000 (96GB): 118.74 tok/s, TTFT 765ms
🟣 DGX Spark (128GB): 24.41 tok/s, TTFT 741ms
Backend: llama.cpp. Context: 32k. Max tokens: 4096.
I went with IQ3_XXS because it's the biggest quant that fits in 96GB VRAM while still leaving safe headroom for 32k context. Same quant across all four rigs, fairest comparison I could run.
Now look at rough peak GPU power draw:
🟠 4x4090 → 1,800W peak (450W × 4)
🟢 4x5090 → 2,300W peak (575W × 4)
🟡 RTX PRO 6000 → 600W peak
🟣 DGX Spark → 240W peak (whole system)
The RTX PRO 6000 is the quiet winner. One card, 96GB, matching a 4x5090 rig at roughly a quarter of the power and zero multi-GPU headaches. Best tokens-per-watt by a wide margin.
DGX Spark is slow on generation but pulls the least power of any rig here, around 240W for the whole system. Prefill-friendly, memory-rich, wall-socket-friendly.
And yes, plenty of people cap their cards. Even then, 4x 4090 or 4x 5090 still pulls well over 1,200W from the GPUs alone.

191,782 views • 3 months ago
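The tokens-per-watt claim in the MiniMax M2.7 post can be checked with quick arithmetic. This is only a rough sketch: it divides the quoted generation speed by the quoted peak power, and peak power is an upper bound rather than a measured draw during the run. The DGX Spark figure is whole-system while the others are GPU-only, so the comparison is not perfectly like-for-like. The names `rigs` and `tokens_per_joule` are made up for illustration.

```python
# Throughput (tok/s) and rough peak power (W), exactly as quoted in the post.
rigs = {
    "4x RTX 4090": (71.52, 1800),
    "4x RTX 5090": (120.54, 2300),
    "1x RTX PRO 6000": (118.74, 600),
    "DGX Spark": (24.41, 240),  # whole-system figure, not GPU-only
}


def tokens_per_joule(tok_per_s, watts):
    # tok/s divided by W (joules per second) gives tokens per joule,
    # which is the same number as "tok/s per watt".
    return tok_per_s / watts


efficiency = {name: tokens_per_joule(t, w) for name, (t, w) in rigs.items()}

# Most efficient rig first.
ranking = sorted(efficiency, key=efficiency.get, reverse=True)

for name in ranking:
    print(f"{name}: {efficiency[name]:.3f} tok/s per watt")
```

With capped cards or measured wall power the absolute numbers would shift, so treat the ordering as indicative only.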

3 ways to destroy a piece of paper. Qwen 3.5 35B A3B vs. Ornith 1.0 35B, side-by-side canvas test. (Why 3.5 not 3.6? Ornith is post-trained on Qwen 3.5 and Gemma 4, so this shows what the post-training adds.) Same 3 challenges: 🔪 Slice: three blade swipes, fruit-game style 📄 Shredder: desktop strip-cut 🗑️ Crumple: balled up and tossed Winner: not close. Ornith, decisively. The post-training quality is REAL.

47,262 views • 23 days ago

Qwen3.5:9b reasoning head-to-head: Mac Studio M2 Ultra 64GB: 43.08 tok/s Mac Mini M4 16GB: 13.07 tok/s Qwen

243,319 views • 4 months ago

Meituan's LongCat-2.0 reportedly lands near GPT-5.5 on SWE-bench. So I threw 5 HTML canvas animation prompts at both. 🥷 Paper sliced fruit-ninja style. 💧 An ink drop diffusing in water. 🔥 A letter burning. 🗑️ Paper crumpling into a ball. ✂️ A strip-cut shredder. Here's how they did 👇

34,669 views • 20 days ago

Finally got my hands on the big one. Qwen3.5-122B-A10B: 122 billion parameters. Too big for any single consumer GPU. So I rented 4 of each... and then one professional card to see if brute force even matters.
- 1x RTX PRO 6000 (96GB): 101.4 tok/s
- 4x 5090 (128GB): 87.0 tok/s
- 4x 4090 (96GB): 25.1 tok/s
- 4x 3090 (96GB): 20.8 tok/s
One single $8,500 card beat four RTX 5090s.

195,832 views • 4 months ago

Mistral OCR 4 just dropped with bounding boxes (their most-requested feature) so I plugged it into my form-filling test as the helper model. Qwen3.6 reasons, Mistral localizes. Result? Boxes detected, fields filled, mostly landing in the lines. Not pixel-perfect. But close? Yeah, I'll call it close.

41,430 views • 27 days ago

Completed a first-hour side-by-side comparison between Qwen3.5 27b and Qwen3.6 27b on the same 4 canvas coding tests. Qwen3.6 27b was running as FP8 on vLLM. What do you think?

114,191 views • 2 months ago

Qwen3.6 27B landed yesterday, so I ran it on 4 setups side-by-side to see how they stack up:
🔴 RTX 4090: 45.59 tok/s, TTFT 525ms
🟢 RTX 5090: 51.83 tok/s, TTFT 752ms
⚫️ M2 Ultra: 22.30 tok/s, TTFT 216ms
🟣 DGX Spark: 11.08 tok/s, TTFT 319ms
This is a standard test: no tuning, just the out-of-the-box experience. For the NVIDIA cards I used llama.cpp with Unsloth's UD-Q4_K_XL quant. For the M2 Ultra I used MLX with Unsloth's UD-MLX-4bit quant, since MLX is the native path on Apple Silicon. Please treat this as the baseline. You can definitely squeeze more out of every one of these with fine-tuned settings.

104,345 views • 2 months ago
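TTFT and tok/s pull in different directions in the Qwen3.6 27B numbers: the M2 Ultra answers first, the NVIDIA cards stream faster. A back-of-the-envelope way to see which setup finishes a reply first is sketched below. It assumes a simplified model that is not from the post: total time ≈ TTFT + tokens ÷ generation speed, with a steady generation rate. `response_seconds` and `fastest` are hypothetical helper names.

```python
# (generation tok/s, TTFT in ms) as quoted in the post.
setups = {
    "RTX 4090": (45.59, 525),
    "RTX 5090": (51.83, 752),
    "M2 Ultra": (22.30, 216),
    "DGX Spark": (11.08, 319),
}


def response_seconds(tok_per_s, ttft_ms, n_tokens):
    # Assumed model: wait TTFT, then emit n_tokens at a constant rate.
    return ttft_ms / 1000 + n_tokens / tok_per_s


def fastest(n_tokens):
    # Setup with the lowest estimated end-to-end time for a reply of n_tokens.
    return min(
        setups,
        key=lambda name: response_seconds(*setups[name], n_tokens),
    )


for n in (5, 50, 500):
    print(f"{n} tokens: fastest is {fastest(n)}")
```

Under this model, very short replies favour low TTFT while long replies favour raw generation speed. Real runs also depend on prompt length and settings, which the sketch ignores.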

So we know Gemma 4 is good at tool calling, but what about web coding? I threw 4 UI screenshots at three Gemma 4 models and said "rebuild this": one shot, no hand-holding, just image in, code out.
Model lineup:
- E4B
- 26B A4B (MoE)
- 31B Dense
(skipped the E2B this round)
Let me know which one you think cooked the hardest.

124,839 views • 3 months ago

The RTX 3090 is a 5-year-old GPU and it still runs a 27B model at 20 tok/s. I tested Qwen3.5:27b across 3 generations of NVIDIA:
5090 → ~60 tok/s
4090 → ~40 tok/s
3090 → ~20 tok/s
Almost perfectly linear scaling: each generation adds roughly 20 tok/s.

142,493 views • 4 months ago
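A quick way to check the scaling pattern in the 3090 / 4090 / 5090 numbers is to look at both the absolute step and the ratio between neighbouring generations. The sketch below uses only the rounded figures quoted in the post. The variable names are illustrative.

```python
# Approximate generation speed (tok/s) for Qwen3.5:27b, as quoted in the post.
speeds = {"3090": 20, "4090": 40, "5090": 60}
order = ["3090", "4090", "5090"]  # oldest to newest

# Absolute gain per generation: constant steps mean linear scaling.
steps = [speeds[new] - speeds[old] for old, new in zip(order, order[1:])]

# Speed-up ratio per generation: a constant 2.0 would mean doubling each time.
ratios = [speeds[new] / speeds[old] for old, new in zip(order, order[1:])]

print("tok/s gained per generation:", steps)
print("speed-up per generation:", ratios)
```

Equal steps point to linear growth, while equal ratios would point to multiplicative growth. With only three rounded data points, either reading is tentative.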