How well can Qwen3.5 models debug code? I built BugFind-15: 15 buggy snippets across Python, JS, Rust, and Go. A Docker sandbox compiles and validates every fix. It includes two trap scenarios where the code is correct and the model must resist "fixing" it.

I tested every Qwen3.5 size from 0.8B to 397B, plus Jackrong's popular distilled model (V2). The 0.8B scored 5%. The 2B scored 10%. At 4B, debugging ability jumps to 69%.

The hardest scenario: BF-03, a Rust trap. The code compiles fine because format! borrows its arguments, it doesn't move them. Not a single model figured this out. From 0.8B to 397B, every one of them "fixed" a bug that doesn't exist.

Category C (subtle bugs: mutable defaults, integer overflow, slice aliasing) was 100% across every model 4B and above. Category D (red herring resistance) told the real story: can a model resist fixing code that isn't broken? No model scored above 90%.

Small models can't debug. Mid-size models fix obvious bugs but fall for traps. Large models fix the hard bugs but still invent problems that don't exist.
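To make one of the Category C bug classes concrete, here is a minimal Python sketch of a mutable default argument bug and its usual fix. It is an illustration only, not one of the BugFind-15 snippets, and the function names `append_buggy` and `append_fixed` are hypothetical.

```python
def append_buggy(item, bucket=[]):
    # Bug: the default list is created once, when the function is defined,
    # so every call that omits `bucket` mutates the same shared list.
    bucket.append(item)
    return bucket


def append_fixed(item, bucket=None):
    # Fix: use None as a sentinel and build a fresh list on each call.
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket
```

Calling `append_buggy` twice without a second argument returns the same list object both times, so the second result still contains the first item. `append_fixed` returns a new one-element list on each call.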

stevibe

20,708 subscribers

35,006 views • 2 months ago • via X (Twitter)

Related Videos

Got a 16GB GPU? You can run all of these right now. Tested 4 Qwen3.5-based models on ToolCall-15 & BugFind-15:

Models:
- Qwen3.5:9b Q8 (Official)
- Qwopus v3 Q8 by Jackrong
- OmniCoder-9B by Tesslate
- Qwen3.5-9b-Sushi-Coder by bigatuna

Summary:
- ToolCall-15: Qwopus v3 went perfect 30/30, Sushicoder beat base Qwen3.5
- BugFind-15: Omnicoder flipped the script and took #1 at 83%

No single model won both, that's the fun part. Open source community is cooking.

stevibe

75,125 views • 2 months ago

Qwen3.6 35B A3B can't fill out a paper form on its own. But give it NVIDIA's LocateAnything-3B, the #1 trending model on HuggingFace, as its eyes, and the two small models get it done together. (The test: place each element at the right pixel position on a blank form image, not type into a field.)

Setup:
> Qwen is the brain (main model), LocateAnything is the eyes (helper model acting as a tool).
> I gave Qwen a new tool: ask "where's the email field?" and LocateAnything returns the exact x, y, width, height.
> The blue boxes on the screen are its detections. Look how tight they are; it nails every field.

Result:
> Qwen3.6 35B A3B + LocateAnything-3B: form completed, all info correct.
> Name, DOB, ID, gender, marital status, nationality, email, phone, address, postal code: all landed in the right field areas.
> Character-box alignment still a touch loose, but every value is where it belongs.
> 9m10s, 224.5k input, 24.3k output, 21 turns.

Why it matters:
> Qwen alone can't finish this test. Bolt on a 3B model that does exactly one thing, locate, and suddenly it can.
> A combination of small models can do the work of a single large one.

stevibe

146,752 views • 10 days ago

"I'm not a human." Fed it to Qwen 3.5 0.8B running locally on my Mac Studio M2 Ultra. It solved it. The CAPTCHA is fake. But sending images to the local model? Very real. I'm not breaking the internet. Yet.

stevibe

69,433 views • 3 months ago

✨ I revived my first AI startup from 6 years ago with Claude Code [ 💡 ] Back then it used GPT-3 (this was 2 years before ChatGPT existed!) to generate new startup ideas which then people can vote on And the best startup ideas rise to the top! Back then I made it because people complained they didn't have any ideas to build a startup This week I moved it to its own VPS and installed Claude Code and told it to fix everything, the DB had become big and there was stupid write operations on every page load that it made it very slow Claude Code is excellent at fixing all those small bugs from old projects and quickly fixing them As Garry Tan says "boil the oceans" as in before I'd not have the time to fix these kinds of projects, it wouldn't be worth it, I mean IdeasAI doesn't even make money, but now it takes me an hour to do this and it works again! I also upgraded GPT-3 to xAI's Grok 4.2 for new startup ideas

@levelsio

179,845 views • 2 months ago

My first PhD paper!🎉We learn *diffusion* models for code generation that learn to directly *edit* syntax trees of programs. The result is a system that can incrementally write code, see the execution output, and debug it. 🧵1/n

Shreyas Kapur

742,386 views • 2 years ago

🚨Anthropic just gave Claude Code eyes and hands. read that again. it can now open apps on mac, click around macOS, find bugs visually, screenshot them, fix the code, rebuild, and verify the fix. one prompt. absolute zero human input. full autonomy. seems like Claude Code is becoming a super app. i’m so here for it.

sui ☄️

27,042 views • 2 months ago

let me save you 3 hours of head scratching. if you're running local models like Qwen3.5-35B-A3B through Claude Code via llama.cpp's Anthropic endpoint, the chain will break every 3 to 5 minutes. tool call fails. flow stops. you reprompt. it recovers. 2 minutes later it stops again. the model is fine. the harness chokes on local inference latency. switch to OpenCode. same localhost endpoint. same model. same GPU. the chain doesn't break. the tradeoff: OpenCode sometimes loops. the model forgets what it already read and repeats the same tool call. but a loop you can interrupt. a broken chain kills your momentum and you start over. watch both side by side. proprietary agent vs open source agent. same 3B model. different failure modes. pick your poison.

Sudo su

72,501 views • 3 months ago

MiniMax is the James Bond of AI agents. It uses the world's first open-weight model (MiniMax-M1), and it squeezes every bit of power from it. The agent takes a prompt and does more than any other agent in the market right now: 1. It can do Deep Research 2. It can write code 3. It can design web pages 4. It can build 3D models I built 5 different experiences using MiniMax and recorded them for you:

Santiago

44,730 views • 11 months ago

Qwen3.5-35B-A3B is now in Jan 🔥 It surpasses previous Qwen3 models more than 6× its size. Get the latest Jan at Thanks to Qwen for the base model and Georgi Gerganov for llama.cpp 💛

👋 Jan

34,433 views • 3 months ago

✨ Made a new mini feature on Photo AI: [ Grab from 3d model ] So the problem is we're at that stage in time (typical for AI) where image-to-3d models are not good enough but are fun to play with, but we know they'll be good enough in 1-2 years With [ Make 3d model ] you already can turn any Photo AI pic into a 3d model but it still looks hyper clunky and deformed, but it works! One cool idea I had to make that more useful and made now: Let people make a 3d model then change the view of the it with the 3d viewer, then press [ o ] and it grabs a frame of the 3d That image you can then [ Remix ] (img2img), and it becomes a real photo again and that in turn you can then turn into a video again with [ Make video ] So that essentially gives you a fully freeform camera position control to take photos with One thing I need to fix is the background/skybox, I kinda need to take the original photo and remove the person and just get the background for the 3d model viewer, in this case it should be white, but it's a start!

@levelsio

119,210 views • 11 months ago

Working with multiple models in Chat? The model picker in VS Code is now organized by provider, making it easier to browse, search, and switch between models. You'll also see provider names next to your recent models for quicker recognition. 💡 Tip: Use /models for quick access.

Visual Studio Code

32,599 views • 29 days ago

Introducing DeepSeek Coder! - SOTA large coding models with params ranging from 1.3B to 33B. - Building games, testing code, fixing bugs, and analyzing data... You dream it, we make it. - Free for commercial use and fully open-source. Try it out now at

DeepSeek

162,557 views • 2 years ago

My project has 39,205 lines of code, and Cursor can't answer questions about it. Cursor's context seems to be capped at around 10,000 tokens. Unfortunately, this is not enough for any decent-sized project. If you have a large codebase, check out Augment Code. This thing is faaaast! I'm currently using their Visual Studio Code plugin, but you can also use them on JetBrains, Neovim, and even Vim. (I'm a Neovim fan, but Copilot's implementation for Neovim is nowhere as good as Augment Code.) Augment Code was gracious enough to sponsor this post. After you install their extension and run it for the first time, it will index your entire codebase. This is why it can answer questions as fast as it does, regardless of the size of your codebase. Augment Code supports chat and completions like every other AI coding assistant, but its killer feature is "Next Edit." When you make a change, two things happen: 1. The model analyzes the change to determine the ripple effects across your *entire* codebase. 2. The model suggests everything you need to update to ensure everything works correctly. This is pretty wild!

Santiago

247,775 views • 1 year ago

Not every task needs the same model. A quick summary doesn't need the same horsepower as a deep research question, and it shouldn't cost the same either. You can now easily compare models for your Custom Agent on speed, intelligence, and cost 🫡

Notion

32,420 views • 3 months ago

Introducing HermesAgent-20, a new Bench Pack for BenchLocal. 20 scenarios extracted straight from the Hermes Agent source code, run against a REAL Hermes instance. The actual workload you'd put your model through.

Why I built BenchLocal in the first place: most benchmarks are too abstract. We use local LLMs for practical work, and finding the right model for YOUR task efficiently is the single most important thing, especially when you're constrained to what fits on your machine.

BenchLocal is a framework: providers, models, side-by-side comparison, all in one UI. Bench Packs are the unit of testing: ToolCall-15 and BugFind-15 shipped first, and when I launched BenchLocal 0.1.0, I added StructOutput, ReasonMath, InstructFollow, and DataExtract. Now, HermesAgent-20 is the newest.

Bench Packs install like VS Code extensions. The SDK is open: write your own, share it, grow the ecosystem. Here's the goal: a community-built, practical evaluation layer for the local LLM space.

Early numbers on HermesAgent-20:
> GLM 5.1: 85
> Gemma4 31B: 83
> Qwen3.5 27B: 79
> MiniMax M2.7: 76

Upgrade to the latest BenchLocal to install HermesAgent-20 (SDK update required).

stevibe

38,620 views • 1 month ago

I find it funny how they made two versions of the same Disney Channel Wand ID with the models of "Twice Upon a Christmas" and "Mickey Mouse Clubhouse" and realized that the MMC model was superior. Apparently the Twice model was for that scrapped "Search Of Mickey Mouse" film.

Sebastián Córdova

640,570 views • 5 months ago

PAYING PER MODEL IS THE DUMBEST THING IN TECH RIGHT NOW i was paying 3x what i needed to for AI inference the grid lets you buy a quality spec instead of a specific model.. it routes every request in real time to the cheapest option that qualifies swap one url and your code keeps working exactly the same openai-compatible, one line to switch, 200M free tokens to start

Robin Delta

15,729 views • 16 days ago

Coding just changed forever. You can now vibe-code with Windsurf and auto-fix bugs with CodeRabbit. Here’s how I vibe-fix my code + (how to try free):

Robin Delta

58,517 views • 1 year ago

Very strange behavior in Claude Code: after specifically telling Claude to use gemini-3-flash-preview, (and even explaining that it's a new model that just came out), Claude refuses to use it, getting stuck in a loop of making non-edits which keep the old version. It gets skeptical and apparently doesn't believe that the model exists (?)

Benjamin De Kraker

34,305 views • 5 months ago