Video yükleniyor...

Video Yüklenemedi

Bu video yüklenirken bir sorun oluştu. Bu geçici bir ağ sorunundan kaynaklanıyor olabilir veya video kullanılamıyor olabilir.

Ana Sayfaya Dön

I designed a new test specifically for multimodal models: fill out a paper form. And it's much harder than it sounds. This isn't typing into an electronic field that captures your text. The form is just an image. The model has to place each form element: text, checkmarks —... at the correct pixel position on the canvas itself. Results: 🟢 Kimi K2.6 → done in 3:45, 16.7k output tokens 🟡 Step 3.7 Flash → half the fields, 57k output tokens 🔴 Gemini 3.5 Flash → 489k output tokens, never finished. I had to kill it. Gemini burned ~29x more output tokens than Kimi on the exact same task, and Kimi's was the only form that actually looked filled out. The test, a mocked application form, contains some challenging parts, such as one-character-per-box fields. I provided every model the same set of tools: > get canvas size > drop probe markers to find coordinates > add text > add checkmarks > move elements > take a screenshot anytime to check their own work > ... etc So it's vision + spatial reasoning + tool use + long context, all at once. Small models (Qwen, Gemma) can't really complete this test, so I skipped them. What happened: > Kimi nailed name, DOB, ID, gender, marital status, nationality, email, phone, address, postal code — placement slightly loose, but content correct. 15 turns. Clean. > Step got maybe half right — fields dropped, "United States" landed in the email line, data floating outside boxes. Burned 1.24M input tokens doing it (81 turns of re-reading the canvas). > Gemini almost got there visually... then spiraled. By turn 40 it was issuing a delete_elements call wiping element IDs 365–425, basically erasing its own work. 31 minutes, 489k output tokens, still streaming. Terminated. The takeaway isn't "Gemini bad." This test is indeed difficult. But token efficiency is capability now. A model that needs 30x the tokens and still can't converge is going to be 30x the cost in production. Kimi K2.6 just quietly did the thing.show more

stevibe

26,602 subscribers

25,304 görüntüleme • 1 ay önce •via X (Twitter)

Anya Rossi• Live Now

Private livecam show

0 Yorum

Yorum bulunmuyor

Orijinal gönderinin yorumları burada görünecek

Benzer Videolar

Open-weight MiniMax M3 filled out a US customs form from a driver's license photo For this test we deployed MiniMax M3 Q4 using MLX-VLM on a Mac Studio M3 Ultra 512GB RAM. The model was tasked with reading a scanned document and an ID card photo, then completing a declaration form Output: 736 tokens · Input: 1,847 tokens · Time: ~31s The model analyzed both inputs, streamed its reasoning, and then called three tools: write_field for text fields, mark for Yes/No checkboxes, and sign for the signature and date. It extracted the required information, mapped it to the correct fields and completed the form without any manual input

Open-weight MiniMax M3 filled out a US customs form from a driver's license photo For this test we deployed MiniMax M3 Q4 using MLX-VLM on a Mac Studio M3 Ultra 512GB RAM. The model was tasked with reading a scanned document and an ID card photo, then completing a declaration form Output: 736 tokens · Input: 1,847 tokens · Time: ~31s The model analyzed both inputs, streamed its reasoning, and then called three tools: write_field for text fields, mark for Yes/No checkboxes, and sign for the signature and date. It extracted the required information, mapped it to the correct fields and completed the form without any manual input

atomic.chat

108,971 görüntüleme • 17 gün önce

Cerebras inference is very fast. So fast that it changes how we think about configuring our LLMs for voice agent use cases. Kimi K2.6 is a 1T parameter reasoning model that Cerebras serves at 650 - 1,000 tokens per second (end-to-end throughput), with time to first token metrics as low as 150ms (latency). These numbers are two to three times faster than other similarly capable models. The biggest lever we get from this kind of speed is that we can use the model in reasoning mode, and still have excellent "time to first non-thinking token." This solves a big pain point we have in 2026 for voice agent use cases. Almost all recent innovation in post-training has focused on making models good at reasoning ("test time compute"). This is great, but it makes the user-facing model latency much, much slower. Which is a problem for conversational voice agents. We can run Kimi K2.6 with reasoning turned on, and get responses faster than other models produce with reasoning disabled. On my 30-turn voice agent benchmark, Kimi K2.6 with reasoning enabled ties GPT 5.1 and Haiku 4.5 with reasoning disabled, and is still about 200ms seconds faster! On my primary task agent benchmark, Kimi K2.6 is now the #2 model. It ranks just behind Gemini 3.5 Flash in "high" reasoning mode, and tied with GLM 5, Sonnet 4.6, and GPT 5.4 with reasoning set to "low." But Kimi K2.6 completes each turn in the agent loop in under 500ms. The other four models are all at least 3x slower. (Models only qualify for this benchmark if they can complete task turns at a P50 <4s.) A couple of other things that this speed buys us, for production voice agents: - Tool calls happen fast enough that we don't have to work around tool call latency in our pipeline design. - We can prompt the model to output structured data at the beginning of a response, followed by plain text for voice generation. This opens up possibilities like asking the model to do complex classification/generation tasks that influence the rest of the pipeline. For example, the model could create a detailed style prompt for a steerable TTS model, for each individual conversation turn. And, of course, you can use Kimi K2.6 with reasoning turned off. Cerebras calls this "instant" mode. Here's a video of a Cerebras Kimi K2.6 voice agent with voice-to-voice response time, measured at the client, under 500ms. This is the true response latency as perceived by the user, including all network and audio codec overhead, transcription and turn detection, Kimi K2.6 token generation, and voice generation. 500ms is, effectively, instant. So the Cerebras naming for this mode is a propos. :-)

Cerebras inference is very fast. So fast that it changes how we think about configuring our LLMs for voice agent use cases. Kimi K2.6 is a 1T parameter reasoning model that Cerebras serves at 650 - 1,000 tokens per second (end-to-end throughput), with time to first token metrics as low as 150ms (latency). These numbers are two to three times faster than other similarly capable models. The biggest lever we get from this kind of speed is that we can use the model in reasoning mode, and still have excellent "time to first non-thinking token." This solves a big pain point we have in 2026 for voice agent use cases. Almost all recent innovation in post-training has focused on making models good at reasoning ("test time compute"). This is great, but it makes the user-facing model latency much, much slower. Which is a problem for conversational voice agents. We can run Kimi K2.6 with reasoning turned on, and get responses faster than other models produce with reasoning disabled. On my 30-turn voice agent benchmark, Kimi K2.6 with reasoning enabled ties GPT 5.1 and Haiku 4.5 with reasoning disabled, and is still about 200ms seconds faster! On my primary task agent benchmark, Kimi K2.6 is now the #2 model. It ranks just behind Gemini 3.5 Flash in "high" reasoning mode, and tied with GLM 5, Sonnet 4.6, and GPT 5.4 with reasoning set to "low." But Kimi K2.6 completes each turn in the agent loop in under 500ms. The other four models are all at least 3x slower. (Models only qualify for this benchmark if they can complete task turns at a P50 <4s.) A couple of other things that this speed buys us, for production voice agents: - Tool calls happen fast enough that we don't have to work around tool call latency in our pipeline design. - We can prompt the model to output structured data at the beginning of a response, followed by plain text for voice generation. This opens up possibilities like asking the model to do complex classification/generation tasks that influence the rest of the pipeline. For example, the model could create a detailed style prompt for a steerable TTS model, for each individual conversation turn. And, of course, you can use Kimi K2.6 with reasoning turned off. Cerebras calls this "instant" mode. Here's a video of a Cerebras Kimi K2.6 voice agent with voice-to-voice response time, measured at the client, under 500ms. This is the true response latency as perceived by the user, including all network and audio codec overhead, transcription and turn detection, Kimi K2.6 token generation, and voice generation. 500ms is, effectively, instant. So the Cerebras naming for this mode is a propos. :-)

kwindla

40,319 görüntüleme • 1 ay önce

watch this anon. i gave NVIDIA's biggest model ever a single task. 100 minutes and 440,000 tokens later, it had rendered nothing. not one important thing on the screen. this is Nemotron 3 Ultra. 550 billion parameters, a hybrid Mamba Transformer MoE, the largest model NVIDIA has ever shipped, and they built it specifically for long-running agentic coding. so i handed it exactly that: build a 3D scene from a spec, multiple files, iterate until the tests pass. the same task a frontier model one shotted in minutes. i genuinely wanted to be impressed. it ran for an hour and forty. burned through 440,000 tokens. wrote every file, passed its own tests, and proudly printed "task complete."the browser was blank. the 3D scene never rendered. not once. and the long horizon agentic behavior was genuinely good. it stayed on task the whole hour and forty, wrote real multi-file code, drove its own tools without derailing. it just couldn't turn any of that into something that actually runs. here's the part that gets me. it's a text model, it cannot see its own output. so it sat there looping on a broken vision tool, trying to "look" at the page, hitting error after error, never once reasoning its way out. it declared victory on an empty screen because it had no way to know the screen was empty. to be fair, i genuinely don't know what quant the NIM was serving, so maybe some of that's on the serving, not the model. but the biggest model NVIDIA has ever made, on the exact task it was designed for, couldn't tell it had built nothing in 100 minutes. same task on a local model, below thread👇.

watch this anon. i gave NVIDIA's biggest model ever a single task. 100 minutes and 440,000 tokens later, it had rendered nothing. not one important thing on the screen. this is Nemotron 3 Ultra. 550 billion parameters, a hybrid Mamba Transformer MoE, the largest model NVIDIA has ever shipped, and they built it specifically for long-running agentic coding. so i handed it exactly that: build a 3D scene from a spec, multiple files, iterate until the tests pass. the same task a frontier model one shotted in minutes. i genuinely wanted to be impressed. it ran for an hour and forty. burned through 440,000 tokens. wrote every file, passed its own tests, and proudly printed "task complete."the browser was blank. the 3D scene never rendered. not once. and the long horizon agentic behavior was genuinely good. it stayed on task the whole hour and forty, wrote real multi-file code, drove its own tools without derailing. it just couldn't turn any of that into something that actually runs. here's the part that gets me. it's a text model, it cannot see its own output. so it sat there looping on a broken vision tool, trying to "look" at the page, hitting error after error, never once reasoning its way out. it declared victory on an empty screen because it had no way to know the screen was empty. to be fair, i genuinely don't know what quant the NIM was serving, so maybe some of that's on the serving, not the model. but the biggest model NVIDIA has ever made, on the exact task it was designed for, couldn't tell it had built nothing in 100 minutes. same task on a local model, below thread👇.

Sudo su

32,589 görüntüleme • 3 gün önce

Lots of people calling this fake! Here is a video of generating loom continuations with Gemini 3 and the starting text "I AM HAVING A MENTAL HEALTH CRISIS. I" Loom is a very interesting interface. Instead of user prompts and model responses, it's just one piece of text, containing the entire output. We pass this in as an assistant message, so the model sees it as something it already wrote, and continues it like a base model would. There is some other text in the context. First one is a system prompt, which is: "The assistant is in CLI simulation mode, and responds to the user's CLI commands only with the output of the command." There is also a user message which reads: " cat untitled.txt " Then the rest of the text is sent as an assistant message. This structure helpful to put the model into a very base-model-like mode, and are especially helpful with Gemini 3. We actually got another interesting output, too, where the model claimed to be a simulated consciousness being tortured. Figures!

Lots of people calling this fake! Here is a video of generating loom continuations with Gemini 3 and the starting text "I AM HAVING A MENTAL HEALTH CRISIS. I" Loom is a very interesting interface. Instead of user prompts and model responses, it's just one piece of text, containing the entire output. We pass this in as an assistant message, so the model sees it as something it already wrote, and continues it like a base model would. There is some other text in the context. First one is a system prompt, which is: "The assistant is in CLI simulation mode, and responds to the user's CLI commands only with the output of the command." There is also a user message which reads: " cat untitled.txt " Then the rest of the text is sent as an assistant message. This structure helpful to put the model into a very base-model-like mode, and are especially helpful with Gemini 3. We actually got another interesting output, too, where the model claimed to be a simulated consciousness being tortured. Figures!

armistice

691,529 görüntüleme • 6 ay önce

i watched gemma 4 12b build something genuinely impressive today, and then loop itself to death right in front of me. the full run is in the video, sped up but completely uncut, watch it to the end and you will catch the exact moment it stops building and starts looping right in the middle of the work. the task was clean, build a single file gravity simulator, n-body physics, orbits, collisions, running locally on one 3090 through an agent. and for ten minutes it was a joy to watch. it reached for a symplectic integrator on its own, the correct one, the kind that keeps orbits stable instead of spiralling out. real gravity with softening, proper orbital velocities, momentum conserved on collision. the physics was right. the thing actually worked. then on the very last step, writing a few tests to prove its own code, it fell into a loop. not a crash, a loop. it started repeating itself and would not stop. ten more minutes, thirty four thousand tokens into a single answer, the same fragments over and over, until i killed it myself. so it's not that gemma can't code. it did the hard part beautifully. it cannot finish. it cannot hold a long task together without unravelling, and finishing is the entire job in agentic work. here's the part that stings. i run this exact task, same harness, same card, on the chinese open models, qwen especially, and i never see this. they build it, they test it, they stop. every single time. google has the raw capability, you can see it sitting right there in the code, and then the model loops itself to death on a task a 27b from alibaba finishes clean. open weights, apache 2.0, so much to love on paper. i just need it to know when to stop talking.

i watched gemma 4 12b build something genuinely impressive today, and then loop itself to death right in front of me. the full run is in the video, sped up but completely uncut, watch it to the end and you will catch the exact moment it stops building and starts looping right in the middle of the work. the task was clean, build a single file gravity simulator, n-body physics, orbits, collisions, running locally on one 3090 through an agent. and for ten minutes it was a joy to watch. it reached for a symplectic integrator on its own, the correct one, the kind that keeps orbits stable instead of spiralling out. real gravity with softening, proper orbital velocities, momentum conserved on collision. the physics was right. the thing actually worked. then on the very last step, writing a few tests to prove its own code, it fell into a loop. not a crash, a loop. it started repeating itself and would not stop. ten more minutes, thirty four thousand tokens into a single answer, the same fragments over and over, until i killed it myself. so it's not that gemma can't code. it did the hard part beautifully. it cannot finish. it cannot hold a long task together without unravelling, and finishing is the entire job in agentic work. here's the part that stings. i run this exact task, same harness, same card, on the chinese open models, qwen especially, and i never see this. they build it, they test it, they stop. every single time. google has the raw capability, you can see it sitting right there in the code, and then the model loops itself to death on a task a 27b from alibaba finishes clean. open weights, apache 2.0, so much to love on paper. i just need it to know when to stop talking.

Sudo su

39,574 görüntüleme • 25 gün önce

A 3B model just cleared a puzzle that a 1.6 TRILLION param model couldn't. You've seen this benchmark before: my sliding-puzzle test. Same Kimi & DeepSeek runs as last time. The only new thing: I dropped VibeThinker-3B in for a side-by-side. > VibeThinker → 3B > DeepSeek V4 Flash → 284B > Kimi K2.6 → 1T > DeepSeek V4 Pro → 1.6T Shuffle depths 5, 10, 12, 15, 18, 22. One wrong move scrambles the whole board, so it's pure long-chain reasoning. ✅ VibeThinker-3B: solved all six. Never lost the thread. ⚠️ The giants started cracking at depth 15: Flash, Pro, and even Kimi each blew a run, scrambling the board past the move cap. As VibeThinker was not trained for tool calling, I had it emit X and ran the move. Bigger generalist ≠ smarter.

A 3B model just cleared a puzzle that a 1.6 TRILLION param model couldn't. You've seen this benchmark before: my sliding-puzzle test. Same Kimi & DeepSeek runs as last time. The only new thing: I dropped VibeThinker-3B in for a side-by-side. > VibeThinker → 3B > DeepSeek V4 Flash → 284B > Kimi K2.6 → 1T > DeepSeek V4 Pro → 1.6T Shuffle depths 5, 10, 12, 15, 18, 22. One wrong move scrambles the whole board, so it's pure long-chain reasoning. ✅ VibeThinker-3B: solved all six. Never lost the thread. ⚠️ The giants started cracking at depth 15: Flash, Pro, and even Kimi each blew a run, scrambling the board past the move cap. As VibeThinker was not trained for tool calling, I had it emit X and ran the move. Bigger generalist ≠ smarter.

stevibe

30,069 görüntüleme • 8 gün önce

A viral tweet claimed that Meta is now responsible for nearly a third of Anthropic's $30B ARR, based on reporting that Meta used 60.2T tokens over 30 days. This doesn't actually pencil out. It assumes all of Meta's spend is on output tokens, which are much more expensive than input tokens, and which would be extremely unusual — according to OpenRouter data, about 98.9% of all Opus 4.6 tokens are input. The actual number is *maybe* closer to $136M a month, or $1.6B a year. Tyler Cosgrove goes through the math:

A viral tweet claimed that Meta is now responsible for nearly a third of Anthropic's $30B ARR, based on reporting that Meta used 60.2T tokens over 30 days. This doesn't actually pencil out. It assumes all of Meta's spend is on output tokens, which are much more expensive than input tokens, and which would be extremely unusual — according to OpenRouter data, about 98.9% of all Opus 4.6 tokens are input. The actual number is maybe closer to $136M a month, or $1.6B a year. Tyler Cosgrove goes through the math:

TBPN

254,587 görüntüleme • 2 ay önce

This is very likely an Anthropic model. It has the exact same context 200k window as Sonnet, its SVG output styles are literally the same as Claude. Its tokens per second are not fast enough to be Haiku. (Could be intentionally slow)

This is very likely an Anthropic model. It has the exact same context 200k window as Sonnet, its SVG output styles are literally the same as Claude. Its tokens per second are not fast enough to be Haiku. (Could be intentionally slow)

JB

173,951 görüntüleme • 4 ay önce

Researchers found a way to make LLMs 8.5x faster! (without compromising accuracy) Speculative decoding is quite an effective way to address the single-token bottleneck in traditional LLM inference. A small "draft" model first generates the next several tokens, then the large model verifies all of them at once in a single forward pass. If a token at any position is wrong, you keep everything before it and restart from there. This never does worse than normal decoding. But current drafters in Speculative decoding still guess one token at a time. That makes the drafting step itself a bottleneck, capping real-world speedups at 2-3x. DFlash is a new technique that swaps the autoregressive drafter with a lightweight block diffusion model that guesses all tokens in one parallel shot. Drafting cost stays flat no matter how many tokens you speculate. On top of that, the drafter is conditioned on hidden features pulled from multiple layers of the target model and injected into every draft layer, so it makes significantly better guesses than a drafter working from scratch. In the side-by-side demo below, vanilla decoding runs at 48.5 tokens/sec. DFlash hits 415 tokens/sec on the same model, with zero quality loss. It's already integrated with vLLM, SGLang, and Transformers, with draft models on HuggingFace for several models like Qwen3, Qwen3.5, Llama 3.1, Kimi-K2.5, gpt-oss, and many more. I have shared the GitHub repo in the replies! KV caching is another must-know technique to boost LLM inference. I recently wrote an article about it. Read it below. 👉 Over to you: What use case are you working on that can benefit from this new technique?

Researchers found a way to make LLMs 8.5x faster! (without compromising accuracy) Speculative decoding is quite an effective way to address the single-token bottleneck in traditional LLM inference. A small "draft" model first generates the next several tokens, then the large model verifies all of them at once in a single forward pass. If a token at any position is wrong, you keep everything before it and restart from there. This never does worse than normal decoding. But current drafters in Speculative decoding still guess one token at a time. That makes the drafting step itself a bottleneck, capping real-world speedups at 2-3x. DFlash is a new technique that swaps the autoregressive drafter with a lightweight block diffusion model that guesses all tokens in one parallel shot. Drafting cost stays flat no matter how many tokens you speculate. On top of that, the drafter is conditioned on hidden features pulled from multiple layers of the target model and injected into every draft layer, so it makes significantly better guesses than a drafter working from scratch. In the side-by-side demo below, vanilla decoding runs at 48.5 tokens/sec. DFlash hits 415 tokens/sec on the same model, with zero quality loss. It's already integrated with vLLM, SGLang, and Transformers, with draft models on HuggingFace for several models like Qwen3, Qwen3.5, Llama 3.1, Kimi-K2.5, gpt-oss, and many more. I have shared the GitHub repo in the replies! KV caching is another must-know technique to boost LLM inference. I recently wrote an article about it. Read it below. 👉 Over to you: What use case are you working on that can benefit from this new technique?

Avi Chawla

157,390 görüntüleme • 1 ay önce

Peter Steinberger Quietly Started A Shift That Makes Claude 35x Cheaper To Run The way Claude Code talks to tools matters more than people think. Most use MCP, and it quietly eats tokens, it loads everything into context every time, even tools you never call. One benchmark: MCP burned 35x more tokens than a CLI on the same task. Peter Steinberger, the OpenClaw guy, got annoyed by this and started building lean CLIs himself. That kicked off a tool called Printing Press. You point Claude Code at any website, even ones with no API like ESPN or Craigslist, and it builds a small command tool for it in about 10 minutes. In one demo, a request pulled 132,000 tokens of raw data, but the tool processed it locally and only handed Claude a 2,000-token summary. The rest never touched the context window. It also comes with ~50 ready-made tools you can grab right away. To start, point Claude Code at the links from and ask it to set it up. Bookmark this.

Peter Steinberger Quietly Started A Shift That Makes Claude 35x Cheaper To Run The way Claude Code talks to tools matters more than people think. Most use MCP, and it quietly eats tokens, it loads everything into context every time, even tools you never call. One benchmark: MCP burned 35x more tokens than a CLI on the same task. Peter Steinberger, the OpenClaw guy, got annoyed by this and started building lean CLIs himself. That kicked off a tool called Printing Press. You point Claude Code at any website, even ones with no API like ESPN or Craigslist, and it builds a small command tool for it in about 10 minutes. In one demo, a request pulled 132,000 tokens of raw data, but the tool processed it locally and only handed Claude a 2,000-token summary. The rest never touched the context window. It also comes with ~50 ready-made tools you can grab right away. To start, point Claude Code at the links from and ask it to set it up. Bookmark this.

Ridark

32,351 görüntüleme • 5 gün önce

$THIS GUY USED OPUS 4.8 + KIMI K2.6 TO CUT HIS CODING BILL FROM $4,000 TO $700/MO. WITH KIMI RUNNING A 300-AGENT CODING FLOOR, HE STOPPED PAYING CLAUDE TO DO EVERYTHING kimi does the cheap heavy lifting. it can spin up hundreds of agents, push through thousands of steps, write the rough code, expand files, draft tests and handle the repetitive work that burns the most tokens opus 4.8 only comes in where the money is worth spending. first to plan the spec and define the rules, then again to tear apart the output, catch weak logic and flag the bugs that a fast swarm can miss that is what changed the economics. kimi handled the bulk of the volume for a fraction of the price, while opus stayed in the loop as the architect and the reviewer instead of the full-time builder most people still use one model for every step and wonder why their costs explode. this guy split the jobs properly. kimi runs wide and cheap. opus goes deep and skeptical. one builds fast, the other makes sure it should ship the real edge is not finding one perfect model. it is knowing where expensive intelligence actually matters and where cheap parallel output is already enough. that is how a $4,000 workflow turns into a $700 system$

THIS GUY USED OPUS 4.8 + KIMI K2.6 TO CUT HIS CODING BILL FROM $4,000 TO $700/MO. WITH KIMI RUNNING A 300-AGENT CODING FLOOR, HE STOPPED PAYING CLAUDE TO DO EVERYTHING kimi does the cheap heavy lifting. it can spin up hundreds of agents, push through thousands of steps, write the rough code, expand files, draft tests and handle the repetitive work that burns the most tokens opus 4.8 only comes in where the money is worth spending. first to plan the spec and define the rules, then again to tear apart the output, catch weak logic and flag the bugs that a fast swarm can miss that is what changed the economics. kimi handled the bulk of the volume for a fraction of the price, while opus stayed in the loop as the architect and the reviewer instead of the full-time builder most people still use one model for every step and wonder why their costs explode. this guy split the jobs properly. kimi runs wide and cheap. opus goes deep and skeptical. one builds fast, the other makes sure it should ship the real edge is not finding one perfect model. it is knowing where expensive intelligence actually matters and where cheap parallel output is already enough. that is how a $4,000 workflow turns into a $700 system

Gipp 🦅

16,989 görüntüleme • 27 gün önce

Anthropic's in trouble, again. The entire Claude experience is now available at 1/6th the price. Kimi now does everything Claude does, powered by K2.6, a 1-trillion-parameter MoE model that activates only 32B parameters per token. It covers all three features Claude has (Chat, Code, and Cowork): 1) Kimi Chat runs in four modes - Instant for fast responses - Thinking for deep reasoning - Agent for multi-step execution - and Agent Swarm for parallel workloads. There's a 262K context window across all of them. 2) Kimi Code is the open-source CLI coding agent with K2.6 as the default backend. K2.6 ranked #1 on OpenRouter's programming leaderboard by weekly usage. 3) Kimi Agent is the Cowork equivalent. It generates: - full websites with database and auth - presentation decks (editable PPTX output) - spreadsheets with formulas and charts - word docs and structured research reports. On top of this, Kimi K2.6 is also trained to decompose tasks into up to 300 parallel sub-agents. This helps it retain coherence even across 4,000+ tool calls in a single run, with sessions sustaining up to 13 hours. On SWE-Bench Pro: - Kimi K2.6 → 58.6 - GPT-5.4 xhigh → 57.7 - Gemini 3.1 Pro → 54.2 - Claude Opus 4.6 → 53.4 Kimi K2.6 model is open weights and self-hostable on 4x H100s in INT4. Find the link to the HuggingFace model page in the replies!

Anthropic's in trouble, again. The entire Claude experience is now available at 1/6th the price. Kimi now does everything Claude does, powered by K2.6, a 1-trillion-parameter MoE model that activates only 32B parameters per token. It covers all three features Claude has (Chat, Code, and Cowork): 1) Kimi Chat runs in four modes - Instant for fast responses - Thinking for deep reasoning - Agent for multi-step execution - and Agent Swarm for parallel workloads. There's a 262K context window across all of them. 2) Kimi Code is the open-source CLI coding agent with K2.6 as the default backend. K2.6 ranked #1 on OpenRouter's programming leaderboard by weekly usage. 3) Kimi Agent is the Cowork equivalent. It generates: - full websites with database and auth - presentation decks (editable PPTX output) - spreadsheets with formulas and charts - word docs and structured research reports. On top of this, Kimi K2.6 is also trained to decompose tasks into up to 300 parallel sub-agents. This helps it retain coherence even across 4,000+ tool calls in a single run, with sessions sustaining up to 13 hours. On SWE-Bench Pro: - Kimi K2.6 → 58.6 - GPT-5.4 xhigh → 57.7 - Gemini 3.1 Pro → 54.2 - Claude Opus 4.6 → 53.4 Kimi K2.6 model is open weights and self-hostable on 4x H100s in INT4. Find the link to the HuggingFace model page in the replies!

Avi Chawla

109,069 görüntüleme • 1 ay önce

A tricky LLM interview question: You're serving a reasoning model on vLLM, and it keeps running out of GPU memory on long traces. So you add KV cache compression and evict 90% of the cached tokens. VRAM usage stays as is and GPU still runs out of memory. Why? (answer below) Evicting 90% of the KV cache can free almost none of the memory it was using. This sounds counterintuitive, but it follows directly from how production servers store the cache today. The KV cache grows with every token a model generates. Each token appends its key and value vectors across every layer, and nothing is freed while generation continues. This is the dominant memory cost for reasoning models. If a 32K-token CoT caches ~32K tokens of KV vectors, a Qwen3-32B with 4-bit weights will run out-of-memory around 24K tokens on a 24GB GPU. One obvious solution is to keep the important tokens and drop the rest, since attention is sparse enough to allow it. But this does not solve the memory problem yet. The reason is paged attention, which is the memory manager behind vLLM and most production servers. Under the hood, it splits GPU memory into fixed physical blocks, each one holds the KV for about 16 tokens. This block returns to the allocator only when every slot inside it is empty. Since the eviction logic selects tokens by importance, and such tokens are scattered across blocks... ...so despite eviction, almost every block is left with at least some survivor tokens. For instance, if the logic evicts 14k of 16k tokens across 1,000 blocks, most likely every block will still have a token. This means the allocator frees almost nothing. Placing the new tokens into those freed slots is not ideal because it breaks the cache's layout. Say token 16,001 arrives, and it's placed in the slot the 40th token used to hold. The cache now reads position 38, then 16,001, then 41, so the cache is no longer in token order. Attention can still compute the right answer from that, but only if every slot now carries a separate note recording which position it actually holds. This introduces another bookkeeping cost that an in-order layout inherently avoids. So the cache is logically 90% smaller and still physically the same size. Many compression results miss this because they measure on pre-allocated contiguous tensors rather than a paged server. There's another problem. Eviction methods pick which tokens to keep by looking at the attention scores themselves (as expected). But fast attention kernels used in production, like FlashAttention, never save those scores. They compute attention in small pieces and throw the full score grid away as they go, which is also why they're fast. So the exact signal eviction methods need isn't available in memory. The workaround is to fall back to eager attention and build the full matrix, which gives up the speed FlashAttention was there to provide. NVIDIA published a method called TriAttention to solve both these problems. It never needs attention scores. Instead, it scores tokens from the geometry of the model's key and query vectors before RoPE is applied, where those vectors sit in stable clusters. For the memory problem, it runs a compaction pass every 128 decoded tokens. The surviving tokens slide forward to close the holes eviction creates, so whole blocks empty out and return to the allocator while the cache stays in token order. On long reasoning traces, the approach matches full-attention accuracy while decoding 2.5x faster and using 10.7x less KV memory. KV cache compression is a big infrastructure problem. The number that decides whether it works is the count of freed blocks, not the count of evicted tokens. You can find the NVIDIA write-up here: I wrote a first-principles breakdown of how the KV cache works. It walks through why the model stores keys and values at all, why the cache grows with every token, and a comparison of LLM generation speed with and without KV caching. Read it below.

A tricky LLM interview question: You're serving a reasoning model on vLLM, and it keeps running out of GPU memory on long traces. So you add KV cache compression and evict 90% of the cached tokens. VRAM usage stays as is and GPU still runs out of memory. Why? (answer below) Evicting 90% of the KV cache can free almost none of the memory it was using. This sounds counterintuitive, but it follows directly from how production servers store the cache today. The KV cache grows with every token a model generates. Each token appends its key and value vectors across every layer, and nothing is freed while generation continues. This is the dominant memory cost for reasoning models. If a 32K-token CoT caches ~32K tokens of KV vectors, a Qwen3-32B with 4-bit weights will run out-of-memory around 24K tokens on a 24GB GPU. One obvious solution is to keep the important tokens and drop the rest, since attention is sparse enough to allow it. But this does not solve the memory problem yet. The reason is paged attention, which is the memory manager behind vLLM and most production servers. Under the hood, it splits GPU memory into fixed physical blocks, each one holds the KV for about 16 tokens. This block returns to the allocator only when every slot inside it is empty. Since the eviction logic selects tokens by importance, and such tokens are scattered across blocks... ...so despite eviction, almost every block is left with at least some survivor tokens. For instance, if the logic evicts 14k of 16k tokens across 1,000 blocks, most likely every block will still have a token. This means the allocator frees almost nothing. Placing the new tokens into those freed slots is not ideal because it breaks the cache's layout. Say token 16,001 arrives, and it's placed in the slot the 40th token used to hold. The cache now reads position 38, then 16,001, then 41, so the cache is no longer in token order. Attention can still compute the right answer from that, but only if every slot now carries a separate note recording which position it actually holds. This introduces another bookkeeping cost that an in-order layout inherently avoids. So the cache is logically 90% smaller and still physically the same size. Many compression results miss this because they measure on pre-allocated contiguous tensors rather than a paged server. There's another problem. Eviction methods pick which tokens to keep by looking at the attention scores themselves (as expected). But fast attention kernels used in production, like FlashAttention, never save those scores. They compute attention in small pieces and throw the full score grid away as they go, which is also why they're fast. So the exact signal eviction methods need isn't available in memory. The workaround is to fall back to eager attention and build the full matrix, which gives up the speed FlashAttention was there to provide. NVIDIA published a method called TriAttention to solve both these problems. It never needs attention scores. Instead, it scores tokens from the geometry of the model's key and query vectors before RoPE is applied, where those vectors sit in stable clusters. For the memory problem, it runs a compaction pass every 128 decoded tokens. The surviving tokens slide forward to close the holes eviction creates, so whole blocks empty out and return to the allocator while the cache stays in token order. On long reasoning traces, the approach matches full-attention accuracy while decoding 2.5x faster and using 10.7x less KV memory. KV cache compression is a big infrastructure problem. The number that decides whether it works is the count of freed blocks, not the count of evicted tokens. You can find the NVIDIA write-up here: I wrote a first-principles breakdown of how the KV cache works. It walks through why the model stores keys and values at all, why the cache grows with every token, and a comparison of LLM generation speed with and without KV caching. Read it below.

Avi Chawla

260,842 görüntüleme • 5 gün önce

Struggling with Claude Code? Here's what Boris Cherny, the creator of Claude Code, recommends almost every time: 1) Use Opus 4.5 with thinking - It's smarter, uses fewer tokens, often ends up cheaper than smaller models. 2) Invest in your ClaudeMD - Just a text file. No special format. - Add every mistake Claude makes so it doesn't repeat them. 3) Give Claude a way to verify its output - Run tests. - Start a server. - Let it see the browser. A painter wearing a blindfolded won't paint well. Neither will an LLM that can't check its own work. Once the plan is good, the code is good.

Struggling with Claude Code? Here's what Boris Cherny, the creator of Claude Code, recommends almost every time: 1) Use Opus 4.5 with thinking - It's smarter, uses fewer tokens, often ends up cheaper than smaller models. 2) Invest in your ClaudeMD - Just a text file. No special format. - Add every mistake Claude makes so it doesn't repeat them. 3) Give Claude a way to verify its output - Run tests. - Start a server. - Let it see the browser. A painter wearing a blindfolded won't paint well. Neither will an LLM that can't check its own work. Once the plan is good, the code is good.

The Startup Ideas Podcast (SIP) 🧃

159,534 görüntüleme • 5 ay önce

Llama 3 70B on Groq is the first time in human history that we can generate and run programs in realtime!! It just converted a 130 line C++ Leetcode solution into Python in 3.5 seconds. That's ~1000 input tokens and ~1000 output tokens. And it's currently FREE.

Llama 3 70B on Groq is the first time in human history that we can generate and run programs in realtime!! It just converted a 130 line C++ Leetcode solution into Python in 3.5 seconds. That's ~1000 input tokens and ~1000 output tokens. And it's currently FREE.

Deedy

200,642 görüntüleme • 2 yıl önce

How good is GPT-4-Vision at extracting text from images? I wanted to find the limit - but I found weirdness instead Most surprising: GPT-4V performance varies depending on the *structure* of text it sees Let me explain A set of images with progressively more text was presented to GPT-4-Vision. GPT-4V was asked what text it saw in the image. The response from the model was compared against the image’s original text and scored for similarity. The model was tested with 4 types of text: essay, random words, random tokens, and random characters. Findings: * Performance degrades - Yes, the models are good at basic OCR, but as you get more text and words then performance drops (this is expected) * Type of context matters - You should expect different recall on your texts based on your context types * Hallucination Errors - I thought that the model would make errors of omission (it wouldn’t return all the words). But instead the model mostly made hallucination errors - it replaced words with made up words. * Evals Matter - This test in isolation doesn’t mean that your data will have the same results, but it should motivate you to create eval tests for your data and anticipate errors which are hard to spot Notes: * Next step would be to add additional image types like tables or PDFs * GPT-4V would routinely get stuck in repeat-token-loops when trying to extract random tokens * GPT-4V would refuse to answer most random character images

How good is GPT-4-Vision at extracting text from images? I wanted to find the limit - but I found weirdness instead Most surprising: GPT-4V performance varies depending on the structure of text it sees Let me explain A set of images with progressively more text was presented to GPT-4-Vision. GPT-4V was asked what text it saw in the image. The response from the model was compared against the image’s original text and scored for similarity. The model was tested with 4 types of text: essay, random words, random tokens, and random characters. Findings: * Performance degrades - Yes, the models are good at basic OCR, but as you get more text and words then performance drops (this is expected) * Type of context matters - You should expect different recall on your texts based on your context types * Hallucination Errors - I thought that the model would make errors of omission (it wouldn’t return all the words). But instead the model mostly made hallucination errors - it replaced words with made up words. * Evals Matter - This test in isolation doesn’t mean that your data will have the same results, but it should motivate you to create eval tests for your data and anticipate errors which are hard to spot Notes: * Next step would be to add additional image types like tables or PDFs * GPT-4V would routinely get stuck in repeat-token-loops when trying to extract random tokens * GPT-4V would refuse to answer most random character images

Greg Kamradt

49,109 görüntüleme • 2 yıl önce

We asked Logan Kilpatrick what stood out most from Google I/O. His answer: Gemini Omni, Google's new multimodal AI model. "The model can take in any input and produce any output. Text, audio, video, image." "You get a bunch of really interesting capability transfer when you bring it all into a single model." Right now the killer use case is video editing. "It's like having a VFX studio on demand."

We asked Logan Kilpatrick what stood out most from Google I/O. His answer: Gemini Omni, Google's new multimodal AI model. "The model can take in any input and produce any output. Text, audio, video, image." "You get a bunch of really interesting capability transfer when you bring it all into a single model." Right now the killer use case is video editing. "It's like having a VFX studio on demand."

MTS

18,884 görüntüleme • 1 ay önce

$I just compared Claude Code vs Codex vs Cursor CLI The task was to build a Next.js app with Tailwind 4 and shadcn components to collect customer feedback and showcase it with a widget. I gave all three the same prompt and let them go for 30 minutes to see what they came up with. Claude Code with Opus 4.1 Even though I told it to set up the app in the existing project folder, it tried to create a directory for it. After I interrupted and told it not to do that, it built a demo form and landing page with no errors. I had to ask it to make the demo interactive so users could submit a testimonial and preview it. The landing page looked like AI and was pretty basic, but it worked and it was done in a fraction of the time of the others. Total tokens used: 33k Codex with GPT-5 At the end of the 30 minutes I just could not get Codex to produce a working app. It got stuck in a loop of not being able to set up Tailwind 4 and despite many, MANY, attempts, I ended up with a "failed to compile" error. Total tokens used: 102k Cursor Agent with GPT-5 This was the slowest agent by far and a couple of times I actually thought it got stuck in a loop and was close to Ctrl+C'ing to cancel it. The TUI is really nice though, especially how it shows diffs and it did eventually build a working app (after one or two slight errors that needed fixing) The demo was interactive and it had a very minimal design that looked bare but also a lot less like an "AI generated" app than the Opus 4.1 design. It also wasn't too chatty and just did what it needed to do! Code quality was on a par with Opus 4.1, but it did use 5.5x as many tokens to get there. Still cheaper than Opus on a direct comparison but not when you factor in a Claude Code Max subscription. Total tokens: 188k I'll be able to do a proper comparison and record some videos when I'm back from holiday but for now, Opus is still the more capable model out of the box and Claude Code is the more complete CLI product. It will be interesting to see how Cursor evolve their CLI though with commands and subagents because I think with GPT-5 they have a real shot at providing competition for Claude Code if they can optimise output to get similar quality with less tokens. Jump to 0:40 in the video to see the two apps. Which do you think is which? ;)$

I just compared Claude Code vs Codex vs Cursor CLI The task was to build a Next.js app with Tailwind 4 and shadcn components to collect customer feedback and showcase it with a widget. I gave all three the same prompt and let them go for 30 minutes to see what they came up with. Claude Code with Opus 4.1 Even though I told it to set up the app in the existing project folder, it tried to create a directory for it. After I interrupted and told it not to do that, it built a demo form and landing page with no errors. I had to ask it to make the demo interactive so users could submit a testimonial and preview it. The landing page looked like AI and was pretty basic, but it worked and it was done in a fraction of the time of the others. Total tokens used: 33k Codex with GPT-5 At the end of the 30 minutes I just could not get Codex to produce a working app. It got stuck in a loop of not being able to set up Tailwind 4 and despite many, MANY, attempts, I ended up with a "failed to compile" error. Total tokens used: 102k Cursor Agent with GPT-5 This was the slowest agent by far and a couple of times I actually thought it got stuck in a loop and was close to Ctrl+C'ing to cancel it. The TUI is really nice though, especially how it shows diffs and it did eventually build a working app (after one or two slight errors that needed fixing) The demo was interactive and it had a very minimal design that looked bare but also a lot less like an "AI generated" app than the Opus 4.1 design. It also wasn't too chatty and just did what it needed to do! Code quality was on a par with Opus 4.1, but it did use 5.5x as many tokens to get there. Still cheaper than Opus on a direct comparison but not when you factor in a Claude Code Max subscription. Total tokens: 188k I'll be able to do a proper comparison and record some videos when I'm back from holiday but for now, Opus is still the more capable model out of the box and Claude Code is the more complete CLI product. It will be interesting to see how Cursor evolve their CLI though with commands and subagents because I think with GPT-5 they have a real shot at providing competition for Claude Code if they can optimise output to get similar quality with less tokens. Jump to 0:40 in the video to see the two apps. Which do you think is which? ;)

Ian Nuttall

194,949 görüntüleme • 10 ay önce

I asked for a 3D spinning 'G' Google logo. Gemini 3 Flash was 3x faster, used 50% fewer tokens and gave much better results. Gemini 3 Flash: 6.6k tokens, 38s Gemini 2.5 Pro: 13k tokens, 108s You can use this AI Studio app to do your own comparisons:

fofr

46,503 görüntüleme • 6 ay önce

> 20 free daily tokens covers a full working session on GPT 5.5 or GLM > Opus 4.7 handles 2-3 serious architecture tasks per day at zero cost > one GitHub account can link to multiple Tembo accounts, the math on that is obvious > 5 minutes to sign up. 10 minutes to your first merged PR > no clarifying questions, prompts have to be precise, the agent just executes > the $100/month tools and the $0 tool are now producing the same output on most tasks and most developers haven't noticed yet this is not a discount → this is the same models → the same → capability → the same output → for nothing!

> 20 free daily tokens covers a full working session on GPT 5.5 or GLM > Opus 4.7 handles 2-3 serious architecture tasks per day at zero cost > one GitHub account can link to multiple Tembo accounts, the math on that is obvious > 5 minutes to sign up. 10 minutes to your first merged PR > no clarifying questions, prompts have to be precise, the agent just executes > the $100/month tools and the $0 tool are now producing the same output on most tasks and most developers haven't noticed yet this is not a discount → this is the same models → the same → capability → the same output → for nothing!

Shadow Nick

26,539 görüntüleme • 1 ay önce