DeepSeek-V4 dropped. 1M context. 10x smaller KV cache. First open model where the context window and the agentic post-training meet.

Ben Burtenshaw

7,563 subscribers

49,900 views • 3 months ago • via X (Twitter)

Related videos

DeepSeek V4 ships with two variants allowing it to hit 1M Context. Kimbo @ ICML26 breaks down the attention changes and the Mega MOE speeding up compute.

SemiAnalysis

39,693 views • 26 days ago

Qwen just quietly dropped the most dangerous open model of 2026 Qwen3.6-Plus. native 1M context. agentic workflows. multimodal. half the size of competitors. and it's free on OpenRouter right now. you're not ready for this breakdown 🧵

Farhan

53,820 views • 4 months ago

THIS MIGHT BE THE OPPENHEIMER MOMENT FOR OPEN-SOURCE AI. Kimi K3 packs 2.8T parameters, a 1M context window, native multimodality, and serious agentic coding performance. made with kimi k3

Farhan

14,099 views • 16 days ago

Chinese labs aren't just dropping model weights anymore, they're building full ecosystems. Z.ai's ZCode + GLM 5.2 = native IDE, 1M context, agentic workflows, and now dSpark speculative decoding. The open model platform war is here. 👇

Boxmining

14,831 views • 1 month ago

Qwen 3.6 Plus Preview just dropped on OpenRouter for Free.
> 1M token context.
> $0 input. $0 output.
Meanwhile:
- Claude Opus 4.6 → $5 / $25 per 1M
- GPT-5.4 → charges for 1M context
Qwen just nuked the pricing model. Open source isn’t slowing down. It’s taking over.

WorldofAI

146,277 views • 4 months ago

Chinese labs are now competing with each other to build the strongest open-weight model.
> DeepSeek started the domino effect
> Qwen kept improving across coding, reasoning and long context
> GLM, MiniMax and Moonshot joined the race
> Kimi K3 pushed it to 2.8T parameters and a 1M context window
> now Alibaba is answering with Qwen3.8
> 2.4T parameters
> full open-weight release coming soon
> Alibaba says it is second only to Fable 5
Qwen3.8 may be the model that brings open-weight AI closest to Claude Fable 5.

Tornado guy

64,960 views • 14 days ago

I have been testing DeepSeek-V4-Pro with the Pi coding agent. I am mind-blown by how well it works out of the box. A few notes:
I spent a few hours building an LLM wiki with an agent powered entirely by DeepSeek-V4-Pro on Fireworks AI inference. This is the first time I feel like there is an open-weight model that can reason at the level of Claude and Codex. And it does this in a cost-effective way with support for 1M context length.
To be clear, I am using DeepSeek-V4-Pro inside of Pi without any special configuration. It works out of the box. It's exciting that there is a model that can just be plugged into a basic harness like Pi, and it just works. I've never seen that before. Most models require lots of configuration and setup.
DeepSeek's DeepSeek-V4-Pro is clearly good at agentic coding (probably the best of the open-weight models), but the model is also great on knowledge-intensive tasks where reasoning matters. The agent pulled agentic engineering best practices from different company docs (Anthropic, OpenAI, Google, Stripe, Meta, Modal, DeepSeek, Mistral, Cohere), searched and digested Reddit and HN threads, summarized arxiv papers, and surfaced trending GitHub repos. Then it distilled everything into actionable tips across categories.
I love the Wiki it built. The quality is really good. Here is a snapshot of what the wiki looks like:
DeepSeek-V4-Pro handled the task without breaking stride. Multi-step research queries, code generation for scaffolding, context-heavy reasoning across disparate sources. For coding specifically, this is the first open-weight model that genuinely feels like a Codex or Claude Code experience. It is comparable in capability and in actual multi-turn agentic work.
What made the loop feel so responsive was Fireworks' inference speed (the fastest in the market) and the fact that they actually validate models at the systems level before shipping. No corrupted reasoning traces. Just fast, reliable iteration.
The hybrid CSA and HCA attention design cuts KV cache to just 10% and inference FLOPs by nearly 4x at 1M-token context. This is what makes the agent loop actually fast and cheap enough to run in practice.
For devs who've been watching open-weight models close the gap but haven't found one that actually delivers in practice, this is the closest I've seen. Try it here:

elvis

59,803 views • 3 months ago

CHINA JUST DROPPED AN AI CODING MODEL WITH A 1M CONTEXT WINDOW. And I connected it to Claude Code to see what it could actually do.
Meet GLM-X Preview
On paper, a few things immediately stood out:
→ 1M context window
→ Agentic coding capabilities
→ Works inside Claude Code
→ Designed for large-scale coding and reasoning workflows
But specs don't matter much if the model can't deliver in practice. So I gave it a real-world task.
THE TEST
One prompt:
> Build a modern AI lead generation dashboard using React and Tailwind CSS.
Requirements:
→ Dark mode
→ Analytics dashboard
→ Lead table
→ Email outreach section
→ Responsive design
→ Production-ready component structure
Instead of generating a few snippets, it planned the architecture, generated the dashboard components, created the Tailwind configuration, and walked through the implementation requirements.
What impressed me most wasn't the code itself. It was how well it maintained context throughout the workflow. That's where a 1M context window starts becoming useful. Less time re-explaining requirements. Less context loss. More room for complex projects.
The AI coding race is getting very interesting. And it's no longer just GPT, Claude, and Gemini competing for attention. Results from my test below 👇

Md Riyazuddin

31,199 views • 1 month ago

OpenAI recently released its first open-weights model since GPT-2, entering a field led by DeepSeek and Alibaba's Qwen. Ankit breaks down these top OSS models, including what sets them apart under the hood: mixture-of-experts, long-context training, and post-training techniques that shape reasoning and alignment, and how different design choices lead to surprisingly similar performance.
00:00 – OpenAI OSS Launch
01:00 – Comparing Open Source LLM Architectures
01:46 – GPT OSS Overview
02:37 – Under The Hood of GPT OSS
03:25 – Qwen-3 Architecture
04:17 – Qwen-3 Training
05:12 – Qwen-3 Post-Training
06:08 – Qwen-3 Reasoning & RL Innovations
06:52 – DeepSeek V3 Overview
07:40 – DeepSeek V3.1 Updates
08:39 – Attention Mechanism (MLA)
09:39 – Comparing Model Sizes
10:35 – Long Context Strategies
11:25 – Reflections on Methods
12:00 – Takeaways

Y Combinator

208,680 views • 11 months ago

I just crammed the updated Gemma 4 26B A4B QAT (MoE) with 180k context into an 8GB RTX 4060 (8 GB VRAM + 16 GB RAM only!!) and optimized the batch size. 23 tokens/sec decode, 300 tokens/sec prefill
Yesterday I showed you a Gemma 4 31B dense model running flawlessly on an RTX 4090. Today, we're breaking the VRAM bank on a budget card using Unsloth’s new Gemma 4 26B (A4B) QAT quants. Following Google’s chat template update that boosted agentic benchmarks by +10%, I pushed this model to its absolute limits. Here is how you squeeze 250k context out of 8GB of VRAM.
# The Setup & The Optimization
- Hardware: Nvidia RTX 4060 (8GB VRAM) + 16GB System RAM
- Environment: CUDA 13.0 build of llama.cpp
- Model: gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf
- Prompt: 28,000 tokens of prompt for each run
If you read my L2 cache breakdown (attached in replies), you know the 4060’s 24MB cache maxes out at `-b 1024 -ub 1024`. Push past that, and prefill crashes. I locked those flags in for every test below to ensure maximum GEMM throughput.
# 1. The Raw Context Push (Unquantized KV Cache)
First, I wanted to see how far pure 8GB VRAM + 16GB RAM could stretch without touching the KV cache:
- 80k Context: Prefill 385 t/s | Decode 25.5 t/s
- 120k Context: Prefill 270 t/s | Decode 24 t/s
llama.cpp flags:
.\llama-server -m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf -c 120000 --port 8080 -ub 1024 -b 1024
Without KV quantization, 120k is your hard ceiling. Push past that and prefill throughput drops off a cliff, making the model practically unusable for large agentic workloads.
# 2. The Q8 KV Cache Lifeline
To survive 250k context on a budget card, you have to quantize the KV cache. I enabled 8-bit KV cache (`-ctk q8_0 -ctv q8_0`) and re-ran:
- 180k Context: Prefill 280 t/s | Decode 22.8 t/s
- 250k Context: Prefill 115 t/s | Decode 20 t/s
llama.cpp flags:
.\llama-server -m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf -c 180000 --port 8080 -b 1024 -ub 1024 -ctk q8_0 -ctv q8_0
Result: Q8 KV cache brings 250k context back from the dead. Decode speed stabilizes at a highly usable 20 t/s. You are trading a very small amount of reasoning precision for an extra 130,000 tokens of context window.
If you own a single RTX 3050, 3060, 3070, 4050, 4060, 5050 or 5060, you must try this model and optimize your batch size for higher prefill. Hugging Face links to the updated Unsloth's QAT quants and performance graph are in the replies below.
What model are you running on your 6GB, 8GB or 12GB cards right now? Let's see your setups.

Alok

36,389 views • 10 days ago
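
The trade-off in the post above (an f16 KV cache versus `-ctk q8_0 -ctv q8_0`) can be approximated with simple arithmetic. The sketch below is a rough estimate only: the layer count, KV head count, and head dimension are hypothetical placeholders, not Gemma 4's real configuration, and it ignores any architecture-specific savings. The one format detail it relies on is that llama.cpp's `q8_0` stores each block of 32 values as 32 int8 values plus one fp16 scale, so 34 bytes per 32 values.

```python
F16_BYTES_PER_VALUE = 2.0
Q8_0_BYTES_PER_VALUE = 34 / 32  # 32 int8 values + one fp16 scale per block


def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value):
    """Estimate KV cache size: one K and one V vector per token, layer and KV head."""
    values_per_token = 2 * n_layers * n_kv_heads * head_dim  # factor 2 = K and V
    return n_tokens * values_per_token * bytes_per_value


def to_gib(n_bytes):
    """Convert bytes to GiB for readability."""
    return n_bytes / (1024 ** 3)


if __name__ == "__main__":
    # Hypothetical model shape, chosen only to make the arithmetic concrete.
    shape = dict(n_layers=30, n_kv_heads=8, head_dim=128)
    for n_tokens in (120_000, 250_000):
        f16 = kv_cache_bytes(n_tokens, bytes_per_value=F16_BYTES_PER_VALUE, **shape)
        q8 = kv_cache_bytes(n_tokens, bytes_per_value=Q8_0_BYTES_PER_VALUE, **shape)
        print(f"{n_tokens} tokens: f16 {to_gib(f16):.2f} GiB, q8_0 {to_gib(q8):.2f} GiB")
```

Under these assumptions the quantized cache is a little over half the size of the f16 one for any model shape, which is consistent in direction with the post's jump from a 120k ceiling to 250k, though the measured limits also depend on weights, offloading, and compute buffers.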

MIT Researchers destroyed context window limits. 10m+ token prompts are now possible by moving context out of the model and into code environments. Full breakdown below.

Matthew Berman

287,626 views • 6 months ago

🎉 Congrats to MiniMax (official) on releasing MiniMax M3! Frontier coding and agentic capabilities, native image and video input, computer use, and a 1M-token context window, all in a single open model.
At the heart of M3 is MSA, a new sparse attention architecture: instead of attending densely over the full KV cache, each query scores 128-token KV blocks and runs attention only over the top blocks. That is what makes 1M-token context practical to serve.
M3 runs in vLLM with day-0 support, verified on NVIDIA and AMD hardware:
✨ MSA sparse attention with dedicated prefill and decode kernels
✨ 1M-token context serving with prefix caching and chunked prefill
✨ BF16 and MXFP8 checkpoints, with MoE backends for both Hopper and Blackwell
✨ Native multimodal input (image + video)
✨ Tool calling, reasoning parsing, and thinking-mode control for agent workloads
Day-0 support like this is a true team effort. Grateful to the teams at MiniMax (official), NVIDIA AI, AI at AMD, and Inferact, and to the vLLM community for making it happen. 🙏
Deep dive into the implementation, kernel work, and deployment recipes: 🔗

vLLM

40,306 views • 1 month ago
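
The block-selection idea described in the post above (score fixed-size KV blocks per query, then attend only over the top blocks) can be illustrated with a toy, single-query sketch in plain Python. This is not MiniMax's MSA or vLLM's kernel: the scoring rule (query dotted with each block's mean key), the function name, and the `top_blocks` parameter are assumptions made for illustration, and causal masking is ignored.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def block_sparse_attention(query, keys, values, block_size=128, top_blocks=2):
    """Toy top-k block attention for one query vector.

    Returns (output_vector, chosen_block_indices).
    """
    dim = len(query)
    n = len(keys)
    # Split token positions into contiguous blocks of block_size.
    blocks = [range(s, min(s + block_size, n)) for s in range(0, n, block_size)]

    # Assumed scoring rule: query . mean(key) per block.
    def block_score(block):
        mean_key = [sum(keys[i][d] for i in block) / len(block) for d in range(dim)]
        return dot(query, mean_key)

    ranked = sorted(range(len(blocks)), key=lambda b: block_score(blocks[b]), reverse=True)
    chosen = sorted(ranked[:top_blocks])
    positions = [i for b in chosen for i in blocks[b]]

    # Ordinary scaled dot-product softmax attention, restricted to chosen positions.
    scale = 1.0 / math.sqrt(dim)
    logits = [dot(query, keys[i]) * scale for i in positions]
    peak = max(logits)
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    out_dim = len(values[0])
    output = [
        sum(w * values[i][d] for w, i in zip(weights, positions)) / total
        for d in range(out_dim)
    ]
    return output, chosen
```

With `top_blocks` set to the total number of blocks this reduces to dense attention, so the sparsity comes entirely from how few blocks are kept per query.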

Qwen just published the 'Thinking' variant of this model 🔥 So you can run a model EVEN MORE powerful than GPT-4o locally!
- Still only 3B active parameters
- Open source license
- 256k context window extendable to 1M
- Strong in math, science and coding
Details below ↓

Paul Couvert

106,261 views • 1 year ago

This is Claude Sonnet 4.6: our most capable Sonnet model yet. It’s a full upgrade across coding, computer use, long-context reasoning, agent planning, knowledge work, and design. It also features a 1M token context window in beta.

Claude

7,598,093 views • 5 months ago

China's Qafind Labs just launched ChatDLM, described as the first Diffusion Language Model (DLM). 🔥 It will be open-sourced soon.
- Inference Speed: 2,800 tokens/sec (on A100), which is insanely fast.
- Context Window: 131,072 tokens.
More details 👇

AshutoshShrivastava

73,296 views • 1 year ago

Introducing Claude Opus 4.6. Our smartest model got an upgrade. Opus 4.6 plans more carefully, sustains agentic tasks for longer, operates reliably in massive codebases, and catches its own mistakes. It’s also our first Opus-class model with 1M token context in beta.

Claude

10,604,209 views • 5 months ago

i just beat Google DeepMind's turboquant
introducing Shard. 10x KV cache compression on Llama-3.1-8B. zero quality loss
- 10x @ 8K context, 11.2x @ 32K
- NIAH recall 1.000 across 4K-32K
- LongBench Δ ≈ 0 vs FP16
turboquant tops out at 4-6x at the same quality. we doubled it.
read more: Kirri

Krish

155,670 views • 2 months ago

We were super stoked to see Anthropic launch the Model Context Protocol (MCP) this week. 🫡 Today, we’re rolling out the first version of our Cloudflare MCP server with initial support for R2, D1, KV, and Workers.

Cloudflare Developers

122,993 views • 1 year ago

Introducing Claude Code Hook - Context Timeline (Saving this to try later)
Install with: npx claude-code-templates@latest --hook monitoring/context-timeline
Managing the context window and the subagents running in Claude Code is hard to keep track of. That's why I built this hook...
It starts the moment you open a session and shows a timeline with the main agent's context window and how subagents start working in their own separate context. Every subagent you have running will show up in real time.
This way you can manage the context and the subagents you run, and see everything in a much simpler way than in the console.

Daniel San

51,706 views • 3 months ago

Perplexity CEO on China catching up in AI: “Whatever you did to not let them catch up didn’t even matter. They ended up catching up anyway. What’s more dangerous is they have the best open-source model. And all the American developers are building on that.”
That was DeepSeek. Now just dropped GLM-5.2:
• MIT open weights
• 1M context
• 81.0 on Terminal-Bench 2.1
• within a few points of Claude Opus 4.8
The open-source AI race is not theory anymore.

Vadim

650,826 views • 1 month ago