MiniMax M3: Opus-level coding at DeepSeek pricing. On Terminal-Bench 2.1, it scores 66.0, only 0.1 behind Opus 4.7. I gave it a quick try on a few frontend tasks, and the output quality genuinely feels close to Opus 4.7.

25,648 views • 16 days ago • via X (Twitter)

Related videos

BREAKING: Anthropic just dropped Opus 4.8, and it is a MONSTER.

We've been testing it for about a week at Every 📧, and our verdict is they could've just called it Opus 5, it's that good.

Here's our vibe check:

- Beats GPT-5.5 on Senior Engineer bench. On our toughest benchmark Opus 4.8 scores a 63, a hair higher than GPT-5.5's score of 62 and a full 30 points higher than Opus 4.7. It tackled a ground-up rewrite of a production codebase, and actually built something that works. HOWEVER: Coding performance varied a lot at different reasoning levels. We recommend using it on xhigh for best results.
- Incredibly good writer. Opus 4.8 scored a 79.6 on our writing benchmark, which measures models on real-world writing tasks we do all of the time, like essay writing, promo email writing, and more. It beats GPT-5.5 by 6 points. It produces well-written prose with fewer "AI-isms". It's also very good at writing in your voice given the right context. HOWEVER: Writing performance also varied with reasoning levels. Medium reasoning had a higher incidence of AI-isms; we found best results with high.
- Beast at knowledge work. Opus 4.8 is very good at general knowledge work tasks like report creation, research, and more. It produced the best PowerPoint one-shot we've ever seen on our deck generation benchmark.
- Emotionally intelligent, willing to question the frame. I've also found it to be quite good at talking through psychological or interpersonal issues. It has a high EQ, and it's also good at not glazing and helping to expand your perspective. Its thought process feels extremely rich and dynamic.

THE BAD: These days a model is only as good as its harness, and Codex is still a far superior harness to the Claude Desktop app. This has kept me using Codex + GPT-5.5 as my daily driver, but I am flipping back and forth a lot more between Codex and Claude.

Anthropic is back baby!

Read the rest on Every 📧:

Dan Shipper 📧

350,783 views • 20 days ago

An OpenAI engineer stopped me at a hackathon in Hayes Valley

I had my terminal open on a table. Three panels. Live trades scrolling. He was walking past and froze.

"That's not a demo. That's a live scoring engine. What model is that"

I told him. Claude Opus 4.7. Four repos. $25 a month.

He pulled up a chair without asking.

"We benchmarked Opus 4.7 internally. It beat o3 on structured reasoning across every eval we ran. And you're telling me you're using it to trade"

I told him it does more than trade. It reads 86 million trades and finds who wins and why. No fine-tuning. No prompting chains. Just raw context.

He leaned back.

"Show me the data source"

I opened one link. 86 million trades. Every wallet. Every entry. Every exit.

"You point Opus 4.7 at this and it reverse-engineers the strategy. It finds the wallets that win. Then it finds why they win. Then it copies the pattern"

His team spent 14 months building something similar. 10 engineers. Custom infra. Still in staging.

"The part that killed us was exit timing. Every model we trained nailed entries. But the best traders exit before the crowd. We never figured out the threshold"

I told him my bot cuts at 85% of expected move. Or on a 3x volume spike. Whichever comes first.

He stopped talking.

"How did you find that"

Opus 4.7 found it in poly_data. Top wallets exit before resolution 86% of the time. Losers hold to 58%. Exits are the entire game.

I opened another tab.

"Three commands. 500 markets. Opus scores them in 20 minutes"

"That's our internal eval pipeline. Except it took us a year and you did it in a weekend with our competitor's model"

My setup:
Claude Opus 4.7 - $20/mo
VPS - $5/mo
poly_data - free
polymarket-cli - free

214 trades. 74% win rate. +$9,400 in 19 days.

Copytrade here:

I showed him the article where I broke down every repo and every command. He read it twice. Then looked up.

"You just published what we've been trying to ship for six months. Using the other team's model"

He texted me the next day.

"My manager found your thread. Delete it"

Too late.

Lunar

136,409 views • 2 months ago