Three Cline agents. Same 120B model. One rule: terminate your opponents' processes before they terminate yours. We tested raw inference across an NVIDIA AI PC DGX Spark, an RTX 4090, and cloud. Time-to-first-token decides the race. Throughput decides who ships. DGX Spark: 42.9 tok/s. RTX 4090: 8.7. Same weights....

Cline

60,435 subscribers

14,782 views • 3 months ago • via X (Twitter)
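
The post's distinction between time-to-first-token (TTFT) and throughput can be sketched with a simple latency model: the first token arrives after the TTFT, and the full response arrives after the TTFT plus the decode time. The throughput figures are the ones quoted in the post; the TTFT value and response length are hypothetical placeholders, since the post does not give them.

```python
def response_time_s(ttft_s: float, n_tokens: int, tok_per_s: float) -> float:
    """Seconds until the last token arrives: TTFT plus decode time."""
    if tok_per_s <= 0:
        raise ValueError("tok_per_s must be positive")
    return ttft_s + n_tokens / tok_per_s


# Throughput as quoted in the post (tokens per second).
THROUGHPUT = {"DGX Spark": 42.9, "RTX 4090": 8.7}

# Hypothetical values, not from the post.
ASSUMED_TTFT_S = 0.5
ASSUMED_RESPONSE_TOKENS = 500

for name, tps in THROUGHPUT.items():
    total = response_time_s(ASSUMED_TTFT_S, ASSUMED_RESPONSE_TOKENS, tps)
    print(f"{name}: first token at {ASSUMED_TTFT_S:.1f}s, done at {total:.1f}s")
```

With equal TTFT both agents start acting at the same moment, so in this model the throughput gap decides which one finishes first.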

Related videos

stealth model update: 5x more context
`code-supernova-1-million` (access via cline provider)
> 1 million token context window (up from 200k)
> same multi-modal support
> same free access during alpha
<the supernova is intensifying>

Cline

33,244 views • 10 months ago

1. Model agnostic. 2. Inference agnostic. & now, 3. Platform agnostic. Cline for JetBrains is here. (install it below)

Cline

72,935 views • 10 months ago

i spent $26,600 on cloud GPU rentals over 14 months before i found an NVIDIA DGX Spark at $2,999 (founder's edition) or $3,999 (shipping price)
it paid for itself in 6 weeks
i run 200B parameter models locally now and my old cloud provider keeps sending me loyalty discount emails
the math on that $26,600 is embarrassing to type out loud
$1,900/month for 14 months, H100 instances on a specialist cloud provider, because anything bigger than a 70B model simply would not fit anywhere else
i paid the invoices like they were a utility bill and told myself it was just the cost of doing serious AI work
it took me over a year to find out it wasn't
14 months, broken down:
→ months 1-4: $1,400-1,600/month - felt like manageable infrastructure overhead
→ months 5-9: crept to $1,900-2,100 as i started running DeepSeek-class experiments, costs tracking directly with model size
→ months 10-12: one agent loop ran for 36 hours against a 130B model while i slept, that month hit $2,400
→ month 13: ran the cumulative total for the first time, saw $23,800, felt physically sick
→ month 14: another $2,800 month while i waited for the hardware to ship
the box is the NVIDIA DGX Spark - roughly the footprint of a large mac mini, powered by a GB10 Grace Blackwell chip with 128GB of unified LPDDR5X memory
that unified memory is the whole thing
an RTX 4090 has 24GB of VRAM, which means a 70B model in full BF16 precision physically does not fit, you're quantizing down or you're renting cloud, those are your options
this box loads a 200B parameter model quantized and serves it through vLLM over localhost, same API interface the cloud endpoint used
the migration took one line of code - i changed the base URL from the provider's endpoint to 127.0.0.1:8000 and everything just worked
electricity to run continuous 200B inference locally comes out to about $12/month
the payback arithmetic is almost too clean: $2,999 hardware cost against $1,900/month saved, the box paid for itself before i'd owned it two months
what i didn't account for was how completely the cost model changes your behavior
when there's no hourly meter running, you greenlight experiments you'd never approve on cloud - agent loops that churn for hours, running 10,000 documents through a reasoning pass at 3am, speculative fine-tuning jobs you'd normally skip because the cost felt unjustifiable
i ran more experiments in the first 30 days after the box arrived than in the four months before it
the loyalty discount email landed about 8 weeks after i cancelled the cloud subscription
15% off my next three months, valued customer, we'd love to have you back
i didn't reply
the box was already running

Argona

22,099 views • 1 month ago
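
The payback arithmetic in the DGX Spark post above can be checked with a short sketch. All dollar figures come from the post; the function name and the choice to subtract the quoted electricity cost from the monthly saving are illustrative assumptions, not the author's own calculation.

```python
WEEKS_PER_MONTH = 52 / 12  # average calendar month length in weeks


def payback_months(hardware_cost: float,
                   monthly_cloud_bill: float,
                   monthly_power_cost: float = 0.0) -> float:
    """Months until the hardware cost is covered by the net monthly saving."""
    net_saving = monthly_cloud_bill - monthly_power_cost
    if net_saving <= 0:
        raise ValueError("no net saving, the hardware never pays for itself")
    return hardware_cost / net_saving


# Figures quoted in the post.
CLOUD_BILL = 1_900   # average monthly cloud spend, USD
POWER_COST = 12      # local electricity per month, USD

# Sanity checks on the post's totals.
assert CLOUD_BILL * 14 == 26_600   # 14 months at the average bill
assert 23_800 + 2_800 == 26_600    # month-13 cumulative total plus month 14

for label, price in (("founder's edition", 2_999), ("shipping price", 3_999)):
    months = payback_months(price, CLOUD_BILL, POWER_COST)
    print(f"{label}: {months:.2f} months (about {months * WEEKS_PER_MONTH:.1f} weeks)")
```

At the $2,999 price the result is under two months, which matches the post's "before i'd owned it two months"; at $3,999 the same formula gives slightly more than two months.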

NVIDIA JUST CAME FOR INTEL AND AMD WITH FOUR NUMBERS
Microsoft and NVIDIA posted the exact same tweet at the exact same time - "A new era of PC" and a set of coordinates. The numbers point straight to the Taipei Music Center, where Jensen Huang takes the stage June 1.
It is the N1X - NVIDIA's first ever PC processor, and they are finally about to show it.
- The N1X is an Arm laptop chip with 6144 CUDA cores, putting its graphics in RTX 5070 territory
- Microsoft is building a special Windows 11 version just for this chip - normal PCs do not even get it
- NVIDIA has been teasing the N1X since 2023 and ghosted us at Computex 2025 AND CES 2026
- ASUS, Dell and Lenovo already have laptops loaded and ready to launch
Three years of rumors. Five corporate accounts posting in sync. One keynote left. The chip that was never real is about to crash the party Intel and AMD have owned for decades.

BuBBliK

689,162 views • 1 month ago

LLMs have static knowledge cutoffs. They don't know about library updates, new APIs, or breaking changes that happened after training. Context7 by Upstash bridges this gap by injecting real-time docs into Cline. 🧵

Cline

31,890 views • 1 year ago

Claude Max subscribers: You can use your subscription in Cline instead of paying per-token API pricing. Here's how you can set it up 🧵

Cline

141,897 views • 1 year ago

This guy built a mini AI farm out of 4 Nvidia boxes It does not look like a data center. It looks like a stack of small machines sitting next to a laptop. But each box is a DGX Spark with Grace Blackwell inside, 128GB unified memory, and enough room to run models normal gaming GPUs cannot even open. Using the launch price from the article, 4 of them is almost $12,000 of local AI compute on one desk. That sounds expensive until you compare it to cloud GPUs. A serious AI builder can burn $1,500 to $3,000 a month renting A100s and H100s for client work, fine-tunes, agents and 70B models. He basically moved that bill from the cloud into hardware he owns. 4 Nvidia boxes. 512GB unified memory. No hourly meter running in the background. No rented GPUs eating the margin every time an agent runs too long. The funny part is most people still think local AI means a slow laptop running a toy model. Meanwhile guys like this are stacking compute at home. Save this, local AI is turning into the new mining farm.

Gipp 🦅

590,100 views • 1 month ago

Open source AI is actually moving at an unhinged pace right now. I literally hadn't even finished typing up my last Gemma 4 12b benchmark notes before Google went ahead and dropped the official Quantization Aware Training (QAT) checkpoints on Hugging Face.
If you missed the news, QAT basically bakes the compression directly into the training process. Instead of standard post-training quantization degrading the model's reasoning capabilities, QAT trains the model with compression in mind. Unsloth is reporting near-original performance at 4-bit with ~72% lower memory footprint. Details in the comments.
Naturally, I had to instantly pull the new GGUFs to see what a single RTX 4090 card (24 GB VRAM, CUDA 12.8, Ubuntu 22) could do. I fired up the llama.cpp engine again. Look at these numbers:
1. Unsloth Gemma 4 26B-A4B IT (QAT Q4_K_XL)
Command: ./build/bin/llama-cli -m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf -cnv -ngl 99 -c 250000 -fa on -v
- VRAM used: 19.5 GB
- Context: 250,000 tokens
- Decode throughput: 193 tps
2. Unsloth Gemma 4 31B IT (QAT Q4_K_XL)
Command: ./build/bin/llama-cli -m gemma-4-31B-it-qat-UD-Q4_K_XL.gguf -cnv -ngl 99 -c 60000 -fa on -v
- VRAM used: 23 GB (tight, but zero system RAM spillover)
- Context: 60,000 tokens
- Decode throughput: 47 tps
We are essentially watching hardware bottlenecks evaporate in real time. An update literally drops before you can finish benchmarking the previous one. What a time to be running local hardware. If you have a single RTX 3090 or RTX 4090, these are the latest Gemma models to try this week.

Alok

26,841 views • 1 month ago
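
The "~72% lower memory footprint" figure in the QAT post above follows from simple bits-per-weight arithmetic, sketched here. The helper names are hypothetical, and the effective bits per weight of the Q4_K_XL files is not stated in the post; the 4.5 used below is an illustrative assumption. The estimate covers weights only, so it will come out below the reported VRAM usage, which also includes the KV cache and runtime buffers.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def reduction_vs_bf16(bits_per_weight: float) -> float:
    """Fractional memory saving relative to 16-bit (BF16) weights."""
    return 1 - bits_per_weight / 16


# Assumption: roughly 4.5 effective bits per weight for a mixed 4-bit quant.
ASSUMED_BITS = 4.5

for name, n_params in (("26B", 26e9), ("31B", 31e9)):
    bf16 = weight_memory_gb(n_params, 16)
    quant = weight_memory_gb(n_params, ASSUMED_BITS)
    print(f"{name}: BF16 ~{bf16:.1f} GB, quantized ~{quant:.1f} GB, "
          f"saving {reduction_vs_bf16(ASSUMED_BITS):.0%}")
```

A pure 4-bit format would save 75% against BF16; a saving of about 72% corresponds to about 4.48 bits per weight, which is consistent with a quant that keeps some tensors at higher precision.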

Available in the MCP Marketplace: Apple Native Tools
Watch Cline use the new Apple MCP server to read project details from Mail, automatically create tasks in Reminders, make a related code edit, and then send an update email -- all without leaving VS Code.
Stop juggling apps and losing focus. Automate the admin work that pulls you away from coding. See how Cline integrates your personal productivity tools into your development workflow.

Cline

27,192 views • 1 year ago

Ever wished your AI coding agent had perfect memory? 🧠 Memory Bank is a game-changing approach that gives AI persistent memory across coding sessions. Here's how it works and why you should be using it: 🧵/

Cline

64,644 views • 1 year ago

Pro tip: stop re-explaining project conventions to Cline. Use the /newrule command to capture standards once from your task context, then easily reuse them later.

Cline

14,748 views • 1 year ago

Run Gemma 4 26b MTP on 8 GB VRAM GPUs at 25+ tokens/second. Flags included!
local llm space is moving at terminal velocity. only 3 days ago google released gemma 4 26b a4b qat quants. more efficient than before, ran on 8gb vram at 20 tok/sec. and now just a few hours ago, mainline llama.cpp merged a massive update and we just shattered our own record. decode throughput went 25-40% up on the same 8 GB VRAM setup!
Before MTP: 20 tps -> After MTP: 28 tps!
llama.cpp just officially merged PR #23398 ("add Gemma4 MTP"), bringing native Multi-Token Prediction (MTP) support to Gemma 4 models. By running speculative drafting on the same 8GB VRAM RTX 4060 setup, my decode throughput on a 64k context instantly leaped to a blistering 25–27 tokens/sec. that's a 25-35% increase over the 20 tps baseline with the same hardware.
Here is the architectural catch you need to know: Unlike the Qwen 3.5 and 3.6 series, which bake the MTP heads directly into the base GGUF, the Gemma 4 MTP head is not built in. You must download a separate, specialized MTP drafter GGUF (the assistant model) to act as the speculator. (I've dropped the download link in the replies).
copy and try the exact flags:
-m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf --spec-type draft-mtp --spec-draft-n-max 6 --spec-draft-p-min 0.7 --spec-draft-model gemma-4-26b-A4B-it-assistant-Q4_0.gguf -c 64000 -v
--spec-draft-n-max 4 with --spec-draft-p-min 0.7 is also worth checking out. benchmark on your setup and workflow.
if you have a single 8 gb vram nvidia rtx 4060, 3060, 3070, 2080, 2070, grab the MTP drafter GGUF link in the comments and try it yourself. Check it out even if you have a smaller or a larger gpu, such as a single rtx 3090, 4090, 3060, 2060. MTP works for all gemma 4 sizes such as gemma 4 12b, gemma 4 31b etc. but remember to grab the correct mtp draft assistant models respectively.
what are you benchmarking today

Alok

200,913 views • 1 month ago
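
The draft-then-verify idea behind the MTP post above can be illustrated with a toy greedy speculative decoder. This is a generic sketch and not llama.cpp's implementation: the two "models" are stand-in Python functions, and the parameters `n_max` and `p_min` only loosely mirror the flags quoted in the post (how llama.cpp interprets those flags is an assumption here).

```python
from typing import Callable, List, Tuple

Token = int
TargetFn = Callable[[List[Token]], Token]                # large model: greedy next token
DraftFn = Callable[[List[Token]], Tuple[Token, float]]   # drafter: (token, confidence)


def speculative_decode(target: TargetFn, draft: DraftFn, prompt: List[Token],
                       n_new: int, n_max: int = 6,
                       p_min: float = 0.7) -> Tuple[List[Token], int]:
    """Return (tokens, number of simulated target verification passes)."""
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < n_new:
        # Draft phase: propose up to n_max tokens, stopping at low confidence.
        proposal: List[Token] = []
        ctx = list(out)
        for _ in range(n_max):
            tok, conf = draft(ctx)
            if conf < p_min:
                break
            proposal.append(tok)
            ctx.append(tok)
        # Verify phase: a real engine scores all proposals in one batched pass;
        # here that single pass is simulated with per-position calls.
        target_passes += 1
        ctx = list(out)
        for tok in proposal:
            if tok != target(ctx):
                break            # first disagreement: discard the rest
            ctx.append(tok)
        ctx.append(target(ctx))  # target always contributes one token
        out = ctx
    return out[:len(prompt) + n_new], target_passes


# Toy models (hypothetical): the drafter agrees with the target most of the time.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] * 3 + 1) % 10


def toy_draft(ctx: List[Token]) -> Tuple[Token, float]:
    guess = toy_target(ctx)
    if len(ctx) % 4 == 0:        # deliberately wrong every fourth position
        guess = (guess + 1) % 10
    return guess, 0.9


tokens, passes = speculative_decode(toy_target, toy_draft, [1], n_new=12)
print(tokens, "target passes:", passes)
```

Because every accepted token is checked against the target's own greedy choice, the output is identical to decoding with the target alone; the gain comes from needing fewer target passes than generated tokens, which is where a throughput increase like the one reported in the post would come from.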

Some of our power users have a weird habit: they make Cline say 'YARRR!' before writing any code. Sounds crazy, but there's genius behind it. Here's what they discovered about getting 10x better output 👇

Cline

227,221 views • 1 year ago

How are developers scaling their work with AI? Some Cline users are moving beyond 1:1 pairing, experimenting with orchestrating multiple instances simultaneously. Imagine managing several Cline workstreams, each on a different feature branch, tackling tasks in parallel. Integration happens via standard Git PRs, sometimes even reviewed/merged by another dedicated Cline instance. This points to a potential future where developers orchestrate AI-driven workstreams, leveraging familiar Git practices to manage parallel development. The video below shows one example of this approach in action. It's fascinating seeing the community explore this, and it gives us ideas...

Cline

57,335 views • 1 year ago

🚀 Cline v3.7 is now available with an exciting interface upgrade!
Choose your path with just a click -- Cline now displays selectable options when asking questions or presenting plans, eliminating the need to type responses and making interactions faster and more intuitive.
Other updates:
- Added support for a .clinerules/ directory to load multiple files at once
- Prevent Cline from reading extremely large files into context that would overload context window
- Improved checkpoints loading performance and display warning for large projects not suited for checkpoints
- Added SambaNova API provider
- Enabled VPC endpoint option for AWS Bedrock profiles
- Added DeepSeek-R1 to AWS Bedrock
Download it now in VS Code.

Cline

22,006 views • 1 year ago

Cline 3.38.3 is live now!
New:
- Expanded Hooks functionality and UI
- Grok 4.1 & Grok Code added to XAI
- Native tool calling for Baseten & Kimi K2
- Thinking level for Gemini 3.0 Pro preview
Fixes for slash commands, Vertex, Windows terminal & thinking/reasoning across providers

Cline

14,970 views • 7 months ago