i just beat Google DeepMind's TurboQuant. introducing Shard. 10x KV cache compression on Llama-3.1-8B. zero quality loss

- 10x @ 8K context, 11.2x @ 32K
- NIAH recall 1.000 across 4K-32K
- LongBench Δ ≈ 0 vs FP16

TurboQuant tops out at 4-6x at the same quality. we doubled...
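To put the compression ratios above into bytes, here is a rough sketch of the FP16 KV cache arithmetic. It assumes the published Llama-3.1-8B shape (32 layers, 8 KV heads, head dimension 128); the function names are made up for illustration, and the ratios are simply the ones quoted in the post, not measured here.

```python
GIB = 1024 ** 3

def kv_cache_bytes(n_tokens, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Uncompressed KV cache size: one key and one value vector per layer per KV head."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * n_tokens

def compressed_bytes(n_tokens, ratio):
    """Size after applying a claimed compression ratio (taken from the post, not verified)."""
    return kv_cache_bytes(n_tokens) / ratio

if __name__ == "__main__":
    for ctx, ratio in [(8 * 1024, 10.0), (32 * 1024, 11.2)]:
        full = kv_cache_bytes(ctx) / GIB
        small = compressed_bytes(ctx, ratio) / GIB
        print(f"{ctx} tokens: {full:.2f} GiB FP16 -> {small:.3f} GiB at {ratio}x")
```

Under these assumptions each token costs 128 KiB of cache, so the cache grows linearly with context and the ratio translates directly into extra tokens per GiB.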

Krish

1,825 subscribers

155,670 views • 2 months ago • via X (Twitter)

Education, Science & Technology

Related Videos

Your local AI just got up to 5x more memory. Same model. Same device. Nearly zero accuracy loss. QVAC SDK 0.12.0 integrates TurboQuant - Google Research's latest memory optimisation algorithm. What is TurboQuant? The KV cache is the memory your model uses to track a conversation. As context grows, it fills up fast. 32K tokens. 64K. Game over. TurboQuant compresses it up to 5x with no accuracy loss. What does it unlock for you? Your app had a 16K token ceiling? It's now 96K. On the same device. Just update the QVAC SDK to get up to 5x more efficiency. No code changes. All from one SDK. The TurboQuant integration unlocks sovereign intelligence for more people, on more devices. Learn more →

QVAC

15,799,748 views • 1 month ago

Google's Gemma 4 26B A4B QAT hits 25+ tokens/sec decode and 320+ tokens/sec prefill on 8 GB VRAM (RTX 4060) + 16 GB RAM using TurboQuant

Prefill just went from 200 → 320+ tok/s on the same 8GB card. 1.6x, no new hardware, no new quant, just a KV cache trick stacked on top of the Gemma 4 26B MoE setup from a few days ago.

A few days ago I posted Gemma 4 26B A4B hitting 28 tok/s decode on 8GB VRAM using native MTP. Prefill was stuck around 200 tok/s. Fair callout by the community.

So today I tested something I'd already been meaning to try: TheTom/llama-cpp-turboquant, the TurboQuant KV cache fork by Tom Turney (GitHub link in the comments). Thanks to him, the fork just got resynced to mainline, so MTP + TurboQuant now run together cleanly (I didn't see any meaningful gains from using MTP with this setup, but you can try).

The flags (no MTP):

-m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf -cnv -c 64000 --cache-type-k q8_0 --cache-type-v turbo3

Results on the same RTX 4060 8GB, tested with a 27k token prompt at 64k context loaded:

Prefill: 200 tok/s → 320+ tok/s
Decode: stayed above 25 tok/s (without MTP)

Why it works: TurboQuant uses Walsh-Hadamard rotation + polar quantization on the KV cache. Keys are sensitive to compression, values much less so, so it splits the difference: K stays at q8_0, V drops to turbo3 (~3 bits).

Bonus from the memory savings: the same 8GB card can now stretch to 100-120k context with minimal decode penalty. It should now be snappier with any agent harness such as hermes agent, without compromising on intelligence.

If you're already running Gemma 4 on a small card, this stacks on top for free. Try --cache-type-k q8_0 --cache-type-v turbo3 on your setup and report back what your prefill/decode split looks like. Unsloth model GGUF and llama.cpp TurboQuant fork links in the comments.

What's your prefill number before vs after?
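For readers who want a feel for the "rotation" half of that explanation, here is a minimal pure-Python sketch. It applies random sign flips followed by an orthonormal fast Walsh-Hadamard transform, then a plain uniform 3-bit quantizer. The quantizer is a stand-in chosen for brevity: it is not TurboQuant's polar quantizer and not the fork's turbo3 format, and all function names are hypothetical.

```python
import math
import random

def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform; len(vec) must be a power of two."""
    out = list(vec)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]

def make_signs(n, seed=0):
    """Random +/-1 diagonal, fixed by a seed so it can be reused at decode time."""
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) for _ in range(n)]

def rotate(vec, signs):
    """Apply sign flips, then the Hadamard transform."""
    return fwht([s * x for s, x in zip(signs, vec)])

def unrotate(vec, signs):
    """Inverse of rotate: the orthonormal transform is its own inverse."""
    return [s * x for s, x in zip(signs, fwht(vec))]

def quantize(vec, bits):
    """Symmetric uniform quantization: integer codes plus one shared scale."""
    levels = 2 ** (bits - 1) - 1
    peak = max(abs(x) for x in vec) or 1.0
    scale = peak / levels
    codes = [max(-levels, min(levels, round(x / scale))) for x in vec]
    return codes, scale

def dequantize(codes, scale):
    return [c * scale for c in codes]

if __name__ == "__main__":
    signs = make_signs(64)
    x = [0.1] * 63 + [8.0]          # one large outlier coordinate
    y = rotate(x, signs)
    print("peak before:", max(abs(v) for v in x))
    print("peak after: ", max(abs(v) for v in y))
    codes, scale = quantize(y, bits=3)
    restored = unrotate(dequantize(codes, scale), signs)
    print("first restored values:", [round(v, 3) for v in restored[:4]])
```

The point of the rotation is that a single outlier gets spread across all coordinates, so the rotated vector has a much smaller peak relative to its norm, which is friendlier to a fixed low-bit codebook.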

Alok

119,821 views • 1 month ago

A single RTX 4090 (24 GB VRAM) can run the updated Gemma 4 31B (dense) model with a 190,000 context window at 33 tokens/second. The VRAM barrier is dying.

Google quietly updated Gemma 4, and Unsloth immediately compiled the new quants. I built llama.cpp from source on Ubuntu 22 to benchmark it.

Google's stealth update 2 days ago enabled uniform Flash Attention 4 on Hopper to boost prefill and patched the chat template to improve tool calling. The agentic reasoning gains on the benchmark charts are massive:

TB2 (Agents): +4.5% (to 25.8%)
Tau2 (Telecom): +10.1% (to 62.7%)

Running on Ubuntu 22, CUDA 13.0 with a single NVIDIA GeForce RTX 4090. Here is the exact step-by-step benchmarking process with a massive 28k-token prompt and the commands I used to squeeze out maximum context without killing my throughput:

# 1. The Baseline (Unquantized KV Cache)

I started with full GPU offload (-ngl 99) and pushed the context to 40k.

llama.cpp flags:
./build/bin/llama-server -m gemma-4-31B-it-qat-UD-Q4_K_XL.gguf -ngl 99 -c 40000 -fa on --port 8080 -v

VRAM: 23.8 GB (maxed out on card)
Throughput: Prefill: 2198.81 t/s | Decode: 35.77 t/s (with 28k tokens prompt)

# 2. The CPU Split Trap

I tried stretching to 80k context by offloading layers to the CPU (-ngl 52).

llama.cpp flags:
./build/bin/llama-server -m gemma-4-31B-it-qat-UD-Q4_K_XL.gguf -c 80000 -ngl 52 -fa on --port 8080 -v

Throughput: Prefill: 1212.73 t/s | Decode: 5 t/s (with 28k tokens prompt)

# 3. The KV Quantization Breakthrough

Instead of spilling layers to the CPU, I kept the model fully on card (-ngl 99) but enabled 8-bit KV cache quantization to free up VRAM.

flags:
./build/bin/llama-server -m gemma-4-31B-it-qat-UD-Q4_K_XL.gguf -c 100000 --cache-type-k q8_0 --cache-type-v q8_0 -ngl 99 --port 8080 -v

VRAM: 23.9 GB
Throughput: Prefill: 2139.68 t/s | Decode: 32 t/s (with 28k tokens prompt)

Result: 100k tokens of context on a single GPU with practically zero speed loss (and minimal intelligence loss).

# 4. The Limit Test (Q4 KV Cache)

To find the absolute breaking point, I dropped the KV cache to 4-bit (q4_0) and set -c 190000.

flags:
./build/bin/llama-server -m gemma-4-31B-it-qat-UD-Q4_K_XL.gguf -c 190000 --cache-type-k q4_0 --cache-type-v q4_0 -ngl 99 --port 8080 -v

VRAM: 23.8 GB
Throughput: Prefill: 2206.66 t/s | Decode: 33 t/s (with 28k tokens prompt)

(Note: Pushing it to 220k required dropping to -ngl 58 again, which immediately penalized decode down to 17 t/s).

# The Tradeoff:

For Max Reasoning: Keep your KV cache unquantized (f16). You get pristine reasoning but hit a strict 40k context ceiling.

For Massive Document Retrieval: If you need to feed the model giant codebases, use --cache-type-k q4_0.

Getting 190k context at 33 tokens/second on a consumer desktop with a 31B dense model is a cheat code. If you're rocking a single 3090 or 4090 and slept on Gemma 4 earlier, this update is your cue to dust off the terminal.

Hugging Face links to the Unsloth QAT quants are in the replies below.
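As a rough companion to the f16 / q8_0 / q4_0 comparison above, the sketch below works out how many bits each cache type spends per stored element. It assumes the standard ggml block layouts (32 elements per block, one 2-byte scale, then 1 byte or half a byte per element); the helper names are made up for illustration.

```python
BLOCK = 32        # elements per quantization block in q8_0 and q4_0
SCALE_BYTES = 2   # one fp16 scale per block

def bits_per_element(cache_type):
    """Storage cost per cached value, including the per-block scale."""
    if cache_type == "f16":
        return 16.0
    payload_bytes = {"q8_0": BLOCK, "q4_0": BLOCK // 2}[cache_type]
    return 8 * (SCALE_BYTES + payload_bytes) / BLOCK

def relative_capacity(cache_type):
    """How many more tokens fit in the same cache bytes, relative to f16."""
    return 16.0 / bits_per_element(cache_type)

if __name__ == "__main__":
    for name in ("f16", "q8_0", "q4_0"):
        print(f"{name}: {bits_per_element(name)} bits/elem, "
              f"{relative_capacity(name):.2f}x tokens per byte vs f16")
```

These ratios cover only the cache tensors. The context ceilings reported in the post also depend on the weights, compute buffers, and the model's attention layout, so they need not scale by exactly these factors.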

Alok

75,482 views • 11 days ago

New RLM trajectory that blew my mind! I will use this one as the main example in the YT tutorial.

I passed in a CSV containing transcripts of 320 episodes of the Lex Fridman podcast and asked it to find what his first 10 ML guests had to say about AGI. The context had 60,855,062 characters.

> Main agent explored the data format, understood it's a CSV
> Extracted all 320 guests, identified the first 10 ML guys (Bengio, Brockman, Goodfellow etc.)
> Launched parallel subagents, passing just their corresponding transcripts (about 35K chars each)
> Subagents performed find operations to search for AGI, read the context and returned outputs
> Main agent gathered all the data, generated a summary of all AGI conversations

It took 4 minutes to crunch, and the fun part is it cost me $0.2 with Minimax-M2.5. It read 1M tokens (825K were cache hits, so it was quite cheap) and produced just 69K tokens (19K were reasoning).

----

My notes:
- This would be basically impossible to do at this quality with a base LM (context rot, since 99% of the data is useless)
- It will cost 20x more with a ReAct model (too many tasks)
- It will cost 10x more with a ReAct + subagent model (read/write contexts instead of using symbolic variables)
- I'm a happy panda.

(thanks for reading)
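The shape of that trajectory (parse the CSV once, pick a handful of rows, fan out a narrow search per row, gather the results) can be sketched without any model calls. The column names `guest` and `transcript`, the predicate, and the function names below are assumptions for illustration; in the post the selection and the final summary were done by the model, which this sketch leaves out.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor

def load_rows(csv_text):
    """Parse the whole CSV into dict rows (hypothetical columns: guest, transcript)."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def find_mentions(transcript, keyword, window=40):
    """Sub-task: return a short snippet around every occurrence of keyword."""
    hits, start = [], 0
    while (idx := transcript.find(keyword, start)) != -1:
        hits.append(transcript[max(0, idx - window): idx + len(keyword) + window])
        start = idx + len(keyword)
    return hits

def run(csv_text, is_target_guest, keyword, first_n):
    """Main task: select the first N matching rows, search each in parallel, gather."""
    rows = load_rows(csv_text)
    selected = [r for r in rows if is_target_guest(r["guest"])][:first_n]
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda r: find_mentions(r["transcript"], keyword), selected))
    return {r["guest"]: hits for r, hits in zip(selected, results)}

if __name__ == "__main__":
    sample = "guest,transcript\nA,we talked about AGI today\nB,nothing here\nC,AGI and more AGI\n"
    print(run(sample, lambda g: g in {"A", "C"}, "AGI", first_n=2))
```

Each worker only ever sees its own transcript, which mirrors the note above about passing subagents roughly 35K characters instead of the full 60M-character context.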

AVB

43,794 views • 5 months ago

A tricky LLM interview question:

You're serving a reasoning model on vLLM, and it keeps running out of GPU memory on long traces. So you add KV cache compression and evict 90% of the cached tokens. VRAM usage stays as is and the GPU still runs out of memory. Why?

(answer below)

Evicting 90% of the KV cache can free almost none of the memory it was using. This sounds counterintuitive, but it follows directly from how production servers store the cache today.

The KV cache grows with every token a model generates. Each token appends its key and value vectors across every layer, and nothing is freed while generation continues. This is the dominant memory cost for reasoning models. If a 32K-token CoT caches ~32K tokens of KV vectors, a Qwen3-32B with 4-bit weights will run out of memory around 24K tokens on a 24GB GPU.

One obvious solution is to keep the important tokens and drop the rest, since attention is sparse enough to allow it. But this does not solve the memory problem yet.

The reason is paged attention, which is the memory manager behind vLLM and most production servers. Under the hood, it splits GPU memory into fixed physical blocks, each of which holds the KV for about 16 tokens. A block returns to the allocator only when every slot inside it is empty.

Since the eviction logic selects tokens by importance, and such tokens are scattered across blocks, most blocks are left with at least some survivor tokens despite eviction. For instance, if the logic evicts 14k of 16k tokens across 1,000 blocks, most blocks will still hold at least one token. This means the allocator frees very little.

Placing new tokens into those freed slots is not ideal because it breaks the cache's layout. Say token 16,001 arrives, and it's placed in the slot the 40th token used to hold. The cache now reads position 38, then 16,001, then 41, so the cache is no longer in token order. Attention can still compute the right answer from that, but only if every slot now carries a separate note recording which position it actually holds. This introduces another bookkeeping cost that an in-order layout inherently avoids.

So the cache is logically 90% smaller and still physically almost the same size. Many compression results miss this because they measure on pre-allocated contiguous tensors rather than a paged server.

There's another problem. Eviction methods pick which tokens to keep by looking at the attention scores themselves (as expected). But fast attention kernels used in production, like FlashAttention, never save those scores. They compute attention in small pieces and throw the full score grid away as they go, which is also why they're fast. So the exact signal eviction methods need isn't available in memory. The workaround is to fall back to eager attention and build the full matrix, which gives up the speed FlashAttention was there to provide.

NVIDIA published a method called TriAttention to solve both these problems. It never needs attention scores. Instead, it scores tokens from the geometry of the model's key and query vectors before RoPE is applied, where those vectors sit in stable clusters.

For the memory problem, it runs a compaction pass every 128 decoded tokens. The surviving tokens slide forward to close the holes eviction creates, so whole blocks empty out and return to the allocator while the cache stays in token order.

On long reasoning traces, the approach matches full-attention accuracy while decoding 2.5x faster and using 10.7x less KV memory.

KV cache compression is a big infrastructure problem. The number that decides whether it works is the count of freed blocks, not the count of evicted tokens.

You can find the NVIDIA write-up here:

I wrote a first-principles breakdown of how the KV cache works. It walks through why the model stores keys and values at all, why the cache grows with every token, and a comparison of LLM generation speed with and without KV caching. Read it below.
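The gap between evicted tokens and freed blocks is easy to see in a toy simulation. The sketch below uses the post's numbers (16,000 tokens, 16 slots per block, 14,000 evicted) and, as a simplifying assumption, picks survivors uniformly at random rather than by any importance score. It is not vLLM's allocator or TriAttention's compaction, just the counting argument; all names are hypothetical.

```python
import random

BLOCK_SIZE = 16  # slots per physical block, as in the post

def free_blocks(slots, n_blocks):
    """Number of blocks that hold no surviving token."""
    occupied = {slot // BLOCK_SIZE for slot in slots}
    return n_blocks - len(occupied)

def compact(survivors):
    """Slide survivors forward in token order; returns old slot -> new slot."""
    return {old: new for new, old in enumerate(sorted(survivors))}

def simulate(n_tokens=16_000, n_evict=14_000, seed=0):
    """Evict at random, then count reclaimable blocks before and after compaction."""
    n_blocks = n_tokens // BLOCK_SIZE
    rng = random.Random(seed)
    survivors = rng.sample(range(n_tokens), n_tokens - n_evict)
    before = free_blocks(survivors, n_blocks)
    after = free_blocks(compact(survivors).values(), n_blocks)
    return n_blocks, before, after

if __name__ == "__main__":
    n_blocks, before, after = simulate()
    print(f"blocks: {n_blocks}, freed without compaction: {before}, with compaction: {after}")
```

Under the uniform assumption a block empties out with probability of roughly 0.875^16, about 12%, so the large majority of blocks stay pinned even though 87.5% of tokens are gone. After compaction the 2,000 survivors fit in 125 blocks, and the other 875 can go back to the allocator.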

Avi Chawla

269,406 views • 1 month ago

A tricky LLM interview question: You're serving a reasoning model on vLLM, and it keeps running out of GPU memory on long traces. So you add KV cache compression and evict 90% of the cached tokens. VRAM usage stays as is and GPU still runs out of memory. Why? (answer below) Evicting 90% of the KV cache can free almost none of the memory it was using. This sounds counterintuitive, but it follows directly from how production servers store the cache today. The KV cache grows with every token a model generates. Each token appends its key and value vectors across every layer, and nothing is freed while generation continues. This is the dominant memory cost for reasoning models. If a 32K-token CoT caches ~32K tokens of KV vectors, a Qwen3-32B with 4-bit weights will run out-of-memory around 24K tokens on a 24GB GPU. One obvious solution is to keep the important tokens and drop the rest, since attention is sparse enough to allow it. But this does not solve the memory problem yet. The reason is paged attention, which is the memory manager behind vLLM and most production servers. Under the hood, it splits GPU memory into fixed physical blocks, each one holds the KV for about 16 tokens. This block returns to the allocator only when every slot inside it is empty. Since the eviction logic selects tokens by importance, and such tokens are scattered across blocks... ...so despite eviction, almost every block is left with at least some survivor tokens. For instance, if the logic evicts 14k of 16k tokens across 1,000 blocks, most likely every block will still have a token. This means the allocator frees almost nothing. Placing the new tokens into those freed slots is not ideal because it breaks the cache's layout. Say token 16,001 arrives, and it's placed in the slot the 40th token used to hold. The cache now reads position 38, then 16,001, then 41, so the cache is no longer in token order. 
Attention can still compute the right answer from that, but only if every slot now carries a separate note recording which position it actually holds. This introduces another bookkeeping cost that an in-order layout inherently avoids. So the cache is logically 90% smaller and still physically the same size. Many compression results miss this because they measure on pre-allocated contiguous tensors rather than a paged server. There's another problem. Eviction methods pick which tokens to keep by looking at the attention scores themselves (as expected). But fast attention kernels used in production, like FlashAttention, never save those scores. They compute attention in small pieces and throw the full score grid away as they go, which is also why they're fast. So the exact signal eviction methods need isn't available in memory. The workaround is to fall back to eager attention and build the full matrix, which gives up the speed FlashAttention was there to provide. NVIDIA published a method called TriAttention to solve both these problems. It never needs attention scores. Instead, it scores tokens from the geometry of the model's key and query vectors before RoPE is applied, where those vectors sit in stable clusters. For the memory problem, it runs a compaction pass every 128 decoded tokens. The surviving tokens slide forward to close the holes eviction creates, so whole blocks empty out and return to the allocator while the cache stays in token order. On long reasoning traces, the approach matches full-attention accuracy while decoding 2.5x faster and using 10.7x less KV memory. KV cache compression is a big infrastructure problem. The number that decides whether it works is the count of freed blocks, not the count of evicted tokens. You can find the NVIDIA write-up here: I wrote a first-principles breakdown of how the KV cache works. 
It walks through why the model stores keys and values at all, why the cache grows with every token, and a comparison of LLM generation speed with and without KV caching. Read it below.

Akshay 🚀

57,691 views • 1 day ago

The Cost of Intelligence is Heading to Zero | Hyperspace P2P Distributed Cache

We present our breakthrough cross-domain work across AI, distributed systems, cryptography, and game theory to solve the primary structural inefficiency at the heart of AI infrastructure: most inference is redundant.

Google has reported that only 15% of daily searches are truly novel. The rest are repeats or close variants. LLM inference inherits this same power-law distribution. Enterprise chatbots see 70-80% of queries fall into a handful of intent categories. System prompts are identical across 100% of requests within an application. The KV attention state for "You are a helpful assistant" has been computed billions of times, on millions of GPUs, identically.

And yet every AI lab, every startup, every self-hosted deployment computes and caches these results independently. There is no shared layer. No global memory. Every provider pays the full compute cost for every query, even when the answer already exists somewhere in the network.

This is the problem Hyperspace solves. Its distributed cache operates at three levels, each catching a different class of redundancy:

1. Response cache
Same prompt, same model, same parameters - instant cached response from any node in the network. SHA-256 hash lookup via DHT, with cryptographic cache proofs linking every response to its original inference execution. No trust required. Fetchers re-announce as providers, so popular responses replicate naturally across more nodes.

2. KV prefix cache
Same system prompt tokens - skip the most expensive part of inference entirely. Prefill (computing Key-Value attention states) is deterministic: same model plus same tokens always produces identical KV state. The network caches these states using erasure coding and distributes them via the routing network. New questions that share a common prefix resume generation from cached state instead of recomputing from scratch.

3. Routing to cached nodes
Instead of transferring KV state across the network for every request, Hyperspace routes the request to the node that already has the state loaded in VRAM. The request goes to the cache, not the cache to the request.

Together, these three layers mean that 70-90% of inference requests at network scale never require full GPU computation.

This work doesn't exist in isolation. It builds on research from across the industry: SGLang's RadixAttention demonstrated that automatic prefix sharing can yield up to 5x speedup on structured LLM workloads. Moonshot AI's Mooncake built an entire KV-cache-centric disaggregated architecture for production serving at Kimi. Anthropic, OpenAI, and Google all launched prompt caching products in 2024 - priced at 50-90% discounts - because system prompt reuse is so pervasive that it changes the economics of inference.

What all of these systems share is a common limitation: they operate within a single organization's infrastructure. SGLang caches prefixes within one server. Mooncake disaggregates KV cache within one datacenter. Anthropic's prompt caching works within one API provider's fleet. None of them can share cached state across organizational boundaries.

Hyperspace removes this boundary. The cache is global. A response computed by a node in Tokyo is immediately available to a node in Berlin. A KV prefix state generated for Qwen-32B on one machine is verifiable and reusable by any other machine running the same model. The routing network provides the delivery guarantees, the erasure coding provides the redundancy, and the cache proofs provide the trust.

What this means for the cost of intelligence

Big AI labs scale linearly: twice the users means twice the GPU spend. Every query is a cost center. Their internal caching helps, but it's siloed - Lab A's cache can't serve Lab B's users, and neither can serve a self-hosted Llama deployment.

Hyperspace scales sub-linearly. Every new node that joins the network adds to the global cache. Every inference result enriches the cache for all future requests. The cache hit rate rises with network size because query distributions follow a power law - the most common questions are asked disproportionately more often than rare ones.

The implication is simple: as the network grows, the effective cost per inference drops. Not linearly. Logarithmically. At 10 million nodes, we estimate 75-90% of all inference requests can be served from cache, eliminating 400,000+ MWh of energy consumption per year and avoiding over 200,000 tons of CO2 emissions. The first person to ask a question pays the compute cost. Everyone after them gets the answer for free, with cryptographic proof that it's authentic.

Training is competitive. Inference is shared

Open-weight models are converging on quality with closed models. Labs will continue to differentiate on training - data curation, architecture innovation, RLHF tuning. That's where the real intellectual property lives.

But inference is a commodity. Two copies of Qwen-32B running the same prompt produce the same KV state and the same response, byte for byte, regardless of whose GPU runs the matrix multiplication. There is no moat in multiplying matrices. The moat is in training the weights.

A global distributed cache makes this separation explicit. It doesn't matter who trained the model. Once the weights are open, the inference cost approaches zero at scale - because the network remembers every answer and can prove it's correct. No lab, no matter how well-funded, can match this. They cannot share caches across competitors. They scale linearly. The network scales logarithmically. The marginal cost of intelligence approaches zero. That's the endgame.
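To make the first two cache levels concrete, here is a minimal sketch of how a response-cache key and a KV prefix lookup could work. It is not Hyperspace's code: the function names, the JSON canonicalization, and the plain dict standing in for the DHT are all assumptions for illustration. The post only states that lookup uses a SHA-256 hash over the same prompt, model, and parameters.

```python
import hashlib
import json


def response_cache_key(model: str, prompt: str, params: dict) -> str:
    """SHA-256 over a canonical encoding of (model, prompt, params).

    Sorting keys makes the hash independent of dict ordering, so two
    nodes that build the request differently still get the same key.
    """
    canonical = json.dumps(
        {"model": model, "prompt": prompt, "params": params},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def prefix_key(model: str, token_ids: list) -> str:
    """Key for the KV state of one exact token prefix under one model."""
    payload = model + ":" + ",".join(str(t) for t in token_ids)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def longest_cached_prefix(model: str, token_ids: list, kv_store: dict):
    """Return (n, state) for the longest cached prefix, or (0, None).

    kv_store is a hypothetical stand-in for the network lookup. Trying
    every prefix length is quadratic; it is kept simple for readability.
    """
    for n in range(len(token_ids), 0, -1):
        key = prefix_key(model, token_ids[:n])
        if key in kv_store:
            return n, kv_store[key]
    return 0, None


# Usage: a cached system prompt of three tokens is reused by a longer request.
store = {prefix_key("qwen-32b", [1, 2, 3]): "kv-state-for-system-prompt"}
n, state = longest_cached_prefix("qwen-32b", [1, 2, 3, 9, 9], store)
```

In this sketch only the two trailing tokens would still need prefill, since the first three are covered by the cached state.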

Varun

37,362 views • 4 months ago

Researchers found a way to make LLMs 8.5x faster! (without compromising accuracy)

Speculative decoding is quite an effective way to address the single-token bottleneck in traditional LLM inference. A small "draft" model first generates the next several tokens, then the large model verifies all of them at once in a single forward pass. If a token at any position is wrong, you keep everything before it and restart from there. This never does worse than normal decoding.

But current drafters in speculative decoding still guess one token at a time. That makes the drafting step itself a bottleneck, capping real-world speedups at 2-3x.

DFlash is a new technique that swaps the autoregressive drafter for a lightweight block diffusion model that guesses all tokens in one parallel shot. Drafting cost stays flat no matter how many tokens you speculate. On top of that, the drafter is conditioned on hidden features pulled from multiple layers of the target model and injected into every draft layer, so it makes significantly better guesses than a drafter working from scratch.

In the side-by-side demo below, vanilla decoding runs at 48.5 tokens/sec. DFlash hits 415 tokens/sec on the same model, with zero quality loss.

It's already integrated with vLLM, SGLang, and Transformers, with draft models on HuggingFace for several models like Qwen3, Qwen3.5, Llama 3.1, Kimi-K2.5, gpt-oss, and many more. I have shared the GitHub repo in the replies!

KV caching is another must-know technique to boost LLM inference. I recently wrote an article about it. Read it below.

👉 Over to you: What use case are you working on that can benefit from this new technique?
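The draft-then-verify loop described above can be sketched in a few lines. This shows the classic greedy form of speculative decoding with an autoregressive drafter, not DFlash's block diffusion drafter. The toy "models" are plain functions from a token list to the next token, and all names here are hypothetical.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], k: int) -> List[int]:
    """One round of greedy speculative decoding; returns the accepted tokens."""
    # 1. The draft model proposes k tokens, one at a time.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # 2. The target checks each position. A real system scores all k
    #    positions in one batched forward pass; this loop is for clarity.
    accepted: List[int] = []
    for tok in proposed:
        expected = target(context + accepted)
        if tok != expected:
            # First mismatch: keep everything before it, use the target's token.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. All k drafts matched, so the target's pass yields one bonus token.
    accepted.append(target(context + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken,
             prompt: List[int], n_tokens: int, k: int = 4) -> List[int]:
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(target, draft, out, k))
    return out[: len(prompt) + n_tokens]


# Toy models: the target counts upward mod 10; the draft agrees except after a 4.
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


result = generate(toy_target, toy_draft, [0], n_tokens=8)
```

Every round accepts at least one token chosen by the target, so the output matches plain greedy decoding with the target alone. The draft only changes how many target passes are needed.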

Avi Chawla

157,390 views • 2 months ago

Atomic Agent beat Hermes on GAIA: 69.8% vs 58.5%, and it was 1.6x faster!

We ran both agents through the full GAIA Level 1 benchmark: 53 real-world tasks, the same 4-bit qwen-3.6-35b, the same Apple M4 Max.

Results:
✦ Atomic Agent: 37 of 53 solved, done in 3h 12m
✦ Hermes Agent: 31 of 53 solved, took 5h 10m

Atomic solved 6 more tasks and finished nearly 2 hours sooner. Hermes ran into the 900s timeout on 7 tasks; Atomic on just 2. Hermes burned 71% of its total time on tasks it still failed; Atomic, 48%.

Where it showed:
✦ Audre Lorde poem, which stanza is indented: Atomic pushed through a dead source, switched tools, and answered in 7.6 min. Hermes ran the full clock and returned a blank.
✦ Vietnamese specimens, which city they ended up in: Atomic pulled it from the first source and normalized the answer in 33s. Hermes spent 7.3 min and never answered.
✦ The dinosaur featured-article nominator: Atomic walked the Wikipedia chain to "FunkMonk" in 57s. Hermes guessed a wrong name after 11 min.

Atomic keeps a byte-stable prompt prefix, so llama-server reuses the KV-cache instead of re-encoding the whole context every turn. It emits one JSON array of tool calls per inference, then compresses results back instead of pasting them in full, so the context never balloons and a small model stays sharp deep into a task. On top of that, a no-progress guard vetoes repeated identical tool calls (warn at 3, hard veto at 5) and forces a reply, so Atomic never sinks 15 minutes into re-scanning one page the way Hermes did.

Both agents missed some of the same questions, and on a few Hermes got there and Atomic did not, usually format slips where Atomic computed the right number but printed the working instead of the bare value. But on identical hardware and identical weights, the runtime that reuses its cache and refuses to spin came out ahead on accuracy and speed. Getting this from the runtime alone is wild.

Run the same 53 GAIA tasks on Atomic Agent!
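The no-progress guard is easy to picture as a counter over identical tool calls. The sketch below is a guess at the idea, not Atomic Agent's implementation: the class name, the return values, and the choice to count total (rather than consecutive) repeats are assumptions. Only the thresholds, warn at 3 and hard veto at 5, come from the post.

```python
import json
from collections import Counter


class NoProgressGuard:
    """Counts identical tool calls and escalates from 'ok' to 'warn' to 'veto'."""

    def __init__(self, warn_at: int = 3, veto_at: int = 5):
        self.warn_at = warn_at
        self.veto_at = veto_at
        self.counts = Counter()

    def check(self, tool: str, args: dict) -> str:
        # Serialize args with sorted keys so {"a": 1, "b": 2} and
        # {"b": 2, "a": 1} count as the same call.
        key = (tool, json.dumps(args, sort_keys=True))
        self.counts[key] += 1
        seen = self.counts[key]
        if seen >= self.veto_at:
            return "veto"  # caller should block the call and force a reply
        if seen >= self.warn_at:
            return "warn"  # caller can tell the model it is repeating itself
        return "ok"


guard = NoProgressGuard()
verdicts = [guard.check("fetch_page", {"url": "https://example.com"}) for _ in range(5)]
```

The headline numbers check out against the raw counts: 37/53 ≈ 69.8%, 31/53 ≈ 58.5%, and 310 min / 192 min ≈ 1.6x.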

Atomic Agent

109,729 views • 5 days ago

I have been testing DeepSeek-V4-Pro with the Pi coding agent. I am mind-blown by how well it works out of the box. A few notes:

I spent a few hours building an LLM wiki with an agent powered entirely by DeepSeek-V4-Pro on Fireworks AI inference. This is the first time I feel like there is an open-weight model that can reason at the level of Claude and Codex. And it does this in a cost-effective way with support for 1M context length.

To be clear, I am using DeepSeek-V4-Pro inside of Pi without any special configuration. It works out of the box. It's exciting that there is a model that can just be plugged into a basic harness like Pi, and it just works. I've never seen that before. Most models require lots of configuration and setup.

DeepSeek-V4-Pro is clearly good at agentic coding (probably the best of the open-weight models), but the model is also great on knowledge-intensive tasks where reasoning matters. The agent pulled agentic engineering best practices from different company docs (Anthropic, OpenAI, Google, Stripe, Meta, Modal, DeepSeek, Mistral, Cohere), searched and digested Reddit and HN threads, summarized arXiv papers, and surfaced trending GitHub repos. Then it distilled everything into actionable tips across categories.

I love the wiki it built. The quality is really good. Here is a snapshot of what the wiki looks like:

DeepSeek-V4-Pro handled the task without breaking stride: multi-step research queries, code generation for scaffolding, context-heavy reasoning across disparate sources. For coding specifically, this is the first open-weight model that genuinely feels like a Codex or Claude Code experience. It is comparable both in capability and in actual multi-turn agentic work.

What made the loop feel so responsive was Fireworks' inference speed (the fastest in the market) and the fact that they actually validate models at the systems level before shipping. No corrupted reasoning traces. Just fast, reliable iteration.

The hybrid CSA and HCA attention design cuts KV cache to just 10% and inference FLOPs by nearly 4x at 1M-token context. This is what makes the agent loop actually fast and cheap enough to run in practice.

For devs who've been watching open-weight models close the gap but haven't found one that actually delivers in practice, this is the closest I've seen. Try it here:
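To get a feel for why a 10% KV cache matters at 1M tokens, here is a back-of-the-envelope calculation. The post gives no architecture details for DeepSeek-V4-Pro, so the shape below (32 layers, 8 KV heads, head dimension 128, 2-byte values) is a stand-in borrowed from a Llama-3.1-8B-style model with standard grouped-query attention. The absolute figures are illustrative only; the 10% factor is the one claimed in the post.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Size of a standard KV cache for one sequence.

    The leading 2 accounts for storing both a key and a value vector
    per layer, per KV head, per token.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical shape, NOT DeepSeek-V4-Pro's real configuration.
baseline = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                          seq_len=1_000_000)
reduced = baseline // 10  # the post's "cuts KV cache to just 10%"

baseline_gb = baseline / 1e9
reduced_gb = reduced / 1e9
```

Under these assumed dimensions the full cache is about 131 GB per 1M-token sequence and the reduced one about 13 GB, which is the difference between needing several accelerators for the cache alone and fitting it on one.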

elvis

59,803 views • 3 months ago

HERMES AGENT HAS 5 SYSTEMS RUNNING UNDER THE HOOD. UNDERSTAND THEM AND YOU USE THE AGENT 10X BETTER.

In this video Alejandro AO 🤗 explained:

1. THE AGENT LOOP

every message triggers the same cycle:
→ you send a message
→ Hermes builds context (SOUL.md + memory.md + user.md + skills + tools + message history)
→ sends everything to the LLM
→ LLM decides: call a tool or respond
→ if tool call: execute, return result, loop back
→ if response: deliver to you
→ after response: memory update (agent checks if anything is worth remembering, writes to memory.md or user.md)

this loop is why Hermes gets better over time. the memory update after every response means the agent learns from every conversation.

2. CONTEXT ASSEMBLY

what the LLM sees on every turn:
→ SOUL.md (your agent's personality and rules)
→ memory.md (facts the agent learned over time)
→ user.md (facts about you, auto-updated)
→ AGENTS.md and .hermes.md (project context files)
→ skill descriptions (loaded on demand)
→ tool schemas (available actions)
→ message history (current conversation)

if SOUL.md is empty, Hermes falls back to a default system prompt. write your own SOUL.md and the agent becomes yours, not generic.

CONTEXT COMPRESSION: conversations hit context limits. Hermes handles this at two checkpoints:

preflight: before each turn. if conversation exceeds 50% of context window, compression fires. older messages get summarized. last 20 messages stay intact (protect_last_n).

gateway auto-compression: between turns. fires at 85%. more aggressive. prevents API errors before the agent even starts processing your message.

after compression, a new session lineage ID is generated. the agent can trace back to the original conversation through SQLite.

three things break prompt cache: switching models mid-session, changing memory files, or changing context files.

3. THE GATEWAY

the system that keeps Hermes reachable on 27+ messaging platforms. an async loop runs continuously. listens for incoming messages from Telegram, Discord, Slack, WhatsApp, email, SMS, and every other adapter.

when a message arrives:
→ gateway identifies which session it belongs to
→ queries SQLite for the full message history (session ID = platform prefix + chat ID)
→ builds the context from scratch
→ sends everything into the agent loop
→ delivers the response back to the platform

the gateway also runs the session manager. when you send a message while the agent is busy:
→ default: queued for next turn
→ /steer: injected without interrupting
→ /interrupt: stops current work

without the gateway, Hermes is a CLI tool. with the gateway, Hermes is an always-on agent you reach from your phone.

4. MEMORY (THREE LAYERS)

LAYER 1 - MARKDOWN FILES
SOUL.md (identity), memory.md (learned facts), user.md (facts about you). injected into context after the system prompt. updated by the agent after every response.

LAYER 2 - SQLITE
full transcripts of every session stored locally. FTS5 full-text search across all past conversations. session lineage tracking across compressions. the agent can recall what you discussed weeks ago using /recall or session search.

LAYER 3 - EXTERNAL PROVIDERS (optional)
8 supported providers: Mem0, SuperMemory, Honcho, Zep, and more. each works differently (semantic search, LLM extraction, similarity matching). queried after the first message in each session. the agent processes your topic first, then checks external memory for related context from past conversations. not enabled by default. enable for significantly better long-term recall.

5. CRON ENGINE

a loop inside the gateway ticks every 60 seconds. each tick checks ~/.hermes/cron/jobs.json for scheduled tasks.

if a job is due:
→ fresh session (no chat history, no memory pollution)
→ execute the prompt with assigned tools
→ store the run output as markdown in ~/.hermes/cron/output/[job-id]/
→ deliver result to your home messaging platform

cron does NOT use the send_message tool. delivery happens at the system level, not the agent level. a cron session cannot create more cron jobs. prevents runaway loops.

WHY THIS MATTERS:

the agent loop teaches it. the context assembly focuses it. the gateway reaches it. the memory remembers it. the cron engine automates it.

five systems. one agent. understanding how they connect changes how you configure every level.

full 15 levels breakdown in the article 👇
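The two compression checkpoints from section 2 can be sketched as a pair of small functions. This is not Hermes's code: the function names and the token-count inputs are hypothetical. The 50% and 85% thresholds and the 20-message protected tail (protect_last_n) are the only details taken from the post.

```python
from typing import List, Tuple

PREFLIGHT_THRESHOLD = 0.50  # "exceeds 50% of context window"
GATEWAY_THRESHOLD = 0.85    # "fires at 85%"
PROTECT_LAST_N = 20         # "last 20 messages stay intact"


def needs_compression(used_tokens: int, context_window: int, checkpoint: str) -> bool:
    """Decide whether compression should fire at the given checkpoint."""
    usage = used_tokens / context_window
    if checkpoint == "preflight":
        return usage > PREFLIGHT_THRESHOLD   # strictly exceeds 50%
    if checkpoint == "gateway":
        return usage >= GATEWAY_THRESHOLD    # fires once 85% is reached
    raise ValueError(f"unknown checkpoint: {checkpoint}")


def split_for_compression(
    messages: List[str], protect_last_n: int = PROTECT_LAST_N
) -> Tuple[List[str], List[str]]:
    """Split history into (older messages to summarize, recent messages to keep)."""
    cut = max(len(messages) - protect_last_n, 0)
    return messages[:cut], messages[cut:]


# Usage: 70k tokens in a 128k window trips preflight but not the gateway check.
history = [f"msg {i}" for i in range(50)]
fire_preflight = needs_compression(70_000, 128_000, "preflight")
fire_gateway = needs_compression(70_000, 128_000, "gateway")
to_summarize, kept = split_for_compression(history)
```

Here the first 30 messages would be summarized and the last 20 passed through untouched. Note that rewriting the older messages changes the prompt prefix, so a cached prefix would no longer match after compression.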

YanXbt

51,258 views • 1 month ago

Claude Cowork + Google Ads is f*cking cracked 🤯

Set up once → ask Claude questions like:
"What's driving my CPA spike this week?"
"Which search terms are wasting budget?"
"Run a full account audit and tell me the top 5 things to fix."

All inside Claude Cowork.

Perfect for DTC brands and agencies running Google Ads who are still exporting CSVs every Monday, rebuilding the same pivot table, and trying to figure out why CPA spiked 30% overnight.

If your Google Ads workflow still looks like this: log in, stare at columns, download a search term report, open a spreadsheet, highlight the bad ones in red, forget to actually negate them...

Claude Cowork does the whole thing in one prompt:
→ Connects to your live Google Ads data via GoMarble MCP (free, 5-minute setup)
→ Runs a full account audit across campaigns, ad groups, and keywords
→ Finds your exact wasted spend in dollars: every search term burning budget with zero conversions
→ Scores your account health 0-100 across 6 dimensions
→ Flags creative fatigue, quality score issues, and budget misallocation
→ Builds a visual HTML dashboard with CPA trends, spend vs conversions, and campaign breakdowns
→ Writes a weekly performance report your clients or team can actually read

No more CSV exports. No more pivot tables. No more "I'll negate those search terms tomorrow."

What you get:
- 21 specialized Google Ads skills that plug directly into Claude
- A full account audit with a health score and prioritized fix list
- Negative keyword discovery on autopilot
- Search term mining that surfaces hidden winners and budget waste
- Visual dashboards you can screenshot and send to clients
- Weekly reports written in plain English, not spreadsheet noise

I put together the full skill pack: all 21 Google Ads skills for Claude, plus the GoMarble MCP setup guide to get Cowork connected to your accounts in under 5 minutes.

Want it for free?
> Like this post
> Comment "ADS"
And I'll send it over (must be following @learnwithella so I can DM)

Ismail Khan

17,818 views • 2 months ago

Sam Altman just dropped the most important interview of 2025. And buried in it are four numbers that explain why everything you think about AI is wrong. Here's what he revealed: Number 1: AI companies are generating 10 TRILLION tokens per day. Humans? Average 20,000 tokens per day. Sam's exact words: "Models will output more tokens than all of humanity put together. Then 10x that. Then 100x that." We're not talking about AI assisting human work anymore. We're talking about AI replacing the entire volume of human intellectual output on the planet. And most people have no idea this shift already happened. Number 2: OpenAI's enterprise business is CRUSHING consumer. Everyone thinks OpenAI is ChatGPT for normies. Wrong. Sam just revealed: "Enterprise growth OUTPACED consumer growth this year." The API business is growing faster than ChatGPT. Over 1 million enterprise users already. "If we had double the compute, we'd be at double the revenue right now." Translation: OpenAI isn't compute-constrained by technology. They're revenue-constrained by infrastructure. The bottleneck is supply and not demand. Every dollar of compute they add prints money. Number 3: GPT-5.2 beats you at 74% of your job. Sam revealed OpenAI's internal GDP-Val benchmark. It measures how AI performs on knowledge work tasks across 40+ verticals. The results: GPT-5.2 beats or ties expert-level knowledge workers at 74.1% of tasks. Legal analysis. PowerPoint decks. Web apps. Financial modeling. Customer support. Sam's description: "A co-worker you can assign an hour's worth of tasks to and get something you prefer back 3 out of 4 times." Three years ago, ChatGPT launched at basically 0% on this scale. Now it's at 74%. And that's not GPT-6. That's what's available RIGHT NOW. Most companies haven't even started using this yet. But here's what Sam said about the gap between capability and adoption: "The overhang is going to be massive. 
Most people are still asking similar questions they did in the GPT-4 realm." Translation: The models can do 10x more than people have figured out how to use them for. Which means there's a HUGE arbitrage opportunity. Early adopters who actually integrate this into workflows will dominate their industries before competitors even understand what happened. Number 4: AGI already happened. And nobody noticed. Sam's exact quote: "AGI kind of went whooshing by. We're in this fuzzy period where some people think we have it and some don't." Read that again. The CEO of OpenAI just said AGI might have already arrived and we're arguing about definitions while it's actively replacing knowledge work. He even moved the goalposts. The new benchmark: "Superintelligence" = when AI can be a better president or CEO than any human. Not "as good as." BETTER than. We went from "can AI pass a Turing test" to "can AI run countries better than humans" in 3 years. So what does this actually mean? The AI revolution isn't about chatbots getting smarter. It's about the complete replacement of human intellectual output with machine output. At scale. Across every industry. Faster than anyone's prepared for. And the companies positioning for this RIGHT NOW are the ones printing money. OpenAI's enterprise growth is outpacing consumer because businesses see what's coming. They're not buying "AI tools." They're buying the ability to 10x output without 10x-ing headcount. Sam said they'll triple their compute next year. Then triple it again. Revenue is growing even faster than that. "We have never found a situation where we can't monetize all the compute we have." If he isn't lying then that's literally a printing press. The market still doesn't get it. Everyone's focused on "AI bubble" fears while OpenAI is solving the only problem that matters: turning compute into revenue at a faster rate than they're spending. They're not hoping demand catches up to supply. 
Demand is already 2x ahead of what they can deliver. Meanwhile, most knowledge workers are still using GPT-4 prompts on GPT-5.2. The capability overhang is massive. The arbitrage window is open. And it's closing fast. If you're running a B2B business and you're not integrating AI at the level Sam just described, you're not "waiting to see how it plays out." You're getting crushed by competitors who already figured it out. The companies that win in 2026 won't be the ones with the best AI. They'll be the ones who understood what Sam just laid out 6 months before everyone else did.

Ricardo

358,648 views • 7 months ago

10 repos that cut your ai agent token bill by up to 80%

1. microsoft/LLMLingua → cuts prompt size by up to 95%
compresses prompts before the api call. 20x compression. published at EMNLP + ACL. near-zero quality loss. 6,100 stars

2. mem0ai/mem0 → replaces full conversation history in context
stores what matters. retrieves only what's needed. 10,000 token history → 200 token memory. per agent. 54,800 stars

3. BerriAI/litellm → routes each call to the cheapest model
simple task → haiku. complex task → sonnet. tracks cost per agent, per call, per day. 45,700 stars

4. run-llama/llama_index → replaces sending full documents
rag: 100-page doc → 3 relevant chunks → same answer. 98% fewer tokens per query. 49,100 stars

5. chroma-core/chroma → replaces keyword search in full context
vector store. finds the closest match. feeds only that. 50-200 tokens per query instead of thousands. 27,800 stars

6. letta-ai/letta → replaces infinite context window crashes
paged memory for agents. loads only relevant memory. stops your agent from hitting limits and retrying. 22,400 stars

7. guidance-ai/guidance → cuts output token bloat by 30-50%
structured generation. constrains model output natively. no more 100-token prompts to get json back. 21,400 stars

8. Aider-AI/aider → replaces pasting entire codebases
builds a repo map. sends only files relevant to the task. not your whole project. just what the agent needs. 44,300 stars

9. openai/tiktoken → count tokens before you send
know the exact cost before the api call happens. not after the bill arrives. 18,100 stars

10. simonw/ttok → hard cap on what gets sent
cli tool: count tokens, truncate to budget limit. pipe any text in. get truncated output back. 389 stars

most agents are expensive not because the model is expensive. because nobody checked what was being sent to it.
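a rough sketch of the "count first, then cap" idea from items 9 and 10. this is not the tiktoken or ttok API: the whitespace word count below is an assumed stand-in for a real tokenizer, and the names `count_tokens` and `truncate_to_budget` are hypothetical. real token counts will differ from word counts.

```python
def count_tokens(text: str) -> int:
    """Approximate token count (assumption: one whitespace-separated word = one token)."""
    return len(text.split())


def truncate_to_budget(text: str, max_tokens: int) -> str:
    """Keep only the first max_tokens approximate tokens of text."""
    if max_tokens < 0:
        raise ValueError("max_tokens must be non-negative")
    words = text.split()
    return " ".join(words[:max_tokens])


# Hypothetical prompt, used only to show the check-then-cap flow.
prompt = "summarize the attached report in three bullet points please"
capped = truncate_to_budget(prompt, 5)

print(count_tokens(prompt))  # size before the cap
print(count_tokens(capped))  # size after the cap
print(capped)
```

swapping `count_tokens` for a real tokenizer keeps the same shape: measure what is about to be sent, then cut it to the budget before the call instead of after the bill.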

self.dll

39,475 views • 2 months ago

One thing I wish I had done differently when starting out with my cabinet shop was to leverage outsourcing and subcontractors more. I started my business in a 1-car garage at 24 yrs old. I was young, hungry, and felt I needed to prove a point so I committed to purchasing all the equipment I thought I needed to start a woodworking business. Nearly $200k spent in the first 6 months outfitting a 5,000 sf shop with only $15k coming in. I did have cash saved up so the initial investment did not include taking on a ton of debt. But it was a horrible way to allocate the funds. Unlike in many trades-based businesses, it is uncommon to be able to rent equipment for a cabinet shop so I went all in on my own. However, you can 100% rent time on other shops' machines, buy out cabinet doors and drawer boxes, sub out finishing to a painter, sub out delivery and install, etc. I did none of this when I started out. I bought all the equipment, manufactured all of the components, did all of the finishing, self-performed all deliveries, and did install myself. It was a total grind and it made managing cashflow a nightmare. I ended up becoming a shop other woodworkers paid to rent machine time and that covered a good portion of my bills in years 1 & 2. It wasn't until 5 years in that I was beaten down enough by poor cashflow that I started to sub out delivery and installs. We still manufactured all components and did finishing in house as we were focused on high quality and customization and had tooled up in the process of settling in this niche, which didn't make it economical to outsource with the equipment and staff I had. I don't regret doing things this way because I gained 20 years of experience in 10 years but it certainly made scaling a custom cabinet business 10x more difficult. Sure, the technical aspect of engineering, manufacturing, and installation had a steep learning curve but I loved every minute of that learning curve. 
The constant pressure that clouded my judgement and ultimately led to burnout was a poor cash conversion cycle and margin compression from equipment loans and overhead. I feel fortunate to have been able to put myself through the wringer for a decade and to have gained a ton of experience that I can now apply to the next venture but I sure am happy to have been able to exit that business financially unscathed. It was a nail biter nearly every day. The video below inspired this reflection and I think it's a great message to those just starting out with any business, not just excavating or the trades.

Alex Forbes

42,417 views • 5 months ago

I just compared Claude Code vs Codex vs Cursor CLI The task was to build a Next.js app with Tailwind 4 and shadcn components to collect customer feedback and showcase it with a widget. I gave all three the same prompt and let them go for 30 minutes to see what they came up with. Claude Code with Opus 4.1 Even though I told it to set up the app in the existing project folder, it tried to create a directory for it. After I interrupted and told it not to do that, it built a demo form and landing page with no errors. I had to ask it to make the demo interactive so users could submit a testimonial and preview it. The landing page looked like AI and was pretty basic, but it worked and it was done in a fraction of the time of the others. Total tokens used: 33k Codex with GPT-5 At the end of the 30 minutes I just could not get Codex to produce a working app. It got stuck in a loop of not being able to set up Tailwind 4 and despite many, MANY, attempts, I ended up with a "failed to compile" error. Total tokens used: 102k Cursor Agent with GPT-5 This was the slowest agent by far and a couple of times I actually thought it got stuck in a loop and was close to Ctrl+C'ing to cancel it. The TUI is really nice though, especially how it shows diffs and it did eventually build a working app (after one or two slight errors that needed fixing) The demo was interactive and it had a very minimal design that looked bare but also a lot less like an "AI generated" app than the Opus 4.1 design. It also wasn't too chatty and just did what it needed to do! Code quality was on a par with Opus 4.1, but it did use 5.5x as many tokens to get there. Still cheaper than Opus on a direct comparison but not when you factor in a Claude Code Max subscription. Total tokens: 188k I'll be able to do a proper comparison and record some videos when I'm back from holiday but for now, Opus is still the more capable model out of the box and Claude Code is the more complete CLI product. 
It will be interesting to see how Cursor evolve their CLI though with commands and subagents because I think with GPT-5 they have a real shot at providing competition for Claude Code if they can optimise output to get similar quality with less tokens. Jump to 0:40 in the video to see the two apps. Which do you think is which? ;)

Ian Nuttall

194,949 views • 11 months ago

U-Net by hand ✍️ ~ 17 steps walkthrough below

I consider U-Net as a key milestone in deep learning, the first image-to-image model that really worked! It came out of medical imaging, an unusual place, not from NeurIPS or CVPR or ACL. Now it is the backbone of diffusion models, which you see in almost all modern image generation models.

I drew the network as a C so the matrix multiplication flows naturally down. Tilt your head to the right and it is a U again. 🤣

Goal: push a 3 x 16 image down to a 2 x 4 bottleneck and back out again, filling in every cell yourself.

= 1. Given =
An image of three channels, R, G and B, sixteen pixels wide, and every kernel the network will use.

= 2. Convolution 1 =
Let us slide the first kernel over the image. Each output is one multiply-and-add over a 2 x 3 window, and the result is the green feature map.

= 3. Find the maxima =
We circle the largest value in each 1 x 2 window. Circling first is worth the extra step: it is the pooling decision, made before anything is written down.

= 4. Max pool 1 =
Let us copy those maxima down. Sixteen columns become eight, and half the detail is gone for good.

= 5. Convolution 2 =
We convolve again with the second kernel, deeper into the contracting path. The feature map is blue now.

= 6. Find the maxima again =
Same move as step 3, on the blue map.

= 7. Max pool 2 =
Eight columns become four.

= 8. The bottleneck =
Let us convolve once more. This is the bottom of the U, a 2 x 4 block that is everything the network kept.

= 9. Spread it out =
We start back up. The transposed convolution writes each bottleneck value into a wider grid, leaving gaps between them.

= 10. Transposed convolution 1 =
Let us fill those gaps by convolving over the spread-out grid. Four columns become eight.

= 11. The first skip =
We copy the encoder's matching row straight across. This is the skip connection, and it is the whole reason a U-Net can recover detail that pooling threw away.

= 12. Convolution with the skip =
Let us convolve the upsampled features together with the copied ones.

= 13. Spread it out again =
Same as step 9, one level up.

= 14. Transposed convolution 2 =
Eight columns become sixteen, back to the width we started at.

= 15. The second skip =
The encoder's first feature map comes across, the one made before any pooling happened.

= 16. Convolution and ReLU =
We convolve, then cross out every negative and set it to zero.

= 17. Output convolution =
Let us apply the last kernel. Out comes R', G' and B', an image the same size as the one we started with.

The outputs:
R' = [3, 0, 7, 0, 7, 0, 17, 0, 3, 0, 9, 0, 2, 0, 6, 0]
G' = [1, 20, 1, 10, 1, 12, 1, 19, 2, 5, 1, 11, 1, 3, 1, 7]
B' = [4, 20, 8, 10, 8, 12, 18, 19, 5, 5, 10, 11, 3, 3, 7, 7]

Congrats! You just calculated a U-Net by hand. 💾 Save this post!
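To check a few of the hand steps, here is a minimal Python sketch of the three operations that need no kernel: the 1 x 2 max pool (steps 3-4), the spread-out before a transposed convolution (step 9), and the ReLU (step 16). The sample row is made up for illustration and is not taken from the worksheet, the function names are hypothetical, and filling the gaps with zeros after each value is an assumption about how the grid is laid out.

```python
def max_pool_1x2(row):
    """Keep the larger value of each non-overlapping pair of columns."""
    return [max(row[i], row[i + 1]) for i in range(0, len(row) - 1, 2)]


def spread_out(row):
    """Write each value into a grid twice as wide (assumption: gaps are zeros)."""
    wide = []
    for value in row:
        wide.extend([value, 0])
    return wide


def relu(row):
    """Cross out every negative value and set it to zero."""
    return [max(0, value) for value in row]


# Hypothetical 8-column feature-map row.
row = [1, 5, 2, 2, -3, 0, 4, 1]

pooled = max_pool_1x2(row)   # eight columns become four
wide = spread_out(pooled)    # four columns become eight, with gaps
print(pooled)
print(wide)
print(relu(row))
```

The convolutions themselves depend on the kernels given in step 1, so they are left out here.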

Tom Yeh

16,380 views • 3 days ago

Japan just bet $16 BILLION on a chip startup that has never shipped a single chip to a paying customer. It's the biggest Hail Mary in modern tech history. But the real problem is about what happens IF it works...

On April 11th, Japan approved another $4 billion in subsidies for a company called Rapidus. Total government investment: $16.3 billion.

Here's what Rapidus is: Founded in 2022. 4 years old. Zero commercial chips shipped. Zero proven yields. No IPO planned until 2031.

Their stated goal: Produce 2-nanometer chips by 2027 at a facility in Hokkaido.

For context, 2nm is the absolute bleeding edge of semiconductor manufacturing. Only one company on Earth can do it today: TSMC. TSMC alone is spending $50 BILLION in capital expenditure THIS year. Rapidus has $16 billion TOTAL. To catch up to a company that's been at this for 40 years. In 18 months.

Now look at the global picture:

United States: CHIPS Act. $280 billion in subsidies. Musk's Terafab project just announced. Bernstein estimates the REAL cost to hit Terafab's stated targets is $5 TRILLION.
Europe: €43 billion.
China: Entire state apparatus behind SMIC and Huawei. DeepSeek V4 launching on 100% Chinese silicon this month.
Japan: $16.3 billion on Rapidus.
South Korea: Samsung spending tens of billions to reclaim the 2nm lead.

Every major economy on Earth is now treating chip manufacturing as a national security priority. And every single one is building the same thing: Domestic 2nm capacity for AI.

But nobody is asking the obvious question... What happens when ALL of them succeed?

Right now, Nvidia buys every wafer TSMC can make. "Insatiable demand."

Now imagine 2028. TSMC still at full capacity. Samsung has caught up. Intel's 18A is shipping. Rapidus is live. Terafab is online. China's SMIC producing 3nm at scale. Supply has tripled.

Has AI demand tripled? Probably not. Model efficiency is improving faster than compute demand. Google's TurboQuant cut memory requirements by 6X with no accuracy loss. DeepSeek proved you can train frontier models for a fraction of the cost. Smaller models are eating bigger ones.

The current AI chip shortage isn't permanent. It's a temporary demand spike colliding with a slow supply chain. And governments are now committing hundreds of billions of dollars to solve a problem that might not exist by the time their factories come online.

This is how every semiconductor glut in history has started: Governments panic about a shortage. Throw money at capacity. The capacity comes online. Everybody discovers at the same time that demand wasn't what they thought.

In 1996, memory prices collapsed 80% when Korean fabs came online. In 2001, the sector lost $300 billion in value.

The difference in 2026 is that the bets are bigger than any private company has ever been willing to make. Every one of these bets is the same political statement: We refuse to be dependent on Taiwan. Every government on Earth is making it with taxpayer money. And none of them are coordinating.

In 1984, Japan did this exact thing with memory chips. Flooded the market. The entire US memory industry collapsed within 3 years. Then 15 years later, Korea did the same thing to Japan. Then 15 years later, China did it to Korea.

The cycle isn't new. What's new is the SCALE. And the fact that nobody wants to be the country that admits the shortage might be temporary.

So the money keeps flowing. The fabs keep getting built. And somewhere around 2028, the same analysts currently calling AI chips "insatiable" will start writing articles about the great semiconductor glut of the late 2020s.

The question isn't whether Rapidus succeeds. The question is whether ANY of these bets can succeed at the same time. They can't. One of these countries is going to end up holding the bag. Japan is betting $16 billion that it won't be them.

What do you think?

Ricardo

35,151 views • 3 months ago

this video is the CLEAREST explanation of how claude skills + AI agents work and how to use them

most people set up an AI agent and wonder why it keeps disappointing them.

the context window is everything

context is what the model assembles before it takes any action. think of it like everything the agent needs to read before it does anything. the quality of what goes in determines the quality of what comes out. the models are genuinely really good right now. claude and gpt are exceptional. the variable is almost always the context you give them.

1. agent.md files are mostly unnecessary

every single line you put in an agent.md file gets added to every single conversation you have with your agent. a 1000 line file is around 7000 tokens burning on every run. the model already knows to use react. it can read your codebase. save the agent.md for proprietary information specific to your company that the model genuinely cannot know on its own.

2. skills are the actual unlock

a skill.md file works differently. what loads into context is only the name and description, around 50 tokens. the full instructions only appear when the agent recognizes it needs that skill. so instead of 7000 tokens on every run you have 50. and the agent stays sharp because the context window stays lean. the closer you get to filling the context window the worse the agent performs, same way you perform worse when someone dumps 10 things on you at once.

3. here is how to actually build a skill the right way

most people identify a workflow and immediately try to write the skill. what you want to do instead is run the workflow by hand with the agent first. walk it through every single step. tell it what to check, what good looks like, what bad looks like. correct it in real time. once you have had a full successful run from start to finish, tell the agent to review everything it just did and write the skill itself. it writes a better skill than you will because it has the full context of what actually worked in practice not in theory.

4. recursively building skills is how you go from frustrated to reliable

when the skill breaks, and it will break, ask the agent exactly why it failed. it will tell you specifically what went wrong. fix it together in that same conversation. then tell it to update the skill file so that failure mode never happens again. ross mike did this five times with his youtube report generator. it now pulls from eight different data sources and runs flawlessly every single time without him touching it.

5. sub agents are something you earn not something you set up on day one

start with one agent. build one workflow. turn it into one skill. once that works add another. ross mike has five sub agents now covering marketing, business, personal and more. it took months to get there and every single one exists because a workflow proved it deserved to exist. the people who set up 15 sub agents on day one and wonder why nothing works skipped all the steps that make the thing actually run.

6. your workflow is the thing the model cannot get anywhere else

the model has been trained on everything. it knows more than you about most things. what it does not have is your specific process, your taste, your way of doing things. that is what skills capture. that is what makes your agent actually useful versus a generic one. downloading someone else's skill means downloading their context onto your setup and it will not work the way you want it to because it was never built around how you work.

this is the clearest explanation of how agents actually work i have heard. Micky runs this stuff every single day and the results show it.

full episode is now live on The Startup Ideas Podcast (SIP) 🧃 where you get your pods

people charge for this sorta stuff
i give away the sauce for free
i just want you to win

watch
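
The token arithmetic in points 1 and 2 of the post can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 7000 and 50 token figures are the post's own estimates, while the `Skill` class, the `context_cost` helper, the skill names, and the body sizes are hypothetical and not part of any real agent framework.

```python
from dataclasses import dataclass

# Figures quoted in the post (rough estimates, not measurements).
AGENT_MD_TOKENS = 7000   # ~1000-line agent.md, loaded on every run
SKILL_STUB_TOKENS = 50   # a skill's name + description only


@dataclass
class Skill:
    """Hypothetical stand-in for a skill file."""
    name: str
    description: str
    body_tokens: int  # size of the full instructions (assumed value)


def context_cost(skills, needed, stub_tokens=SKILL_STUB_TOKENS):
    """Tokens spent on skills in one run.

    Every skill contributes its short stub; the full body is counted
    only for skills the agent actually needs on this run.
    """
    stubs = stub_tokens * len(skills)
    bodies = sum(s.body_tokens for s in skills if s.name in needed)
    return stubs + bodies


# Hypothetical skills with made-up body sizes.
skills = [
    Skill("youtube-report", "Build the weekly YouTube report", 1200),
    Skill("invoice-check", "Validate supplier invoices", 900),
]

idle_run = context_cost(skills, needed=set())
active_run = context_cost(skills, needed={"youtube-report"})

print(f"agent.md on every run: {AGENT_MD_TOKENS} tokens")
print(f"two skills, none used: {idle_run} tokens")
print(f"two skills, one used:  {active_run} tokens")
```

Under these assumed sizes, even a run that loads one full skill body stays well below the fixed cost of the large always-loaded file, which is the point the post is making about keeping the context window lean.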

GREG ISENBERG

193,219 views • 3 months ago

Three days ago I asked myself a dumb question. It was so stupid I was actually ashamed to Google it. Can AI earn money while I sleep? Not saving time. Not automating routine. I mean putting real money into my account while I am not looking at the screen.

Everyone says ClawdBot will change how we work. Automation. Task management. Smart replies. But I was sitting in my kitchen thinking about something else entirely. You know that feeling when you look at a tool and realize everyone is using only 1% of its potential? It is like being given a race car and only using it to drive to the store for bread.

I decided to test it. I started a notebook. I record everything.

> Day One

I started with something simple. I gave Clawdbot a task. Find wallets on Polymarket where the numbers do not add up. Where the profit is too high for the win rate. Where the result smells like a system rather than luck.

It thought for 14 minutes. I had time to pour a coffee and forget about it. Then the screen flashed. 4 addresses.

I scrolled through the first three in a minute. Big bets on politics. They guessed the election. Classic. On the fourth one I stopped. Not because it was the most profitable but because I did not understand what I was looking at. The wallet was not trading politics or sports or anything people write reviews about. It was trading the weather.

I read it three times. Weather. Will it be 9 degrees in London tomorrow? Will it rain in Tokyo? These are markets I would not even click on by accident. Then I looked at the numbers.

> It started with $27. It is now at $63,853.

$27 is two trips to McDonald's. It is nothing. $63,853 is a new car or a down payment on an apartment. It is two years of someone's salary. Between those two numbers was only one thing. Thousands of bets on rain.

I closed the tab. Opened it again. Checked if it was a glitch. Real dollars. On markets that look like a bad joke.

> Day Two

I could not get that wallet out of my head. I went to look at its transaction history. I expected to find one big win that explained everything. A lucky hurricane forecast. Instead I saw thousands of small bets. Boring. "Will the temperature in New York be above 15 degrees?"

Then I noticed the detail that finally broke my brain. Its win rate: 33%. It loses more often than it wins. 2 out of 3 bets go to zero. Any normal person with that result would be posting about how the market is unfair. Yet this wallet is sitting on $63,000 in profit. How?

I started deconstructing the trades. After an hour I got it. When it loses, it loses 10 or 20 cents. When it wins, it takes $1.00. Loses 9 times in a row? Lost $1.80. Wins 1 time? Got $10.00.

> This is not trading. It is math that works as long as you do not interfere with your emotions.

Here is how it works. Weather is one of the most predictable things on the planet. Governments invest billions in satellites. Data is updated every 2 or 3 hours. Precision to a tenth of a degree. This data is public. But Polymarket is not a weather station. It updates its markets with a delay of 6 or 8 hours.

Imagine the situation. 6 AM. The weather service updated the forecast. The probability that London reaches 9 degrees tomorrow rose to 80%. Algorithms everywhere already recalculated the data. But on Polymarket the YES button is still sitting there for 10 cents. Because the market has not woken up yet.

This bot sees the difference. It buys YES for 10 cents when the real probability is already 80%. It is not guessing. It is buying what is essentially already known. It just waits a day and collects the dollar. 10 cents turn into a dollar. On information available to anyone who can read weather APIs.

That evening I called a friend. He has been trading for 3 years. He sits in analytical chats. Draws support levels. I asked him: "How was the last month?"

"I broke even. The market is tough right now. Too much noise."

I looked at the screen. A bot betting on rain with a 33% win rate. Profit: $63,853. My friend with 3 years of experience and hundreds of hours of analysis. Profit: $0. Who is doing it wrong?

I am not asking you to take my word for it. The blockchain does not lie:

> Day Three

I decided to dig deeper. I looked at the wallet description. I expected something complex. A hedge fund. A team of developers. Secret data sources. I found one line: Claude plus public weather APIs.

Ordinary Claude. The one on your phone. Connected to free weather services. No secret stations. No insiders. No millions for infrastructure. Just an AI doing what any of us could do. But we are too lazy. Or bored. Or we think it is too simple to work.

If someone already built this with basic Claude and free APIs... What happens when Clawdbot gets direct access to trading?

> Day Four

I watched the wallet in real time. First bet: loss. Second bet: loss. Third bet: loss. I thought: this is it. The statistics are collapsing. Fourth bet: loss. Fifth bet: loss. Down $12 in an hour. I was ready to write a post about how I overestimated this.

Sixth bet: Temperature in Chicago. Win. +$87. Seventh bet: Win. +$94.

By evening: 9 losses. 5 wins. Daily total: +$385. No emotions. No posts about injustice. No strategy changes after a loss. Just the next bet.

I wrote to my friend. The one who has been trading for 3 years. "How was your day?"

"Down $200. Market makers caught my stop loss again."

I looked at the screen. A bot with no posts and no loud claims. +$385 for the day on rain bets. My friend with 3 years of experience and dozens of books. Minus $200 and a post about how the system is against him.

> Day Five

I woke up with a thought that kept me up all night. It finally hit me. It is not about the weather. It is not about APIs. It is not that the bot is "smarter".

> It is about what the bot does NOT have: an ego that hates being wrong. No urge to revenge-trade. No boredom from repetition.

My friend trades against the market. He tries to be smarter than the crowd. This bot trades against human nature. And nature loses every day.

Clawdbot found me this wallet in 14 minutes. The weather bot turned $27 into $63,000 on markets everyone else thinks are trash. Both use the same principle. Do something simple. Remove emotions. Repeat.

I do not know when Clawdbot will start trading on its own. Maybe in a month. Maybe in a year. But I know one thing. While we discuss if it is possible... Someone already set up their bot and went to live their life.

Right now as you read this. Somewhere a weather service updated a forecast. Polymarket is sleeping. The bot is already entering a position. And my friend is writing a post about how market makers do not let honest people earn.

Guess who wakes up tomorrow with money in their account?
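
The Day Two claim, that a 33% win rate can still be profitable when shares are bought cheaply, comes down to a one-line expected-value calculation. The sketch below uses only the figures quoted in the post (a share costing 10 or 20 cents, a $1.00 payout, a 33% win rate); the function names are made up for illustration, and it assumes no fees, no slippage, and that the quoted win rate holds, none of which the post verifies.

```python
def expected_profit(price, win_prob, payout=1.0):
    """Expected profit per share.

    With probability win_prob the share pays `payout` (profit: payout - price);
    otherwise the stake `price` is lost.
    """
    return win_prob * (payout - price) - (1.0 - win_prob) * price


def break_even_win_rate(price, payout=1.0):
    """Win rate at which the expected profit is exactly zero."""
    return price / payout


# Figures taken from the post; treat them as the author's claims.
for price in (0.10, 0.20):
    ev = expected_profit(price, win_prob=0.33)
    needed = break_even_win_rate(price)
    print(f"price ${price:.2f}: EV per share ${ev:+.2f}, "
          f"break-even win rate {needed:.0%}")
```

Under these assumptions the bet only has to win more often than the share price implies (10% or 20% of the time), so losing two bets out of three is compatible with a positive expectation. It does not show that such mispriced markets exist or can be traded at size.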

Blaze

29,808 views • 6 months ago