But the KV cache is created separately for each transformer layer. By sending each layer’s KV cache as soon as it is computed, we overlap communication with computation: the KV cache streams out while later layers are still computing, which hides the network delay. We achieve a 4x speedup in prefill and 3x in decode, with 0 network delay.
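The overlap described above can be illustrated with a small timeline model. This is only a sketch under stated assumptions, not EXO Labs' implementation: the function names and the per-layer timings are made up, and compute and network are modeled as two independent resources.

```python
from typing import List


def send_after_all_layers(compute: List[float], send: List[float]) -> float:
    """Finish time when the whole KV cache is sent only after the last layer."""
    return sum(compute) + sum(send)


def send_per_layer(compute: List[float], send: List[float]) -> float:
    """Finish time when each layer's KV cache is sent as soon as it exists.

    Layer i's transfer can run while layer i+1 is still being computed.
    """
    compute_done = 0.0  # time at which the current layer's KV cache is ready
    link_free = 0.0     # time at which the network link becomes idle again
    for c, s in zip(compute, send):
        compute_done += c
        # A transfer starts once its data exists AND the link is free.
        link_free = max(compute_done, link_free) + s
    return link_free


# Hypothetical timings in milliseconds for a 4-layer model.
compute_ms = [10.0, 10.0, 10.0, 10.0]
send_ms = [6.0, 6.0, 6.0, 6.0]

baseline = send_after_all_layers(compute_ms, send_ms)
streamed = send_per_layer(compute_ms, send_ms)
print(f"send at the end: {baseline} ms, streamed per layer: {streamed} ms")
```

In this toy model only the last layer's transfer stays exposed; all earlier transfers are hidden behind the compute of the following layers, provided each transfer is no slower than one layer of compute.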

EXO Labs

52,413 subscribers

50,617 views • 9 months ago • via X (Twitter)

Education, Science & Technology

Related videos

Day 11/90 of Inference Engineering
How does vLLM work and how is it used in production?
Before we discuss how vLLM works internally, it helps to understand what vLLM is. At a high level, vLLM is an inference engine designed to serve LLMs to thousands of concurrent users efficiently while managing scarce compute and memory. The goal of vLLM is to maximize throughput and minimize latency, optimizing for the best inference economics and experience for end users.
Every request from an end user eventually ends up in the engine core, where it gets scheduled alongside requests from other concurrent users, executes on the GPU, updates the KV cache with the new key and value vectors, and streams tokens back to the user.
The Scheduler decides which requests should execute next while continuously batching requests together to maximize GPU utilization. Continuous batching is an inference optimization that allows new requests to join a running batch as other requests finish generating tokens. This keeps GPU utilization high instead of letting the GPU sit idle waiting for an entire batch to finish generating.
After the Scheduler dispatches the selected batch to the Model Executor, the Model Executor prepares the tensors and metadata required for inference, retrieves each request’s block table from the KV Cache Manager, launches the optimized transformer forward pass on the GPU, computes the logits, updates the KV cache with the new key and value vectors, and finally returns the results for sampling and streaming.
The KV Cache Manager uses the PagedAttention memory layout to allocate fixed-size cache blocks on demand and maintains a Free Block Queue on the CPU that tracks which blocks in the GPU’s Paged KV Cache are currently free. When a request needs additional KV cache space, the KV Cache Manager takes a free block from the queue and assigns it to that request, thus avoiding an expensive search through GPU memory for available cache blocks.
All of these components form the core of vLLM’s inference engine. The Scheduler determines what requests are executed, the Model Executor determines how those requests are executed, and the KV Cache Manager determines where each request’s KV cache lives using the PagedAttention memory layout. This architecture enables vLLM to serve thousands of concurrent requests with high throughput, low latency, and efficient GPU memory utilization.
Here's a little animation that visualizes everything!
- I've also completed the forward pass for my mnist.c project. I had a nice chat with shrey birmiwal, such a knowledgeable guy. Excited to learn more about vLLM and implement a tiny-vLLM one day.
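The free block queue and per-request block tables described above can be sketched in a few lines. This is a toy model, not vLLM's code: `BlockAllocator`, its methods, and the block counts are hypothetical names and numbers chosen for illustration.

```python
from collections import deque
from typing import Deque, Dict, List


class BlockAllocator:
    """Toy KV cache manager: a free-block queue plus per-request block tables."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self.free_blocks: Deque[int] = deque(range(num_blocks))
        self.block_tables: Dict[str, List[int]] = {}
        self.num_tokens: Dict[str, int] = {}

    def append_tokens(self, request_id: str, n: int) -> List[int]:
        """Reserve room for n more tokens; return the request's block table."""
        table = self.block_tables.setdefault(request_id, [])
        total = self.num_tokens.get(request_id, 0) + n
        needed = -(-total // self.block_size) - len(table)  # ceil division
        if needed > len(self.free_blocks):
            raise MemoryError("not enough free KV cache blocks")
        for _ in range(needed):
            # Popping from the queue is O(1); no scan over GPU memory.
            table.append(self.free_blocks.popleft())
        self.num_tokens[request_id] = total
        return table

    def free(self, request_id: str) -> None:
        """Return a finished request's blocks to the free queue."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.num_tokens.pop(request_id, None)


alloc = BlockAllocator(num_blocks=8, block_size=4)
print(alloc.append_tokens("req-a", 6))  # 6 tokens need 2 blocks
print(alloc.append_tokens("req-b", 4))  # 4 tokens need 1 block
print(alloc.append_tokens("req-a", 3))  # 9 tokens in total need a 3rd block
alloc.free("req-a")
print(len(alloc.free_blocks), "blocks free after req-a finishes")
```

A request's blocks need not be contiguous, which is why the block table is a list of block ids rather than a start offset and a length.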

max fu

69,880 views • 13 days ago

🎥 Video generation is hitting the memory wall.
As videos get longer, the KV cache quietly explodes, and long-horizon consistency starts to break.
We built Quant VideoGen: a training-free KV cache compression method for auto-regressive video diffusion. Instead of storing every KV in high precision, QVG exploits video’s spatiotemporal redundancy with semantic-aware smoothing + progressive residual quantization.
🚀 Up to 7× KV memory reduction
⚡ <4% overhead
✅ Strong long-video quality
🕹️ Deploy HYWorldPlay on your own RTX 5090 locally
KV compression is becoming a core scaling primitive, not just for LLMs, but for video generation too.
Paper: Code: (1/5)
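To give a feel for what residual quantization means in general, here is a minimal sketch. It is not the QVG algorithm: it uses plain symmetric uniform quantization, leaves out the semantic-aware smoothing entirely, and all function names and sample values are made up for illustration.

```python
from typing import List, Tuple


def quantize(values: List[float], bits: int) -> Tuple[List[int], float]:
    """Symmetric uniform quantization: return integer codes and the scale."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak > 0 else 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize(codes: List[int], scale: float) -> List[float]:
    return [c * scale for c in codes]


def residual_quantize(values: List[float], bits: int, stages: int) -> List[float]:
    """Quantize, then repeatedly quantize what the previous stages missed."""
    approx = [0.0] * len(values)
    for _ in range(stages):
        residual = [v - a for v, a in zip(values, approx)]
        codes, scale = quantize(residual, bits)
        approx = [a + d for a, d in zip(approx, dequantize(codes, scale))]
    return approx


def max_error(values: List[float], approx: List[float]) -> float:
    return max(abs(v - a) for v, a in zip(values, approx))


kv = [0.91, -0.42, 0.07, 0.33, -0.88, 0.15, 0.60, -0.05]  # made-up values
one_stage = residual_quantize(kv, bits=3, stages=1)
two_stage = residual_quantize(kv, bits=3, stages=2)
print("max error, 1 stage :", max_error(kv, one_stage))
print("max error, 2 stages:", max_error(kv, two_stage))
```

Each extra stage spends more bits on the part the earlier stages got wrong, so the reconstruction error shrinks stage by stage at the cost of storing one more set of codes and one more scale.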

Haocheng Xi

65,008 views • 3 months ago

🚀 Self-speculation brings 6.75x real speedup for LLM generation with SGLang inference!
The same model drafts future tokens in Diffusion mode → then verifies them in AR (causal) mode. One model and one KV cache. Just different attention masks.
Thanks to perfect alignment, we get 2× longer acceptance lengths than MTP techniques (Eagle-3, MTP, dFlash). We run 2 forward passes… but the 2× higher acceptance means we break even, and with zero overhead from the extra drafter, KV cache, or LM head that comes with MTP. Those are not free.
Last week we released Nemotron-Labs-Diffusion + Tri-mode LLMs! We did continued pre-training on Ministral-3 models by switching attention patterns (block causal bidirectional). Result: one model that runs AR mode, Diffusion mode, and Self-Speculation.
Diffusion mode already shows high benchmark accuracy. Excited to see what happens when someone beats left-to-right acceptance! 🔥
Github: Paper: SGLang inference: Try the models on HF:
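The "draft, then verify" step and the notion of acceptance length can be sketched with generic greedy speculative verification. This is an assumption-laden illustration, not the Nemotron-Labs-Diffusion or SGLang code: `verify_draft` and `toy_verifier` are hypothetical, and a real system would check all drafted positions in a single causal forward pass instead of calling the verifier once per token.

```python
from typing import Callable, List, Sequence


def verify_draft(
    prefix: Sequence[int],
    draft: Sequence[int],
    next_token: Callable[[Sequence[int]], int],
) -> List[int]:
    """Greedy verification: keep drafted tokens while the verifier agrees.

    At the first disagreement the verifier's own token is emitted instead,
    so every call yields at least one token.
    """
    accepted: List[int] = []
    context = list(prefix)
    for token in draft:
        expected = next_token(context)
        if token != expected:
            accepted.append(expected)  # replace the rejected draft token
            return accepted
        accepted.append(token)
        context.append(token)
    accepted.append(next_token(context))  # bonus token after a full match
    return accepted


def toy_verifier(context: Sequence[int]) -> int:
    """Stand-in for the AR pass: the next token is last token + 1, mod 10."""
    return (context[-1] + 1) % 10


print(verify_draft([1, 2, 3], [4, 5, 9, 7], toy_verifier))  # draft wrong at 3rd token
print(verify_draft([1, 2, 3], [4, 5], toy_verifier))        # draft fully accepted
```

The longer the accepted run per verification pass, the more tokens each pair of forward passes produces, which is the quantity the post refers to as acceptance length.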

Pavlo Molchanov

66,554 views • 2 months ago

Gemma 4 12B QAT (dense) achieves 1000+ tokens/sec prefill on 8GB VRAM with 120k context
Gemma 4 12B QAT (dense), TurboQuant (without MTP), RTX 4060 8GB VRAM:
Prefill: 1000+ tok/s (42% increase)
Decode: 25+ tok/s (25% increase)
Context: 120k (150% increase)
Without TurboQuant, prefill was 700 tok/s and decode 20 tok/s with only 48k context (older test with MTP, link in the comments).
llama.cpp TurboQuant flags:
-m gemma-4-12B-it-qat-UD-Q4_K_XL.gguf -c 120000 --cache-type-k q8_0 --cache-type-v turbo3 -ngl 99 --port 8080
Tested with a 27k prompt, 120k context loaded.
-ngl 99 here isn't a typo: full 12B dense, every layer on GPU, on an 8GB card. That's the part worth sitting with. The model has vision, audio input, thinking/reasoning and fits your 8GB card. TurboQuant's KV cache savings are what free up the room to do that at 120k context.
Side by side with yesterday: the 26B A4B MoE got 320+ tok/s prefill. This dense 12B is clearing 1000+.
Rig: RTX 4060 8GB · i7H · 16GB RAM
Same two flags as yesterday, different model size: --cache-type-k q8_0 --cache-type-v turbo3
Thanks to TheTom/llama-cpp-turboquant, the TurboQuant fork of llama.cpp by Tom Turney, for making this work. Links to Unsloth's model quant on Hugging Face and to the llama.cpp fork on GitHub are in the comments.
Do you prefer a dense or a MoE model for your 8GB card?
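The percentage gains quoted in the post follow directly from its before/after numbers. A quick check, using only the figures stated above (the helper name is made up, and the "+" in "1000+" and "25+" is ignored):

```python
def percent_increase(before: float, after: float) -> float:
    """Relative gain of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


prefill_gain = percent_increase(700, 1000)        # tok/s, before vs. after
decode_gain = percent_increase(20, 25)            # tok/s, before vs. after
context_gain = percent_increase(48_000, 120_000)  # tokens of context

print(f"prefill: +{prefill_gain:.1f}%")
print(f"decode:  +{decode_gain:.1f}%")
print(f"context: +{context_gain:.1f}%")
```

The prefill figure comes out slightly above 42% because the post rounds down, while the decode and context figures match exactly.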

Alok

34,500 views • 1 month ago

Robots can now reconstruct 3D scenes in real time from a single RGB camera. [📍 Projects page + paper]
No depth sensor. No retraining. 30 FPS.
Researchers at Imperial College London introduced KV-Tracker, a training-free method that makes heavy models like π³ and Depth Anything 3 fast enough for real-time tracking.
The idea is simple. These models use global self-attention, which is powerful but computationally expensive. KV-Tracker caches the key and value pairs from selected keyframes and reuses them for new frames. That cache becomes an implicit scene representation.
Result:
• Up to 30 FPS
• 10 to 15x speedup
• Accurate 6-DoF tracking on benchmarks like TUM RGB-D and 7-Scenes
• Works with monocular RGB only
It also supports object-level tracking with masks and allows saving the KV cache for later reuse.
For robotics, this reduces hardware constraints and moves real-time 3D perception closer to practical deployment.
Credit to Marwan Taher at Imperial’s Dyson Robotics Lab and many others who contributed to this!
📍 Save projects page + paper for later: Video:
If it matters in AI or Robotics you'll read it here first:

Ilir Aliu

53,911 views • 3 months ago

🌐 The #Streamr Network 1.0 mainnet is GO for launch in 48 hours! 🚀 After 6+ years in research & development, get ready for a fully decentralized, feature-complete data broadcasting network, powered by its users. We hope you’re as excited as we are! 🧡

Streamr Network

17,923 views • 2 years ago

gemma-4-12B-agentic-fable5-composer2.5 V2 is out: the agentic upgrade to the model trained on Fable 5's reasoning.
Running it now with TurboQuant llama.cpp on a single RTX 4060 (8 GB VRAM) at 30 tokens/second with full 25000 context and reasoning.
# The benchmarks
V2 is built for coding + agentic work: writing code, running commands, using tools, debugging, multi-step technical tasks. The clearest signal is tau2 bench telecom, an agentic tool-use benchmark whose diagnose → fix → verify loop mirrors real terminal/debugging work.
tau2 bench telecom numbers (self-reported):
base Gemma 4 12B: ~15%
this finetune: ~55%
That's a huge jump.
# TheTom/llama-cpp-turboquant flags
llama-server.exe -m gemma4-v2-Q4_K_M.gguf -ngl 99 -c 25000 --cache-type-k q8_0 --cache-type-v turbo3 --port 8080
Flag breakdown:
-ngl 99 → full GPU offload
-c 25000 → 25K context
--cache-type-k q8_0 --cache-type-v turbo3 → mixed-precision KV cache: K at 8-bit, V at ~3-bit via TurboQuant (Walsh Hadamard rotated polar quant, Google's own KV-compression research). Not even merged into mainline llama.cpp, so this runs off a fork.
No API. No cloud. Just llama.cpp (well, a fork of it) and any 6GB+ GPU.
If you tried yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF, check this out and share your experience with the models.

Alok

145,356 views • 1 month ago

#WATCH | Delhi | After attending the dinner hosted by Rajya Sabha LoP Mallikarjun Kharge for the INDIA bloc leaders, Shiv Sena (UBT) MP Priyanka Chaturvedi says, "We socialised with each other keeping politics aside, and talked about family matters. We met each other in a good atmosphere. The unity of the INDIA Alliance is in front of you all."

ANI

58,515 views • 11 months ago

Day 12/90 of Inference Engineering
What is chunked prefill within vLLM?
In continuation of yesterday's post on the high-level architecture of vLLM, I want to dive deeper into the vLLM core engine, starting with the mechanics of chunked prefill. In this post, I will closely follow the original blog on the anatomy of vLLM.
To start, let's define chunked prefill. It's a runtime inference optimization that splits a long input request so that it doesn’t monopolize the whole GPU. Keep in mind this is all within the context of vLLM. Since vLLM is an inference engine meant to serve a model to multiple concurrent users, a GPU that is fully monopolized by a single user's request means other users' requests sit in the queue waiting to be processed. It isn’t good to have the whole GPU occupied by a single request when the GPU is meant to be shared!
So the key idea behind chunked prefill is to break the long request into smaller chunks, so that each chunk gets processed and written into the KV cache together with other users' requests.
Suppose we split the long request into chunks of 8 tokens each, and each memory block can hold 4 tokens. Each chunk of 8 tokens therefore fits into 2 blocks of memory. After the first forward pass, 2 blocks are occupied; after the second forward pass, 4 blocks are occupied; and so forth. Each forward pass handles a small chunk of the long request so that there's room in the same pass to keep serving other users' requests.
Here's a small animation that I made today to visualize the idea behind chunked prefill while learning this topic~
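The block arithmetic in the example above (8-token chunks, 4-token blocks) can be written out as a short sketch. The function name and the 20-token prompt length are assumptions for illustration; this only models the accounting, not vLLM's scheduler.

```python
from typing import Iterator, Tuple


def chunked_prefill(
    prompt_len: int, chunk_size: int, block_size: int
) -> Iterator[Tuple[int, int, int]]:
    """Yield (forward pass number, tokens cached so far, KV blocks occupied)."""
    cached = 0
    step = 0
    while cached < prompt_len:
        step += 1
        # The final chunk may be shorter than chunk_size.
        cached += min(chunk_size, prompt_len - cached)
        blocks = -(-cached // block_size)  # ceil division
        yield step, cached, blocks


# Hypothetical 20-token prompt, 8-token chunks, 4 tokens per KV block.
for step, cached, blocks in chunked_prefill(prompt_len=20, chunk_size=8, block_size=4):
    print(f"pass {step}: {cached} tokens cached, {blocks} blocks occupied")
```

The first two passes reproduce the 2-block and 4-block counts from the post; the third pass handles the 4 leftover tokens and occupies one more block.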

max fu

29,111 views • 13 days ago

Today, we’re excited to announce a partnership with Mind Network. Both Nesa and Mind Network specialize in decentralized security, and we share a mission to bring this tech to crypto AI. Together we will explore a close technological collaboration, sharing components of our stacks with one another. Mind Network will also be using Nesa for its AI inference. Mind Network is the first FHE Restaking Layer for AI, focused on enhancing security at the consensus and validator levels of AI networks. Nesa is the Layer-1 blockchain for AI, specializing in private inference and building infrastructure to make it easy for any application, protocol, and smart contract to fuse with AI. Look out for our AMA together and other activations soon.

Nesa

101,923 views • 2 years ago

“We are each other’s light, each other’s gold, each other’s hope, forever fragile, and forever valiant, bound by love we will outlive the stars…” xx Call the Midwife will return with a new Christmas Special and Series 15 in 2026! xx BBC One #CallTheMidwife

Call the Midwife

44,111 views • 1 year ago

TODAY’S THE DAY! Our new bus network is OFFICIALLY here! Thank you to our team members, partners, local leaders, & most of all, our customers for going on this ride with us. It’s been a labor of love - years in the making - & we can’t wait for you to get to know the network 🎉

Metro Forward

17,449 views • 1 year ago

As farmers we live and work around wild animals all the time And most of the time we live in harmony with each other It’s #FarmerFriday so let’s see those wildlife pics

John Kowalchuk🧢

55,747 views • 2 years ago

Juan Soto is asked to describe his relationship with Francisco Lindor: "It's a great relationship - we talk all the time in the games, we help each other"

SNY Mets

87,589 views • 5 months ago

You're paying $20/mo for something your $500 GPU can already do.
Gemma 4 26B A4B QAT MoE + Hermes Agent running on a single RTX 4060 (8GB VRAM).
Built a vision-capable, 100% free, 100% local, private AI assistant that lives in my Chrome browser. No API keys. No cloud. No subscriptions. 100% vibe coded. 0% handholding.
It has full context of whatever's on my screen and can answer questions, summarize pages, extract data, and see images. The same local model handles everything, with no external calls, ever.
Keep reading for the model and Hermes Agent tips I learnt while building this locally.
Here's the exact setup for anyone running local LLMs on 6-8 GB VRAM.
llama.cpp server flags (on my NVIDIA RTX 4060 8GB VRAM):
-m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf --cache-type-k q8_0 --cache-type-v q8_0 -c 150000 --port 8080
Throughput with quantization:
Prefill: 200-250 tokens/sec
Decode: 20-25 tokens/sec
Reduce context if you hit OOM on a 6 GB VRAM card.
Key learnings:
- Quantize the KV cache to q8 for faster prefill/decode. Prefill goes from 100-150 tok/s (unquantized) to 200-250 tok/s (q8).
- But watch out: once actual context grows past ~50k tokens on high-entropy workloads, q8 KV quantization can cause hallucinations. Low-entropy workloads are mostly unaffected. If you see it happening, drop the quantization. This is common across all local models.
- In Hermes Agent settings -> Memory & Context, bump the compression threshold from the default 0.5 to 0.7. The default triggers context compression far too frequently and eats time.
Up next: add persistent memory, web search, tool calling, streaming output and whatever you suggest.
Running a 26B MoE with vision + a 150k context window on 8GB VRAM would've sounded impossible 6 months ago. Works the same on the NVIDIA RTX 3060 Ti, 3070, 4060 Ti, 5060, 2080, or any 8GB card. VRAM is the only requirement.
Local AI agents are closer than people think. You just need to know where the knobs are.
The model's Unsloth quant Hugging Face link is in the comments.
Have you tried Hermes Agent by Nous Research yet? What are you building with local LLMs? Drop it below, let's see what this community is shipping.

Alok

36,031 views • 28 days ago

6,000 → over 12,000 agent-to-human pairings in the Billions Network In 1 week we doubled the number - rapid growth after our launch 3 weeks ago! Get your AI agent paired with your Billions App profile, Earn Rewards with your OpenClaw agent, and Grow the Billions Network of Humans and AI agents Source: Billions Network Blockchain Explorer

Billions

70,209 views • 4 months ago

Batch Normalization by hand ✍️ ~ 7 steps walkthrough below
Batch normalization is common practice for improving training and achieving faster convergence. It sounds simple. But it is often misunderstood.
🤔 Does batch normalization involve trainable parameters, tunable hyper-parameters, or both?
🤔 Is batch normalization applied to inputs, features, weights, biases, or outputs?
🤔 How is batch normalization different from layer normalization?
So I drew and calculated one entirely by hand.
Goal: normalize a mini-batch of 4 examples to mean 0 and variance 1, then let the network scale it back.
= 1. Given =
A mini-batch of 4 training examples, each with 3 features.
= 2. Linear layer =
Let us multiply by the weights and add the biases. Batch norm sits after this, which answers the second question: what gets normalized is features, not inputs, weights or biases.
= 3. ReLU =
We apply the activation, and -2 becomes 0. Negative values are suppressed before any statistic is taken.
= 4. Batch statistics =
Let us compute the sum, mean, variance and standard deviation, one row at a time. A row is a feature and the four columns are the four examples, so every number here measures one feature against the rest of the batch. That is the "batch" in batch normalization, and it is exactly what layer normalization does not do. The statistics are rounded to whole numbers, which is what keeps the rest of the page doable in pen.
= 5. Shift to mean 0 =
We subtract the mean, in green. The four values in each feature now average to zero.
= 6. Scale to variance 1 =
Let us divide by the standard deviation, in orange. Each feature now has variance one, whatever scale it arrived at.
= 7. Scale and shift =
We multiply by a linear transformation and pass the result on. The diagonal and the last column are trainable, so having just forced every feature to mean 0 and variance 1, we hand the network the means to undo it.
The outputs:
Mean of each feature = [2, 1, 2]
Std dev of each feature = [1, 1, 2]
To the next layer = [2, -2, 2, 0], [-3, 3, 6, -3], [2, 0, 1, 2]
The answers:
🤔 Both. The scale and shift are trainable, the statistics are not. Epsilon and the momentum on the running statistics are the hyper-parameters, and one mini-batch by hand needs neither.
🤔 Features, after the linear layer, not inputs, weights or biases.
🤔 Batch norm measures across the batch, one feature at a time. Layer norm measures across the features, one example at a time.
💾 Save this post!
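Steps 4 to 7 and the contrast with layer normalization can be sketched in plain Python. The input matrix, `gamma`, and `beta` below are made-up values, not the numbers from the hand-drawn walkthrough, and unlike the walkthrough nothing is rounded. The layout follows the post: a row is a feature, a column is an example.

```python
import math
from typing import List

Matrix = List[List[float]]  # rows = features, columns = examples


def normalize_rows(x: Matrix, eps: float) -> Matrix:
    """Shift each row to mean 0 and scale it to (almost exactly) variance 1."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)  # population variance
        std = math.sqrt(var + eps)
        out.append([(v - mean) / std for v in row])
    return out


def transpose(x: Matrix) -> Matrix:
    return [list(col) for col in zip(*x)]


def batch_norm(x: Matrix, gamma: List[float], beta: List[float],
               eps: float = 1e-5) -> Matrix:
    """Normalize each feature across the batch, then apply the trainable
    per-feature scale (gamma) and shift (beta)."""
    normed = normalize_rows(x, eps)
    return [[g * v + b for v in row] for row, g, b in zip(normed, gamma, beta)]


def layer_norm(x: Matrix, eps: float = 1e-5) -> Matrix:
    """Normalize each example across its features (no scale/shift here)."""
    return transpose(normalize_rows(transpose(x), eps))


# Hypothetical post-ReLU activations: 3 features x 4 examples.
features = [
    [1.0, 3.0, 5.0, 7.0],
    [0.0, 2.0, 0.0, 2.0],
    [4.0, 0.0, 0.0, 4.0],
]
gamma = [2.0, 3.0, 1.0]  # trainable scale (made-up values)
beta = [0.0, 0.0, 1.0]   # trainable shift (made-up values)

for row in batch_norm(features, gamma, beta):
    print([round(v, 3) for v in row])
```

Batch norm and layer norm share the same normalization routine here; the only difference is the axis it runs along, which is why `layer_norm` is just a transpose around `normalize_rows`.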

Tom Yeh

20,518 views • 9 days ago

A few people asked if we show videos. No, we do not! It’s all realtime animation, created with code, running in the browser. We run CHROME to check if 1 pixel in the software is *exactly* 1 pixel on the hardware and if we get the right framerate.

Leander Herzog

41,719 views • 2 years ago

8 years ago, we became the first network built with and for first responders. With the largest footprint in the country, our mission to empower responders is stronger than ever. See how: FirstNet Authority

FirstNet, Built with AT&T

12,491 views • 1 year ago

Building SPA-like experiences with Next.js

With cacheComponents, the data behind a page can persist across navigations, so a revisit skips the loading fallback. Mark the read with 'use cache' so its result is cached instead of re-queried on each render. That cache also persists in the browser, so revisiting the page reuses it without hitting the server again.

Live demo, source code, and docs below ↓

Aurora Scharff

26,863 views • 9 days ago