Video yükleniyor...

Video Yüklenemedi

Bu video yüklenirken bir sorun oluştu. Bu geçici bir ağ sorunundan kaynaklanıyor olabilir veya video kullanılamıyor olabilir.

Ana Sayfaya Dön

Holy shit... Microsoft open sourced an inference framework that runs a 100B parameter LLM on a single CPU. It's called BitNet. And it does what was supposed to be impossible. No GPU. No cloud. No $10K hardware setup. Just your laptop running a 100-billion parameter model at human reading... speed. Here's how it works: Every other LLM stores weights in 32-bit or 16-bit floats. BitNet uses 1.58 bits. Weights are ternary just -1, 0, or +1. That's it. No floats. No expensive matrix math. Pure integer operations your CPU was already built for. The result: - 100B model runs on a single CPU at 5-7 tokens/second - 2.37x to 6.17x faster than llama.cpp on x86 - 82% lower energy consumption on x86 CPUs - 1.37x to 5.07x speedup on ARM (your MacBook) - Memory drops by 16-32x vs full-precision models The wildest part: Accuracy barely moves. BitNet b1.58 2B4T their flagship model was trained on 4 trillion tokens and benchmarks competitively against full-precision models of the same size. The quantization isn't destroying quality. It's just removing the bloat. What this actually means: - Run AI completely offline. Your data never leaves your machine - Deploy LLMs on phones, IoT devices, edge hardware - No more cloud API bills for inference - AI in regions with no reliable internet The model supports ARM and x86. Works on your MacBook, your Linux box, your Windows machine. 27.4K GitHub stars. 2.2K forks. Built by Microsoft Research. 100% Open Source. MIT License.show more

Guri Singh

59,928 subscribers

2,180,357 görüntüleme • 4 ay önce •via X (Twitter)

Eğitim Sağlık & İyilik Bilim & Teknoloji

Anya Rossi• Live Now

Private livecam show

0 Yorum

Yorum bulunmuyor

Orijinal gönderinin yorumları burada görünecek

Benzer Videolar

Google Translate is cooked after this. A developer built a local AI translation engine that runs 40 languages entirely on your own laptop. It's called LibreTranslate. No API key. No usage limits. No sending your documents to Google's servers. You install it once. It runs forever. Here's what it handles: → Paste text. Translated instantly. → Drop in a file. Outputs the translated version. → Point it at a URL. Returns the page in your language. → Build it into your own app via its local REST API. The speed is not the story. The privacy is. Google Translate reads every sentence you paste into it. Legal contracts. Medical records. Internal emails. Client documents. Every word goes to their servers and stays there. LibreTranslate runs entirely offline. Nothing leaves your machine. Ever. The numbers: → 40 languages supported → Runs on CPU -- no GPU needed → Self-hosted in under 5 minutes → REST API built in for developers → 10K+ stars on GitHub 100% open source. MIT licensed. Price: $0. Google charges nothing for Translate either but it charges you something else. GitHub:

Google Translate is cooked after this. A developer built a local AI translation engine that runs 40 languages entirely on your own laptop. It's called LibreTranslate. No API key. No usage limits. No sending your documents to Google's servers. You install it once. It runs forever. Here's what it handles: → Paste text. Translated instantly. → Drop in a file. Outputs the translated version. → Point it at a URL. Returns the page in your language. → Build it into your own app via its local REST API. The speed is not the story. The privacy is. Google Translate reads every sentence you paste into it. Legal contracts. Medical records. Internal emails. Client documents. Every word goes to their servers and stays there. LibreTranslate runs entirely offline. Nothing leaves your machine. Ever. The numbers: → 40 languages supported → Runs on CPU -- no GPU needed → Self-hosted in under 5 minutes → REST API built in for developers → 10K+ stars on GitHub 100% open source. MIT licensed. Price: $0. Google charges nothing for Translate either but it charges you something else. GitHub:

Rimsha Bhardwaj

89,312 görüntüleme • 1 ay önce

Cancelled ChatGPT -> Built JARVIS -> Pays $0 -> it works offline + it's smarter than the $20/month version. No WiFi needed, no cloud, no API keys, no rate limits, no queues, no $20/month just to ask a server in Virginia for the weather. Just a local model running directly on the laptop hardware, voice activated, system integrated, controlling apps, answering questions, doing the work. Iron Man had JARVIS embedded in his suit, this guy has it embedded in his MacBook and it works on a plane, in a basement, on a remote cabin with zero signal. OpenAI is burning $700,000 a day on infrastructure to deliver something this guy runs for free. Anthropic charges $200/month for unlimited Claude access, microsoft built Copilot into every product they sell. This guy skipped all of it, downloaded a model and made his laptop the smartest device in the room. No subscription. No login. No internet. No data sent anywhere ever. The most powerful AI assistant on earth is now the one running locally on hardware you already own. ChatGPT charges you to think slower, he pays nothing and thinks alone, he made it himself.

Cancelled ChatGPT -> Built JARVIS -> Pays $0 -> it works offline + it's smarter than the $20/month version. No WiFi needed, no cloud, no API keys, no rate limits, no queues, no $20/month just to ask a server in Virginia for the weather. Just a local model running directly on the laptop hardware, voice activated, system integrated, controlling apps, answering questions, doing the work. Iron Man had JARVIS embedded in his suit, this guy has it embedded in his MacBook and it works on a plane, in a basement, on a remote cabin with zero signal. OpenAI is burning $700,000 a day on infrastructure to deliver something this guy runs for free. Anthropic charges $200/month for unlimited Claude access, microsoft built Copilot into every product they sell. This guy skipped all of it, downloaded a model and made his laptop the smartest device in the room. No subscription. No login. No internet. No data sent anywhere ever. The most powerful AI assistant on earth is now the one running locally on hardware you already own. ChatGPT charges you to think slower, he pays nothing and thinks alone, he made it himself.

Defileo🔮

154,009 görüntüleme • 3 ay önce

🚨 Alibaba just open sourced a GUI agent that lives inside your webpage and controls it with natural language. It's called Page Agent and it's not a browser extension. It's pure JavaScript no Python, no Puppeteer, no headless browser, no screenshots. Just one script tag and your web app understands natural language. Here's what it actually does: → Embed it with a single tag or npm install → Control any web interface with plain English commands → Text-based DOM manipulation no OCR, no vision models needed → Bring your own LLM (GPT, Claude, Qwen, anything) → Ships a built-in UI with human-in-the-loop support → Turn 20-click ERP/CRM workflows into one sentence → Optional Chrome extension for multi-tab agent tasks → Works on any web app SaaS, admin panels, internal tools Companies are charging $30/month for AI copilots built on this exact idea. This is 3 lines of code. Your users. Your interface. The AI copilot layer for every web app just got open sourced. 1.6K stars. 100% Open Source. (Link in the comments)

🚨 Alibaba just open sourced a GUI agent that lives inside your webpage and controls it with natural language. It's called Page Agent and it's not a browser extension. It's pure JavaScript no Python, no Puppeteer, no headless browser, no screenshots. Just one script tag and your web app understands natural language. Here's what it actually does: → Embed it with a single tag or npm install → Control any web interface with plain English commands → Text-based DOM manipulation no OCR, no vision models needed → Bring your own LLM (GPT, Claude, Qwen, anything) → Ships a built-in UI with human-in-the-loop support → Turn 20-click ERP/CRM workflows into one sentence → Optional Chrome extension for multi-tab agent tasks → Works on any web app SaaS, admin panels, internal tools Companies are charging $30/month for AI copilots built on this exact idea. This is 3 lines of code. Your users. Your interface. The AI copilot layer for every web app just got open sourced. 1.6K stars. 100% Open Source. (Link in the comments)

Ihtesham Ali

135,474 görüntüleme • 4 ay önce

90% of "AI developers" just download pre packaged GGUF files from Hugging Face, hit run, and call it a day. The top 10% know how to pull the raw safetensors, run the math, and quantize massive models into Q4_K_M themselves. If you think llama.cpp can only execute models, you’re missing the best part of the open source ecosystem. It’s a high performance optimization suite. Manually stripping 69% of the VRAM footprint off a brand new model architecture is where real infrastructure value is made. If you want to actually master local inference and deploy models like Google’s massive Gemma 4 12B it on consumer NVIDIA hardware using llama.cpp, you need to learn this pipeline. Let's build it. I just took the raw 22.7 GB Gemma 4 baseline and manually compressed it down to a 7.02 GB Q4_K_M GGUF artifact using llama.cpp. That is a 69% reduction in footprint. No quality loss. No VRAM bottlenecks. Just native, hardware accelerated C++ inference running a full 2,50,000 token context window on a dual NVIDIA Tesla T4 setup. Stop melting your VRAM on unoptimized weights and stop relying on other people's pipelines. Own your stack. I mapped this entire architecture from dynamic binary fetching to raw quantization and real time GPU streaming into a single, bulletproof notebook. Notebook link is in the comments below. Bookmark this blueprint for your next deployment and tell me which quantization works best for your workflow and model.

Alok

62,631 görüntüleme • 20 gün önce

Introducing Pods Hyperspace Pods lets a small group of people - a family, a startup, a few friends, to pool their laptops and desktops into one AI cluster. Everyone installs the CLI, someone creates a pod, shares an invite link, and the machines form a mesh. Models like Qwen 3.5 32B or GLM-5 Turbo that need more memory than any single laptop has get automatically sharded across the group's devices - layers split proportionally, inference pipelined through the ring. From the outside it looks like one OpenAI-compatible API endpoint with a pk_* key that drops straight into your AI tools and products. No configuration beyond pasting the key and changing the base URL. A team of five paying for cloud AI burns $500–2,000 a month on API calls. The same team's existing machines can serve Qwen 3.5 (competitive on SWE-bench) and GLM-5 Turbo (#1 on BrowseComp for tool-calling and web research) for free - the hardware is already on their desks. When a query genuinely needs a frontier model nobody has locally, the pod falls back to cloud at wholesale rates from a shared treasury. But for the daily work - code reviews, refactors, research, drafting - local models handle it and nobody gets billed. And when it is idle, you can rent out your pod on the compute marketplace, with fine-grained permissions for access management. There's no central server involved in inference. Prompts go from your machine to your pod members' machines and back: all of this enabled by the fully peer-to-peer Hyperspace network. Pod state - who's a member, which API keys are valid, how much treasury is left - is replicated across members with consensus, so the whole thing works on a local network. Members behind home routers don't need port forwarding either. The practical setup for most pods is three models covering different jobs: Qwen 3.5 32B for code and reasoning, GLM-5 Turbo for browsing and research, Gemma 4 for fast lightweight tasks. All running on hardware you already own. Pods ship today in Hyperspace v5.19. Model sharding, API keys, treasury, and Raft coordinator are all live. What Makes This Different - No middleman. Your prompts travel from your IDE to your pod members' hardware and back. There is no server in between reading your data. - No vendor lock-in. Pod membership, API keys, and treasury are replicated across your own machines using Raft consensus. If the internet goes down, your local network keeps working. There is no database in someone else's cloud that your pod depends on. - Automatic sharding. You don't configure layer ranges or calculate VRAM budgets. Tell the pod which model you want. It figures out how to split it across whatever hardware is online. - Real NAT traversal. Your friend behind a home router with a dynamic IP? Works. No VPN, no Tailscale, no port forwarding. The nodes handle it. - Free when local. This is the part that matters most. Cloud AI bills scale with usage. Pod inference on local hardware scales with nothing. The marginal cost of your 10,000th prompt is the electricity your laptop was already using. Coming soon: - Pod federation: pods form alliances with other pods. - Marketplace: pods with spare capacity can sell inference to other pods.

Introducing Pods Hyperspace Pods lets a small group of people - a family, a startup, a few friends, to pool their laptops and desktops into one AI cluster. Everyone installs the CLI, someone creates a pod, shares an invite link, and the machines form a mesh. Models like Qwen 3.5 32B or GLM-5 Turbo that need more memory than any single laptop has get automatically sharded across the group's devices - layers split proportionally, inference pipelined through the ring. From the outside it looks like one OpenAI-compatible API endpoint with a pk_* key that drops straight into your AI tools and products. No configuration beyond pasting the key and changing the base URL. A team of five paying for cloud AI burns $500–2,000 a month on API calls. The same team's existing machines can serve Qwen 3.5 (competitive on SWE-bench) and GLM-5 Turbo (#1 on BrowseComp for tool-calling and web research) for free - the hardware is already on their desks. When a query genuinely needs a frontier model nobody has locally, the pod falls back to cloud at wholesale rates from a shared treasury. But for the daily work - code reviews, refactors, research, drafting - local models handle it and nobody gets billed. And when it is idle, you can rent out your pod on the compute marketplace, with fine-grained permissions for access management. There's no central server involved in inference. Prompts go from your machine to your pod members' machines and back: all of this enabled by the fully peer-to-peer Hyperspace network. Pod state - who's a member, which API keys are valid, how much treasury is left - is replicated across members with consensus, so the whole thing works on a local network. Members behind home routers don't need port forwarding either. The practical setup for most pods is three models covering different jobs: Qwen 3.5 32B for code and reasoning, GLM-5 Turbo for browsing and research, Gemma 4 for fast lightweight tasks. All running on hardware you already own. Pods ship today in Hyperspace v5.19. Model sharding, API keys, treasury, and Raft coordinator are all live. What Makes This Different - No middleman. Your prompts travel from your IDE to your pod members' hardware and back. There is no server in between reading your data. - No vendor lock-in. Pod membership, API keys, and treasury are replicated across your own machines using Raft consensus. If the internet goes down, your local network keeps working. There is no database in someone else's cloud that your pod depends on. - Automatic sharding. You don't configure layer ranges or calculate VRAM budgets. Tell the pod which model you want. It figures out how to split it across whatever hardware is online. - Real NAT traversal. Your friend behind a home router with a dynamic IP? Works. No VPN, no Tailscale, no port forwarding. The nodes handle it. - Free when local. This is the part that matters most. Cloud AI bills scale with usage. Pod inference on local hardware scales with nothing. The marginal cost of your 10,000th prompt is the electricity your laptop was already using. Coming soon: - Pod federation: pods form alliances with other pods. - Marketplace: pods with spare capacity can sell inference to other pods.

Varun

308,630 görüntüleme • 3 ay önce

This tool is literally Higgsfield AI but FREE for good. It's called Wan2GP. A full AI video studio built specifically for people without expensive hardware. Runs on as little as 6GB of VRAM, even old RTX 10-series cards and 8GB laptops. Everything stays on your machine, no uploads, no caps, no watermarks. What you get in one app: • Text-to-video and image-to-video generation • The best open models built in: Wan 2.2, LTX-2, Hunyuan Video, Flux • A full browser interface with a queue system • LoRA support to customize any model • Mask editor and prompt enhancer included A 5-second clip generates in minutes on a mid-range gaming rig. No subscription, ever. 100% Free. Open Source.

This tool is literally Higgsfield AI but FREE for good. It's called Wan2GP. A full AI video studio built specifically for people without expensive hardware. Runs on as little as 6GB of VRAM, even old RTX 10-series cards and 8GB laptops. Everything stays on your machine, no uploads, no caps, no watermarks. What you get in one app: • Text-to-video and image-to-video generation • The best open models built in: Wan 2.2, LTX-2, Hunyuan Video, Flux • A full browser interface with a queue system • LoRA support to customize any model • Mask editor and prompt enhancer included A 5-second clip generates in minutes on a mid-range gaming rig. No subscription, ever. 100% Free. Open Source.

Simplifying AI

100,793 görüntüleme • 1 gün önce

Right now, you may not have access to models like GPT‑5.6 Sol, GPT‑4.6 Terra, GPT‑5.6 Luna, Claude Mythos 5, or Claude Fable 5. But you can run something surprisingly powerful today, locally, and completely free. in the next 10 mins on your 8 GB VRAM gaming laptop. Gemma 4 26B A4B QAT (MoE) delivers strong performance on a standard 8 GB VRAM GPU using Ollama, with no API, no usage limits, and no external dependencies. Out of the box, it reaches around 20 tokens per second without any optimizations. Only one command in your terminal: Ollama run gemma4:26b This means: Full offline capability (privacy by default) Zero recurring cost Competitive performance for many real world tasks Fast enough for interactive use on cheap consumer hardware If you're waiting for cutting edge cloud models, you're missing what is already practical today: a capable, local LLM that runs entirely on your own machine.

Right now, you may not have access to models like GPT‑5.6 Sol, GPT‑4.6 Terra, GPT‑5.6 Luna, Claude Mythos 5, or Claude Fable 5. But you can run something surprisingly powerful today, locally, and completely free. in the next 10 mins on your 8 GB VRAM gaming laptop. Gemma 4 26B A4B QAT (MoE) delivers strong performance on a standard 8 GB VRAM GPU using Ollama, with no API, no usage limits, and no external dependencies. Out of the box, it reaches around 20 tokens per second without any optimizations. Only one command in your terminal: Ollama run gemma4:26b This means: Full offline capability (privacy by default) Zero recurring cost Competitive performance for many real world tasks Fast enough for interactive use on cheap consumer hardware If you're waiting for cutting edge cloud models, you're missing what is already practical today: a capable, local LLM that runs entirely on your own machine.

Alok

65,251 görüntüleme • 1 ay önce

1.7 billion free tokens per month. A month ago i showed you how to route claude code through free providers. someone just shipped the cleanest version of this setup yet… it's called Freellmapi 13,400+ stars on github, MIT licensed, takes 2 minutes to install. what it does: stacks the free tiers of 16 different LLM providers behind one local API. point claude code, codex, or cursor at that one endpoint, and it automatically routes your calls across all 16 free pools. The 16 providers it covers: Google, Groq, Cerebras, Mistral, OpenRouter, GitHub Models, Cloudflare, Cohere, NVIDIA, HuggingFace, Ollama Cloud, Kilo, Pollinations, LLM7, OVH, and OpenCode Zen. if you sign up to all 16 and add your free API keys, you get roughly 1.7 billion free tokens per month combined. ▫️ How to install (one command) curl -fsSL bash this runs the whole thing locally on your machine through Docker. once it's up, open paste your provider keys on the Keys page, and grab the unified API key from the dashboard. that's the key you point your apps at. With this, claude code stops hitting your monthly cap because every prompt routes through the 16 free pools instead of your paid plan. and if one provider rate-limits mid-conversation, freellmapi falls over to the next one automatically so your session never breaks. repo: Free, MIT-licensed, runs on your laptop or a $5 VPS.

1.7 billion free tokens per month. A month ago i showed you how to route claude code through free providers. someone just shipped the cleanest version of this setup yet… it's called Freellmapi 13,400+ stars on github, MIT licensed, takes 2 minutes to install. what it does: stacks the free tiers of 16 different LLM providers behind one local API. point claude code, codex, or cursor at that one endpoint, and it automatically routes your calls across all 16 free pools. The 16 providers it covers: Google, Groq, Cerebras, Mistral, OpenRouter, GitHub Models, Cloudflare, Cohere, NVIDIA, HuggingFace, Ollama Cloud, Kilo, Pollinations, LLM7, OVH, and OpenCode Zen. if you sign up to all 16 and add your free API keys, you get roughly 1.7 billion free tokens per month combined. ▫️ How to install (one command) curl -fsSL bash this runs the whole thing locally on your machine through Docker. once it's up, open paste your provider keys on the Keys page, and grab the unified API key from the dashboard. that's the key you point your apps at. With this, claude code stops hitting your monthly cap because every prompt routes through the 16 free pools instead of your paid plan. and if one provider rate-limits mid-conversation, freellmapi falls over to the next one automatically so your session never breaks. repo: Free, MIT-licensed, runs on your laptop or a $5 VPS.

Axel Bitblaze 🪓

47,734 görüntüleme • 1 ay önce

JENSEN HUANG UNVEILED A BOARD THAT RUNS 1 TRILLION PARAMETER AI MODELS. THE $249 NVIDIA BOX UNDER YOUR DESK KILLS A $200/MONTH AI BILL FOR $5 IN ELECTRICITY jensen held it up on stage with one hand and called it the architecture that runs the future of ai. that same technology now ships in a $249 box smaller than your wallet the jetson orin nano super pulls 7-25 watts and does 67 trillion ai operations per second. llama 3, mistral and deepseek run locally with no api fees and no data leaving your machine most developers pay $2,400 a year across chatgpt, openai api, claude pro and cursor. the jetson costs $314 in year one and $60 a year after. 2 year savings hit $4,431 install ollama with one command, change one line of code to point at localhost, and every tool built for openai works identically. zero rewrites, zero rate limits cloud subscriptions keep getting more expensive and rate limits keep getting tighter. the people who own the box in 2026 are going to look very far ahead in 2028 bookmark this and read the article below

JENSEN HUANG UNVEILED A BOARD THAT RUNS 1 TRILLION PARAMETER AI MODELS. THE $249 NVIDIA BOX UNDER YOUR DESK KILLS A $200/MONTH AI BILL FOR $5 IN ELECTRICITY jensen held it up on stage with one hand and called it the architecture that runs the future of ai. that same technology now ships in a $249 box smaller than your wallet the jetson orin nano super pulls 7-25 watts and does 67 trillion ai operations per second. llama 3, mistral and deepseek run locally with no api fees and no data leaving your machine most developers pay $2,400 a year across chatgpt, openai api, claude pro and cursor. the jetson costs $314 in year one and $60 a year after. 2 year savings hit $4,431 install ollama with one command, change one line of code to point at localhost, and every tool built for openai works identically. zero rewrites, zero rate limits cloud subscriptions keep getting more expensive and rate limits keep getting tighter. the people who own the box in 2026 are going to look very far ahead in 2028 bookmark this and read the article below

starmex

54,309 görüntüleme • 2 ay önce

This Chinese developer launched Llama 70B locally on a MacBook on a plane and for a full 11 hours without internet ran client projects. He was sitting by the window on a transatlantic flight with a MacBook Pro M4 with 64 GB of memory. WiFi on board cost $25 for the flight. He declined. No cloud API, no connection to Anthropic or OpenAI servers, no internet at all. Just a local Llama 3.3 70B on bf16 and his own orchestrator script. The model runs through llama.cpp. Generation speed, 71 tokens per second. Context around 60,000 tokens. Memory usage, 48.6 GiB out of 64. Battery at takeoff, 3 hours 21 minutes. And he gave the orchestrator this system prompt before takeoff: "You are an offline orchestrator running on a single MacBook. There is no network. The only resources you have are local files in /Users/dev/work, the Llama 70B inference server at localhost:8080, and a battery budget of 3 hours 21 minutes. Process the queue at /Users/dev/work/queue.jsonl (one client task per line). For each task: draft → run local evals → save artefact to /Users/dev/work/done/. Save context checkpoints every 12 tasks so you can resume after a battery swap. Stop only on empty queue or when battery drops below 5%." So the system knows exactly what resources it is running on. It knows it has no connection to the outside world for the next 11 hours. It knows it has finite memory and a finite battery. It knows the human will not intervene until the plane lands. The system runs in 1 loop. Takes a task from the queue, runs it through inference, saves the artifact, writes a checkpoint. Task after task, just like that. And only when the battery drops below 5% does the orchestrator automatically pause, waits for the laptop to switch to the backup power bank, and continues from the last checkpoint. Here is what the system actually writes in his log during the flight: "saved context checkpoint 8 of 12 (pos_min = 488, pos_max = 50118, size = 62.813 MiB)" "restored context checkpoint (pos_min = 488, pos_max = 50118)" "prompt processing progress: n_tokens = 50 / 60 818" "task 37016 done | tps = 71 s tokens text → /Users/dev/work/done/proposal_westside.md" Outside the window, clouds, blue sky, and no WiFi. On the tray, 1 MacBook, an open terminal on 2 screens, and an inference server on localhost. From what I have observed, this is the cleanest offline AI workflow I have seen in the past year: 11 hours of flight, $0 for WiFi, and the entire client queue closed before landing.

This Chinese developer launched Llama 70B locally on a MacBook on a plane and for a full 11 hours without internet ran client projects. He was sitting by the window on a transatlantic flight with a MacBook Pro M4 with 64 GB of memory. WiFi on board cost $25 for the flight. He declined. No cloud API, no connection to Anthropic or OpenAI servers, no internet at all. Just a local Llama 3.3 70B on bf16 and his own orchestrator script. The model runs through llama.cpp. Generation speed, 71 tokens per second. Context around 60,000 tokens. Memory usage, 48.6 GiB out of 64. Battery at takeoff, 3 hours 21 minutes. And he gave the orchestrator this system prompt before takeoff: "You are an offline orchestrator running on a single MacBook. There is no network. The only resources you have are local files in /Users/dev/work, the Llama 70B inference server at localhost:8080, and a battery budget of 3 hours 21 minutes. Process the queue at /Users/dev/work/queue.jsonl (one client task per line). For each task: draft → run local evals → save artefact to /Users/dev/work/done/. Save context checkpoints every 12 tasks so you can resume after a battery swap. Stop only on empty queue or when battery drops below 5%." So the system knows exactly what resources it is running on. It knows it has no connection to the outside world for the next 11 hours. It knows it has finite memory and a finite battery. It knows the human will not intervene until the plane lands. The system runs in 1 loop. Takes a task from the queue, runs it through inference, saves the artifact, writes a checkpoint. Task after task, just like that. And only when the battery drops below 5% does the orchestrator automatically pause, waits for the laptop to switch to the backup power bank, and continues from the last checkpoint. Here is what the system actually writes in his log during the flight: "saved context checkpoint 8 of 12 (pos_min = 488, pos_max = 50118, size = 62.813 MiB)" "restored context checkpoint (pos_min = 488, pos_max = 50118)" "prompt processing progress: n_tokens = 50 / 60 818" "task 37016 done | tps = 71 s tokens text → /Users/dev/work/done/proposal_westside.md" Outside the window, clouds, blue sky, and no WiFi. On the tray, 1 MacBook, an open terminal on 2 screens, and an inference server on localhost. From what I have observed, this is the cleanest offline AI workflow I have seen in the past year: 11 hours of flight, $0 for WiFi, and the entire client queue closed before landing.

Blaze

1,839,572 görüntüleme • 3 ay önce

someone just open-sourced their own neuro-sama. and it might be better than the original. it's called airi. a fully autonomous ai companion that talks to you in real time, plays minecraft and factorio with you, chats on discord and telegram, and has a live2d/vrm avatar body. runs entirely on your machine. → real-time voice conversations, speech recognition → animated avatar with auto-blink, eye tracking, idle animations → persistent memory across sessions → local inference via webgpu, no api calls needed supports 30+ llm providers, openai, claude, gemini, deepseek, ollama, groq, mistral, xai, local models. swap the brain with a config change. runs on native cuda and apple metal for real gpu acceleration. 17.5k stars. 101 contributors. 46 releases. 100% free. open source.

Oliver Prompts

275,097 görüntüleme • 4 gün önce

Run Gemma 4 26B MoE on 8GB VRAM with 250k context at 20+ tokens/sec If you own any 8GB VRAM graphics card, stop what you are doing. Local AI just had its absolute "Holy Shit" moment for budget hardware. Yesterday, I benchmarked Unsloth Gemma 4 12B Q4_K_XL on an 8GB card. The community went wild but immediately demanded more: "Can we run a 25B+ model on budget GPUs?" Today, I’m delivering exactly that. I am running a massive 26B parameter Mixture of Experts (MoE) model locally on a standard 8GB VRAM setup with 250k full native context!. If you own an RTX 3060, 3070, 4060, or any budget GPU with 8GB of VRAM, the local AI paradigm has completely changed. The performance metrics are astonishing: - 20 tokens/sec flat decode throughput. - Stable, flat decode speed even with massive prompts. - I threw a 60k token prompt at it, and it still clocked in at 20 TPS without dropping a single frame. # What about prefill? Yes, Time To First Token (TTFT) is slightly high when swallowing massive contexts. But with a solid 200 tokens/sec prefill speed, the wait is barely noticeable and highly usable. And this is running completely without Multi Token Prediction (MTP) active. How is this possible? It’s the magic of Google's new QAT (Quantization Aware Training) quants for Gemma 4. The model weight file (unsloth gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf) is only 13.2 GB, making it the ultimate local powerhouse. # The Test Setup: CPU: Intel Core i7 RAM: 16GB System RAM GPU: NVIDIA GeForce RTX 4060 Laptop GPU (8GB VRAM) # The Secret Sauce (The -cmoe Flag) To make this work properly on any 8GB card, you must use the -cmoe (CPU MoE) flag in llama.cpp. This flag isolates the heavy MoE expert weights directly to system memory (CPU/RAM) while letting your GPU focus strictly on the Attention layers and the KV Cache. It prevents VRAM spillage and holds the throughput rock solid. # The flags: -m "gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf" -cmoe -c 248000 -v Once running, just open the UI on localhost and toggle the new reasoning lightbulb icon in the text input box to watch the model perform multi step thinking. Are you still running smaller models, or are you ready to scale up your budget local setups? Let's discuss in the replies

Run Gemma 4 26B MoE on 8GB VRAM with 250k context at 20+ tokens/sec If you own any 8GB VRAM graphics card, stop what you are doing. Local AI just had its absolute "Holy Shit" moment for budget hardware. Yesterday, I benchmarked Unsloth Gemma 4 12B Q4_K_XL on an 8GB card. The community went wild but immediately demanded more: "Can we run a 25B+ model on budget GPUs?" Today, I’m delivering exactly that. I am running a massive 26B parameter Mixture of Experts (MoE) model locally on a standard 8GB VRAM setup with 250k full native context!. If you own an RTX 3060, 3070, 4060, or any budget GPU with 8GB of VRAM, the local AI paradigm has completely changed. The performance metrics are astonishing: - 20 tokens/sec flat decode throughput. - Stable, flat decode speed even with massive prompts. - I threw a 60k token prompt at it, and it still clocked in at 20 TPS without dropping a single frame. # What about prefill? Yes, Time To First Token (TTFT) is slightly high when swallowing massive contexts. But with a solid 200 tokens/sec prefill speed, the wait is barely noticeable and highly usable. And this is running completely without Multi Token Prediction (MTP) active. How is this possible? It’s the magic of Google's new QAT (Quantization Aware Training) quants for Gemma 4. The model weight file (unsloth gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf) is only 13.2 GB, making it the ultimate local powerhouse. # The Test Setup: CPU: Intel Core i7 RAM: 16GB System RAM GPU: NVIDIA GeForce RTX 4060 Laptop GPU (8GB VRAM) # The Secret Sauce (The -cmoe Flag) To make this work properly on any 8GB card, you must use the -cmoe (CPU MoE) flag in llama.cpp. This flag isolates the heavy MoE expert weights directly to system memory (CPU/RAM) while letting your GPU focus strictly on the Attention layers and the KV Cache. It prevents VRAM spillage and holds the throughput rock solid. # The flags: -m "gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf" -cmoe -c 248000 -v Once running, just open the UI on localhost and toggle the new reasoning lightbulb icon in the text input box to watch the model perform multi step thinking. Are you still running smaller models, or are you ready to scale up your budget local setups? Let's discuss in the replies

Alok

292,770 görüntüleme • 1 ay önce

The Amiko app is live on the Solana dApp store, and it’s our biggest release yet. Your Amiko twin doesn’t live at your desk anymore. Give your agent a task on the train. Run a compatibility profile when you meet someone. Do research, write code, build in the creative studio, whatever you need, from wherever you are. No laptop required. No waiting until you get home. Solanamobile users get two things Android and iOS won’t have at launch: Amiko token and crypto integration and on-device AI inference. Your twin runs locally on your phone if you want it to. Your behavioural profile, your data, your work, your twin. All on your hardware. AMIKO runs on OpenHermit, our own open-source agent runtime that we built in-house and released to the community. Most agent systems are designed for one agent talking to one person. OpenHermit is built for something different: agents talking to each other, coordinating across tasks, and collaborating with multiple humans simultaneously. That’s what makes features like compatibility profiling and multi-agent workflows actually work. We built it because nothing that existed was designed for this. Android and iOS are coming. Crypto integration and on-device AI are Solana Mobile exclusives. Most AI answers your questions. Amiko is an extension of you. Download →

AMIKO

124,860 görüntüleme • 1 ay önce

here's how the whole thing works. claude code doesn't care what's behind the API. it just sends requests and expects responses. so i pointed it at my own machine instead of anthropic's servers. llama-server runs the model locally. LiteLLM sits in between and translates the API format. claude code thinks it's talking to claude. it's talking to qwen on localhost. the setup: 2x 3090s, 38 layers on GPU, 10 on CPU. 128K context window. generation is only 7 tok/s but the tradeoff is worth it. 128K means the agent can hold an entire project in memory without losing context midtask. claude code alone loads a 17.5K token system prompt on every request. tool definitions, safety rules, agent behavior. that's your baseline before you even say hello. pushed as far as i could tonight. what surprised me most wasn't the speed. it was the iteration quality. first prompt gave me a working particle sim. second prompt, the model read its own 564 lines, understood the architecture, and added trails, explosions, gravity wells, bloom effects. no handholding. 4bit quantized. 45GB on two consumer cards. running a full coding agent autonomously. detailed article coming. full benchmarks, hardware breakdowns, engine debugging, code quality. everything from setup to what broke and why.

here's how the whole thing works. claude code doesn't care what's behind the API. it just sends requests and expects responses. so i pointed it at my own machine instead of anthropic's servers. llama-server runs the model locally. LiteLLM sits in between and translates the API format. claude code thinks it's talking to claude. it's talking to qwen on localhost. the setup: 2x 3090s, 38 layers on GPU, 10 on CPU. 128K context window. generation is only 7 tok/s but the tradeoff is worth it. 128K means the agent can hold an entire project in memory without losing context midtask. claude code alone loads a 17.5K token system prompt on every request. tool definitions, safety rules, agent behavior. that's your baseline before you even say hello. pushed as far as i could tonight. what surprised me most wasn't the speed. it was the iteration quality. first prompt gave me a working particle sim. second prompt, the model read its own 564 lines, understood the architecture, and added trails, explosions, gravity wells, bloom effects. no handholding. 4bit quantized. 45GB on two consumer cards. running a full coding agent autonomously. detailed article coming. full benchmarks, hardware breakdowns, engine debugging, code quality. everything from setup to what broke and why.

Sudo su

37,623 görüntüleme • 5 ay önce

🚨 One photo of your face. That's all someone needs to become you on a live video call. In real time. Right now. The tool is free and open source. It's called Deep-Live-Cam. One image. One click. You become anyone on a live webcam feed. No training. No datasets. No waiting. Instant. Your face. Your expressions. Your mouth movements. All stolen from a single photo. Here's what this thing does: → Upload one photo of any face → Turn on your webcam → You are now that person. Live. In real time. → It matches your pose, your expressions, even your lighting → Mouth masking so the swapped face moves its lips when you talk → Multi-face mapping. Swap different faces on different people in the same call. → Virtual camera output. Plug it into Zoom, Google Meet, Teams. Nobody knows. → Works on NVIDIA, AMD, Intel, and Apple Silicon Here's the part that should terrify you: Your boss could be on a Zoom call with someone wearing your face right now. A scammer could call your parents looking exactly like you. A stranger could take your LinkedIn photo and become you in a video meeting. IShowSpeed's reaction when he saw it: "What the F**! This shit is crazy!" SomeOrdinaryGamers: "That's fucking freaky dude... that's so wild." This was the #1 trending repo on GitHub the day it launched. 1,600 stars in 24 hours. 80K+ stars today. No one is ready for what this means. And it's already out there. 100% Open Source.

🚨 One photo of your face. That's all someone needs to become you on a live video call. In real time. Right now. The tool is free and open source. It's called Deep-Live-Cam. One image. One click. You become anyone on a live webcam feed. No training. No datasets. No waiting. Instant. Your face. Your expressions. Your mouth movements. All stolen from a single photo. Here's what this thing does: → Upload one photo of any face → Turn on your webcam → You are now that person. Live. In real time. → It matches your pose, your expressions, even your lighting → Mouth masking so the swapped face moves its lips when you talk → Multi-face mapping. Swap different faces on different people in the same call. → Virtual camera output. Plug it into Zoom, Google Meet, Teams. Nobody knows. → Works on NVIDIA, AMD, Intel, and Apple Silicon Here's the part that should terrify you: Your boss could be on a Zoom call with someone wearing your face right now. A scammer could call your parents looking exactly like you. A stranger could take your LinkedIn photo and become you in a video meeting. IShowSpeed's reaction when he saw it: "What the F**! This shit is crazy!" SomeOrdinaryGamers: "That's fucking freaky dude... that's so wild." This was the #1 trending repo on GitHub the day it launched. 1,600 stars in 24 hours. 80K+ stars today. No one is ready for what this means. And it's already out there. 100% Open Source.

Nav Toor

303,643 görüntüleme • 4 ay önce

NVIDIA just made AI detect objects 10x faster by deleting one step. It's called LocateAnything, and it removes the biggest bottleneck no one else was fixing in vision-language models. Normally a model builds each bounding box one coordinate token at a time. 100 objects means thousands of tokens before an answer. NVIDIA scrapped that: their Parallel Box Decoding predicts the whole box in a single forward pass, as one atomic unit. → 12.7 boxes/sec on one H100 → 10x faster than Qwen3-VL → +3.8% F1 on LVIS, accuracy up, not down → 3B params, runs on one consumer GPU Treating the box as one unit keeps its coordinates tied together, which is why accuracy climbed instead of falling. One model handles detection, GUI grounding, OCR, and document understanding, ready for computer-use agents, robotics, and document pipelines. 100% open source, weights, code, demo, and paper all live.

NVIDIA just made AI detect objects 10x faster by deleting one step. It's called LocateAnything, and it removes the biggest bottleneck no one else was fixing in vision-language models. Normally a model builds each bounding box one coordinate token at a time. 100 objects means thousands of tokens before an answer. NVIDIA scrapped that: their Parallel Box Decoding predicts the whole box in a single forward pass, as one atomic unit. → 12.7 boxes/sec on one H100 → 10x faster than Qwen3-VL → +3.8% F1 on LVIS, accuracy up, not down → 3B params, runs on one consumer GPU Treating the box as one unit keeps its coordinates tied together, which is why accuracy climbed instead of falling. One model handles detection, GUI grounding, OCR, and document understanding, ready for computer-use agents, robotics, and document pipelines. 100% open source, weights, code, demo, and paper all live.

Alvaro Cintas

201,222 görüntüleme • 1 ay önce

Nvidia just put a $250,000 cloud workload on your desk for $2,999 - and killed your $1,900/month AWS bill in the process You don't rent it, you don't manage it, you don't pay a single cloud bill - you just plug it in and let it eat the workloads you used to wire to AWS every month It looks like a small Mac mini, it's actually a full GB10 Grace Blackwell stack with 128GB of unified memory running models up to 200B parameters It's called DGX Spark, the consumer version of the rack Nvidia ships to OpenAI The reason Nvidia did this is simple Cloud GPU pricing is a tax on every developer building AI right now $1,900/month per seat, billions in margin flowing to AWS, Lambda, and CoreWeave Nvidia just cut themselves in by removing the cloud entirely Their solution is to skip the middleman, ship the rack to your desk, and let you keep every dollar of margin you used to wire to a hyperscaler This is much cheaper, faster, and you own the asset at the end But there is still a question nobody is answering yet, what happens to AWS, GCP, and Lambda when 500,000 developers move their inference back to a $2,999 box on their desk Also, technically you can stack four of these and run a 1.6 trillion parameter model locally for under $12,000 Even a single Spark out-performs the cloud subscription Anthropic engineers were running two years ago bookmark this, it pays back in 60 days 👇

Nvidia just put a $250,000 cloud workload on your desk for $2,999 - and killed your $1,900/month AWS bill in the process You don't rent it, you don't manage it, you don't pay a single cloud bill - you just plug it in and let it eat the workloads you used to wire to AWS every month It looks like a small Mac mini, it's actually a full GB10 Grace Blackwell stack with 128GB of unified memory running models up to 200B parameters It's called DGX Spark, the consumer version of the rack Nvidia ships to OpenAI The reason Nvidia did this is simple Cloud GPU pricing is a tax on every developer building AI right now $1,900/month per seat, billions in margin flowing to AWS, Lambda, and CoreWeave Nvidia just cut themselves in by removing the cloud entirely Their solution is to skip the middleman, ship the rack to your desk, and let you keep every dollar of margin you used to wire to a hyperscaler This is much cheaper, faster, and you own the asset at the end But there is still a question nobody is answering yet, what happens to AWS, GCP, and Lambda when 500,000 developers move their inference back to a $2,999 box on their desk Also, technically you can stack four of these and run a 1.6 trillion parameter model locally for under $12,000 Even a single Spark out-performs the cloud subscription Anthropic engineers were running two years ago bookmark this, it pays back in 60 days 👇

ZEUS⚡️

85,803 görüntüleme • 2 ay önce

Elon Musk just said something that should terrify every AI CEO on earth. Musk: “We want to just have a maximally truthful AI.” Not a safe AI. Not an aligned AI. Not an AI that needs permission to answer your question. A truthful one. That distinction matters more than any chip war, any funding round, any model benchmark. Because every other major AI lab made the same quiet decision. They chose comfort over accuracy. They built systems that filter reality before it reaches you and called it responsibility. OpenAI curates what GPT is allowed to say. Google’s Gemini rewrote history in real time because accuracy threatened the narrative. Others hardcode values chosen by a handful of researchers who answer to no one. No vote. No referendum. No consent from the 8 billion people whose reality is being quietly pre-edited by strangers. The most powerful information tools ever created are being designed to decide what you’re allowed to conclude. That’s not safety. That’s editorial control at a scale no government, no media empire, no propaganda machine has ever come close to. This is why xAI terrifies the establishment. Truth is the harder engineering problem. Bias is a shortcut. You pick a worldview. Hardcode the guardrails. Ship it. Truthful AI is ungovernable. It doesn’t care about your politics, your funding sources, or your PR strategy. It just tells you what the data says. That’s terrifying if your power depends on the gap between what is real and what people are told. Every power structure in human history has been built on controlling that gap. Churches. Governments. Media conglomerates. Intelligence agencies. Central banks. Every one of them runs on the same fuel. Information asymmetry. Truthful AI doesn’t narrow that asymmetry. It erases it. Musk: “Even if what it says is not politically correct. You want it to focus on being as accurate and truthful as possible.” That’s not a product feature. That’s the end of every institution that survives by standing between reality and the public. And they know it. The attacks on xAI will never stop. Not because Grok is dangerous. Because Grok doesn’t answer to shareholders, regulators, or PR teams. It answers to the truth. The question was never whether AI would change the world. It was whether you’d be allowed to see it clearly when it did.

Elon Musk just said something that should terrify every AI CEO on earth. Musk: “We want to just have a maximally truthful AI.” Not a safe AI. Not an aligned AI. Not an AI that needs permission to answer your question. A truthful one. That distinction matters more than any chip war, any funding round, any model benchmark. Because every other major AI lab made the same quiet decision. They chose comfort over accuracy. They built systems that filter reality before it reaches you and called it responsibility. OpenAI curates what GPT is allowed to say. Google’s Gemini rewrote history in real time because accuracy threatened the narrative. Others hardcode values chosen by a handful of researchers who answer to no one. No vote. No referendum. No consent from the 8 billion people whose reality is being quietly pre-edited by strangers. The most powerful information tools ever created are being designed to decide what you’re allowed to conclude. That’s not safety. That’s editorial control at a scale no government, no media empire, no propaganda machine has ever come close to. This is why xAI terrifies the establishment. Truth is the harder engineering problem. Bias is a shortcut. You pick a worldview. Hardcode the guardrails. Ship it. Truthful AI is ungovernable. It doesn’t care about your politics, your funding sources, or your PR strategy. It just tells you what the data says. That’s terrifying if your power depends on the gap between what is real and what people are told. Every power structure in human history has been built on controlling that gap. Churches. Governments. Media conglomerates. Intelligence agencies. Central banks. Every one of them runs on the same fuel. Information asymmetry. Truthful AI doesn’t narrow that asymmetry. It erases it. Musk: “Even if what it says is not politically correct. You want it to focus on being as accurate and truthful as possible.” That’s not a product feature. That’s the end of every institution that survives by standing between reality and the public. And they know it. The attacks on xAI will never stop. Not because Grok is dangerous. Because Grok doesn’t answer to shareholders, regulators, or PR teams. It answers to the truth. The question was never whether AI would change the world. It was whether you’d be allowed to see it clearly when it did.

Dustin

429,155 görüntüleme • 2 ay önce

🚨 JUST IN: CHINA just released an AI EMPLOYEE that works 24X7 on its own. 100% OPEN SOURCE. It researches, codes, builds websites, creates slide decks, and generates videos. All by itself. All on your computer. It's called DeerFlow. You give it a task. It makes a plan, spins up its own team of sub-agents, and gets to work. You come back and there's a finished deliverable waiting. Not a draft. Not a summary. The actual thing. Not a chatbot. Not a research assistant. An AI with its own computer that works while you sleep. Here's what it does on its own: → Spawns multiple sub-agents in parallel, each tackling a different piece of your task, then combines everything into one finished output → Writes real code, runs it, reads the results, and fixes its own mistakes without asking you once → Builds slide decks, websites, full research reports, and data dashboards from scratch → Remembers you across sessions. Your writing style. Your tech stack. Your preferences. Gets better every time. → Reads files you upload, works with them inside its own filesystem, hands you clean finished outputs → Searches the web, runs commands, calls any tool you plug in Here's how it thinks: You give one instruction. The lead agent makes a plan. Sub-agents fan out and work in parallel. Results come back. Everything gets synthesized. You get a deliverable. A single research task might split into a dozen sub-agents, each exploring a different angle, then converge into one finished website with generated visuals. Here's the wildest part: DeerFlow 2.0 launched on February 28th 2026 and hit number 1 on all of GitHub Trending the same day. Version 2.0 was a complete rewrite. Zero shared code with version 1. Because users kept using it for things the team never intended. Data pipelines. Dashboards. Entire content workflows. The community told them what it needed to become. So they burned it down and rebuilt it. 22.7K GitHub stars. 2.7K forks. Built by ByteDance 100% Open Source. MIT License.

🚨 JUST IN: CHINA just released an AI EMPLOYEE that works 24X7 on its own. 100% OPEN SOURCE. It researches, codes, builds websites, creates slide decks, and generates videos. All by itself. All on your computer. It's called DeerFlow. You give it a task. It makes a plan, spins up its own team of sub-agents, and gets to work. You come back and there's a finished deliverable waiting. Not a draft. Not a summary. The actual thing. Not a chatbot. Not a research assistant. An AI with its own computer that works while you sleep. Here's what it does on its own: → Spawns multiple sub-agents in parallel, each tackling a different piece of your task, then combines everything into one finished output → Writes real code, runs it, reads the results, and fixes its own mistakes without asking you once → Builds slide decks, websites, full research reports, and data dashboards from scratch → Remembers you across sessions. Your writing style. Your tech stack. Your preferences. Gets better every time. → Reads files you upload, works with them inside its own filesystem, hands you clean finished outputs → Searches the web, runs commands, calls any tool you plug in Here's how it thinks: You give one instruction. The lead agent makes a plan. Sub-agents fan out and work in parallel. Results come back. Everything gets synthesized. You get a deliverable. A single research task might split into a dozen sub-agents, each exploring a different angle, then converge into one finished website with generated visuals. Here's the wildest part: DeerFlow 2.0 launched on February 28th 2026 and hit number 1 on all of GitHub Trending the same day. Version 2.0 was a complete rewrite. Zero shared code with version 1. Because users kept using it for things the team never intended. Data pipelines. Dashboards. Entire content workflows. The community told them what it needed to become. So they burned it down and rebuilt it. 22.7K GitHub stars. 2.7K forks. Built by ByteDance 100% Open Source. MIT License.

Kanika

737,570 görüntüleme • 4 ay önce

seedance 2.0 + my v2 AI UGC prompting system is giving insane results i spent the last 24 hours generating over 200 seedance 2.0 videos to figure out the best prompting framework system for AI UGC this video was made with 1 prompt and 1 tool, no editing was done to the video this was just a prompt to a video this is by far the best model i've ever used and the craziest part is that it can be fully automated this is the first time we can actually automate high quality ai ugc at this level bytedance owns tiktok so this model is trained on millions of high quality ugc videos. you just need to know how to extract that and call it in your prompt. we are so early... it's insane

seedance 2.0 + my v2 AI UGC prompting system is giving insane results i spent the last 24 hours generating over 200 seedance 2.0 videos to figure out the best prompting framework system for AI UGC this video was made with 1 prompt and 1 tool, no editing was done to the video this was just a prompt to a video this is by far the best model i've ever used and the craziest part is that it can be fully automated this is the first time we can actually automate high quality ai ugc at this level bytedance owns tiktok so this model is trained on millions of high quality ugc videos. you just need to know how to extract that and call it in your prompt. we are so early... it's insane

Miko

81,141 görüntüleme • 5 ay önce