Holy shit... Microsoft open sourced an inference framework that runs a 100B parameter LLM on a single CPU.
It's called BitNet. And it does what was supposed to be impossible.
No GPU. No cloud. No $10K hardware setup. Just your laptop running a 100-billion parameter model at human reading speed.
Here's how it works:
Most LLMs store weights as 32-bit or 16-bit floats. BitNet uses 1.58 bits. Weights are ternary: just -1, 0, or +1. That's it. No floats. No expensive matrix math. Pure integer operations your CPU was already built for.
The result:
- 100B model runs on a single CPU at 5-7 tokens/second
- 2.37x to 6.17x faster than llama.cpp on x86
- 82% lower energy consumption on x86 CPUs
- 1.37x to 5.07x speedup on ARM (your MacBook)
- Memory drops by 16-32x vs full-precision models
The wildest part: Accuracy barely moves.
BitNet b1.58 2B4T, their flagship model, was trained on 4 trillion tokens and benchmarks competitively against full-precision models of the same size. The quantization isn't destroying quality. It's just removing the bloat.
What this actually means:
- Run AI completely offline. Your data never leaves your machine
- Deploy LLMs on phones, IoT devices, edge hardware
- No more cloud API bills for inference
- AI in regions with no reliable internet
The model supports ARM and x86. Works on your MacBook, your Linux box, your Windows machine.
27.4K GitHub stars. 2.2K forks. Built by Microsoft Research.
100% Open Source. MIT License.

Guri Singh
2,180,357 views • 3 months ago
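
To make the "ternary weights, integer operations" idea in the BitNet post concrete, here is a minimal sketch. It assumes absmean scaling (divide by the mean absolute weight, round, clip to -1..+1); the function names are hypothetical and this is not the bitnet.cpp kernel, which packs weights and uses lookup tables. The "1.58 bits" figure is log2(3), the information content of a three-valued weight.

```python
import math

def ternarize(weights):
    """Quantize floats to {-1, 0, +1} plus one shared scale.

    Absmean scaling is an assumption made for this sketch.
    """
    scale = sum(abs(w) for w in weights) / len(weights) or 1.0
    ternary = [max(-1, min(1, round(w / scale))) for w in weights]
    return ternary, scale

def ternary_dot(ternary, scale, activations):
    """Dot product that only adds or subtracts activations, then rescales once."""
    acc = 0
    for t, a in zip(ternary, activations):
        if t == 1:
            acc += a
        elif t == -1:
            acc -= a
        # t == 0 contributes nothing, so the activation is skipped
    return acc * scale

# Information content of one three-valued weight: about 1.585 bits.
BITS_PER_TERNARY_WEIGHT = math.log2(3)

weights = [0.9, -1.1, 0.05, 1.0]
ternary, scale = ternarize(weights)
result = ternary_dot(ternary, scale, [2, 3, 5, 7])
```

With integer activations the inner loop never multiplies; the single floating-point multiply by `scale` happens once per output value.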
Someone just built a desktop app that generates 3D models from images and runs 100% locally. It's called Modly. It runs entirely on your GPU, no cloud, no API bills. Just drop an image and get a 3D mesh. 100% Open Source.

How To Prompt
222,682 views • 2 months ago
Google Translate is cooked after this. A developer built a local AI translation engine that runs 40 languages entirely on your own laptop.
It's called LibreTranslate. No API key. No usage limits. No sending your documents to Google's servers. You install it once. It runs forever.
Here's what it handles:
→ Paste text. Translated instantly.
→ Drop in a file. Outputs the translated version.
→ Point it at a URL. Returns the page in your language.
→ Build it into your own app via its local REST API.
The speed is not the story. The privacy is. Google Translate reads every sentence you paste into it. Legal contracts. Medical records. Internal emails. Client documents. Every word goes to their servers and stays there.
LibreTranslate runs entirely offline. Nothing leaves your machine. Ever.
The numbers:
→ 40 languages supported
→ Runs on CPU -- no GPU needed
→ Self-hosted in under 5 minutes
→ REST API built in for developers
→ 10K+ stars on GitHub
100% open source. AGPL-3.0 licensed. Price: $0.
Google charges nothing for Translate either, but it charges you something else.
GitHub:

Rimsha Bhardwaj
88,180 views • 9 days ago
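
The "local REST API" bullet in the LibreTranslate post can be sketched as follows. This assumes a LibreTranslate instance is listening on its default `http://localhost:5000` and exposes the `POST /translate` endpoint with `q`, `source`, `target`, and `format` fields; the helper names are hypothetical, and an instance configured to require API keys would need an extra `api_key` field.

```python
import json
import urllib.request

def build_translate_request(text, source, target, base_url="http://localhost:5000"):
    """Build the POST request for a local LibreTranslate /translate call."""
    payload = {"q": text, "source": source, "target": target, "format": "text"}
    return urllib.request.Request(
        f"{base_url}/translate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def translate(text, source="en", target="es"):
    """Send the request; only works while a local server is running."""
    with urllib.request.urlopen(build_translate_request(text, source, target)) as resp:
        return json.load(resp)["translatedText"]

request = build_translate_request("Hello", "en", "de")
```

Calling `translate("Hello", "en", "de")` only succeeds once the server is up; building the request is kept separate so it can be inspected without a network call.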
PewDiePie just hit 20K GitHub stars in under 24 hours.
The project? Odysseus. A self-hosted AI workspace that runs 100% on your machine.
• Agents with tools
• MCP built in
• Persistent memory
• File handling
• Windows, macOS, Linux
Your data never leaves your device. It supports Ollama, llama.cpp, and vLLM locally, with OpenAI and OpenRouter support if you want cloud models too.
The crazy part? A YouTuber with 110M+ subscribers just out-shipped most AI startups. And he built half of it using AI.

Charlie Hills
16,284 views • 23 days ago
Cancelled ChatGPT -> Built JARVIS -> Pays $0 -> it works offline + it's smarter than the $20/month version.
No WiFi needed, no cloud, no API keys, no rate limits, no queues, no $20/month just to ask a server in Virginia for the weather.
Just a local model running directly on the laptop hardware: voice activated, system integrated, controlling apps, answering questions, doing the work.
Iron Man had JARVIS embedded in his suit. This guy has it embedded in his MacBook, and it works on a plane, in a basement, in a remote cabin with zero signal.
OpenAI is burning $700,000 a day on infrastructure to deliver something this guy runs for free. Anthropic charges $200/month for unlimited Claude access. Microsoft built Copilot into every product they sell. This guy skipped all of it, downloaded a model and made his laptop the smartest device in the room.
No subscription. No login. No internet. No data sent anywhere ever.
The most powerful AI assistant on earth is now the one running locally on hardware you already own. ChatGPT charges you to think slower. He pays nothing and thinks alone. He made it himself.

Defileo🔮
153,466 views • 1 month ago
🚨 Alibaba just open sourced a GUI agent that lives inside your webpage and controls it with natural language.
It's called Page Agent and it's not a browser extension. It's pure JavaScript: no Python, no Puppeteer, no headless browser, no screenshots. Just one script tag and your web app understands natural language.
Here's what it actually does:
→ Embed it with a single tag or npm install
→ Control any web interface with plain English commands
→ Text-based DOM manipulation: no OCR, no vision models needed
→ Bring your own LLM (GPT, Claude, Qwen, anything)
→ Ships a built-in UI with human-in-the-loop support
→ Turn 20-click ERP/CRM workflows into one sentence
→ Optional Chrome extension for multi-tab agent tasks
→ Works on any web app: SaaS, admin panels, internal tools
Companies are charging $30/month for AI copilots built on this exact idea. This is 3 lines of code. Your users. Your interface.
The AI copilot layer for every web app just got open sourced.
1.6K stars. 100% Open Source. (Link in the comments)

Ihtesham Ali
134,969 views • 3 months ago
this is the worst local AI will ever be. tomorrow it gets faster. next month the models get smarter. next year your GPU runs what a data center runs today.
Qwen3.5-35B-A3B on a single 3090. told it to visualize its own expert routing. 256 experts, 8 active per token, rendered in 3D on the same GPU running inference.
no API key. no subscription. no permission needed.
closed AI isn't losing ground. it's losing the argument.

Sudo su
106,710 views • 3 months ago
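
The "256 experts, 8 active per token" routing mentioned in the post above can be illustrated with a small top-k router. This is a generic mixture-of-experts sketch, not the model's actual code: the function name is hypothetical, the logits are random stand-ins for a learned router's output, and normalizing with a softmax over only the selected experts is an assumption.

```python
import math
import random

def route_token(router_logits, k=8):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)  # subtract the max for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(256)]  # stand-in router scores for one token
chosen = route_token(logits)  # list of (expert_index, mixing_weight) pairs
```

Only the 8 chosen experts run for that token, which is why a model with 35B total parameters can have roughly 3B active per token.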
Introducing Pods
Hyperspace Pods lets a small group of people (a family, a startup, a few friends) pool their laptops and desktops into one AI cluster. Everyone installs the CLI, someone creates a pod, shares an invite link, and the machines form a mesh. Models like Qwen 3.5 32B or GLM-5 Turbo that need more memory than any single laptop has get automatically sharded across the group's devices - layers split proportionally, inference pipelined through the ring. From the outside it looks like one OpenAI-compatible API endpoint with a pk_* key that drops straight into your AI tools and products. No configuration beyond pasting the key and changing the base URL.
A team of five paying for cloud AI burns $500–2,000 a month on API calls. The same team's existing machines can serve Qwen 3.5 (competitive on SWE-bench) and GLM-5 Turbo (#1 on BrowseComp for tool-calling and web research) for free - the hardware is already on their desks. When a query genuinely needs a frontier model nobody has locally, the pod falls back to cloud at wholesale rates from a shared treasury. But for the daily work - code reviews, refactors, research, drafting - local models handle it and nobody gets billed. And when it is idle, you can rent out your pod on the compute marketplace, with fine-grained permissions for access management.
There's no central server involved in inference. Prompts go from your machine to your pod members' machines and back, all of this enabled by the fully peer-to-peer Hyperspace network. Pod state - who's a member, which API keys are valid, how much treasury is left - is replicated across members with consensus, so the whole thing works on a local network. Members behind home routers don't need port forwarding either.
The practical setup for most pods is three models covering different jobs: Qwen 3.5 32B for code and reasoning, GLM-5 Turbo for browsing and research, Gemma 4 for fast lightweight tasks. All running on hardware you already own.
Pods ship today in Hyperspace v5.19. Model sharding, API keys, treasury, and Raft coordinator are all live.
What Makes This Different
- No middleman. Your prompts travel from your IDE to your pod members' hardware and back. There is no server in between reading your data.
- No vendor lock-in. Pod membership, API keys, and treasury are replicated across your own machines using Raft consensus. If the internet goes down, your local network keeps working. There is no database in someone else's cloud that your pod depends on.
- Automatic sharding. You don't configure layer ranges or calculate VRAM budgets. Tell the pod which model you want. It figures out how to split it across whatever hardware is online.
- Real NAT traversal. Your friend behind a home router with a dynamic IP? Works. No VPN, no Tailscale, no port forwarding. The nodes handle it.
- Free when local. This is the part that matters most. Cloud AI bills scale with usage. Pod inference on local hardware scales with nothing. The marginal cost of your 10,000th prompt is the electricity your laptop was already using.
Coming soon:
- Pod federation: pods form alliances with other pods.
- Marketplace: pods with spare capacity can sell inference to other pods.

Varun
306,336 views • 2 months ago
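
The Pods post says layers are "split proportionally" across devices. One way such a split could work is sketched below; the function name, the device names, and the use of memory as the weighting are all assumptions for illustration, and nothing here reflects Hyperspace's actual scheduler.

```python
def split_layers(num_layers, device_memory_gb):
    """Assign contiguous layer ranges proportional to each device's memory."""
    total = sum(device_memory_gb.values())
    exact = {name: num_layers * mem / total for name, mem in device_memory_gb.items()}
    counts = {name: int(share) for name, share in exact.items()}
    # Largest-remainder rounding so the counts add up to num_layers exactly.
    leftover = num_layers - sum(counts.values())
    by_remainder = sorted(exact, key=lambda n: exact[n] - counts[n], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    # Hand out contiguous ranges in insertion order, forming the pipeline.
    ranges, start = {}, 0
    for name, count in counts.items():
        ranges[name] = range(start, start + count)
        start += count
    return ranges

plan = split_layers(64, {"laptop_a": 16, "desktop_b": 32, "laptop_c": 16})
uneven = split_layers(10, {"small": 1, "large": 2})
```

Each device would then run only its own range of layers and pass activations to the next device, which is the "pipelined through the ring" part of the description.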
Just dropped on HF: NeuTTS Air
Next-gen on-device TTS that matches cloud-level quality while staying fully open source.
> Real-time speech synthesis on CPU/GPU
> 3-second voice cloning, no cloud or data upload
> Compact: under 200 MB, runs on mobile and edge devices
> Multilingual and expressive
> Developed by Neuphonic, optimized for speed and fidelity

steven
72,273 views • 8 months ago
🔥 BREAKING: Open source just leveled up AI agents.
Eigent gives you a fully local, customizable AI workforce... built to run on your laptop.
→ No vendor lock-in
→ No cloud dependency
→ 100% open source
Just fast, private, parallel agents you control (Here's how):👇

Shruti
63,497 views • 11 months ago
JENSEN HUANG UNVEILED A BOARD THAT RUNS 1 TRILLION PARAMETER AI MODELS. THE $249 NVIDIA BOX UNDER YOUR DESK KILLS A $200/MONTH AI BILL FOR $5 IN ELECTRICITY
jensen held it up on stage with one hand and called it the architecture that runs the future of ai. that same technology now ships in a $249 box smaller than your wallet
the jetson orin nano super pulls 7-25 watts and does 67 trillion ai operations per second. llama 3, mistral and deepseek run locally with no api fees and no data leaving your machine
most developers pay $2,400 a year across chatgpt, openai api, claude pro and cursor. the jetson costs $314 in year one and $60 a year after. 2 year savings hit $4,431
install ollama with one command, change one line of code to point at localhost, and every tool built for openai works identically. zero rewrites, zero rate limits
cloud subscriptions keep getting more expensive and rate limits keep getting tighter. the people who own the box in 2026 are going to look very far ahead in 2028
bookmark this and read the article below

starmex
54,309 views • 24 days ago
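
The savings claim in the Jetson post can be checked with the figures it quotes. The sketch below uses only those numbers ($2,400 a year for cloud tools, $314 in year one, $60 a year after); the function name is hypothetical, and it assumes cloud spend stays flat from year to year.

```python
CLOUD_PER_YEAR = 2400      # quoted yearly spend on subscriptions and API usage
LOCAL_FIRST_YEAR = 314     # quoted year-one cost of the box
LOCAL_LATER_YEARS = 60     # quoted yearly cost afterwards

def savings(years):
    """Cumulative cloud spend minus cumulative local spend over `years` (>= 1)."""
    cloud = CLOUD_PER_YEAR * years
    local = LOCAL_FIRST_YEAR + LOCAL_LATER_YEARS * (years - 1)
    return cloud - local

one_year = savings(1)
two_year = savings(2)
```

With these inputs the two-year figure comes to $4,426, $5 less than the $4,431 quoted in the post; the post does not say where the difference comes from.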
This Chinese developer launched Llama 70B locally on a MacBook on a plane and ran client projects for a full 11 hours without internet.
He was sitting by the window on a transatlantic flight with a MacBook Pro M4 with 64 GB of memory. WiFi on board cost $25 for the flight. He declined.
No cloud API, no connection to Anthropic or OpenAI servers, no internet at all. Just a local Llama 3.3 70B in bf16 and his own orchestrator script.
The model runs through llama.cpp. Generation speed: 71 tokens per second. Context: around 60,000 tokens. Memory usage: 48.6 GiB out of 64. Battery at takeoff: 3 hours 21 minutes.
And he gave the orchestrator this system prompt before takeoff:
"You are an offline orchestrator running on a single MacBook. There is no network. The only resources you have are local files in /Users/dev/work, the Llama 70B inference server at localhost:8080, and a battery budget of 3 hours 21 minutes. Process the queue at /Users/dev/work/queue.jsonl (one client task per line). For each task: draft → run local evals → save artefact to /Users/dev/work/done/. Save context checkpoints every 12 tasks so you can resume after a battery swap. Stop only on empty queue or when battery drops below 5%."
So the system knows exactly what resources it is running on. It knows it has no connection to the outside world for the next 11 hours. It knows it has finite memory and a finite battery. It knows the human will not intervene until the plane lands.
The system runs in 1 loop. It takes a task from the queue, runs it through inference, saves the artifact, writes a checkpoint. Task after task, just like that. Only when the battery drops below 5% does the orchestrator pause, wait for the laptop to switch to the backup power bank, and continue from the last checkpoint.
Here is what the system actually writes in his log during the flight:
"saved context checkpoint 8 of 12 (pos_min = 488, pos_max = 50118, size = 62.813 MiB)"
"restored context checkpoint (pos_min = 488, pos_max = 50118)"
"prompt processing progress: n_tokens = 50 / 60 818"
"task 37016 done | tps = 71 s tokens text → /Users/dev/work/done/proposal_westside.md"
Outside the window: clouds, blue sky, and no WiFi. On the tray: 1 MacBook, an open terminal on 2 screens, and an inference server on localhost.
From what I have observed, this is the cleanest offline AI workflow I have seen in the past year: 11 hours of flight, $0 for WiFi, and the entire client queue closed before landing.

Blaze
1,824,930 views • 1 month ago
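
The loop described in the post above (take a task, run inference, save the artifact, checkpoint every 12 tasks, stop below 5% battery) can be sketched roughly as follows. This is not the developer's script: the task fields `id` and `prompt`, the `checkpoint.json` file, and the `infer` and `battery_percent` callables are all hypothetical, and the "run local evals" step is left out.

```python
import json
from pathlib import Path

CHECKPOINT_EVERY = 12   # from the quoted system prompt
MIN_BATTERY = 5         # percent, from the quoted system prompt

def run_queue(queue_path, done_dir, infer, battery_percent, start=0):
    """Process tasks from a JSONL queue; return the index to resume from."""
    lines = Path(queue_path).read_text().splitlines()
    tasks = [json.loads(line) for line in lines if line.strip()]
    done_dir = Path(done_dir)
    done_dir.mkdir(parents=True, exist_ok=True)
    checkpoint = done_dir / "checkpoint.json"
    for index in range(start, len(tasks)):
        if battery_percent() < MIN_BATTERY:
            # Record where to resume, then stop until power is restored.
            checkpoint.write_text(json.dumps({"next": index}))
            return index
        task = tasks[index]
        (done_dir / f"{task['id']}.md").write_text(infer(task["prompt"]))
        if (index + 1) % CHECKPOINT_EVERY == 0:
            checkpoint.write_text(json.dumps({"next": index + 1}))
    return len(tasks)
```

In a real setup `infer` might post the prompt to the local inference server at localhost:8080 and `battery_percent` would read the operating system's battery level; passing both in as functions keeps the loop itself testable offline.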
NVIDIA just made paying for AI feel optional. Open model, a million tokens of context, free tier with no per-token cost, runs on your own hardware. Entire codebases, whole data rooms, a year of chat logs, all swallowed in one prompt. No chunking, no RAG, no rate limit theater. The closed-AI premium has 90 days to defend itself. Bookmark this and come back. Open beat closed. Again.

shmidt
294,954 views • 18 days ago
BlackBird now runs on 8GB RAM Macs. No GPU. No cloud. Just fast, private AI agents - right on your MacBook Air.
We optimized memory, speed, and thermal performance so anyone can build with AI.
Try it:
Next Stop: Windows Beta Drops This Week! DM me if you want to try it.
#OnDeviceAI #BlackBird #AIforEveryone #macOS

Hina Dixit
1,233,525 views • 1 year ago
Today we’re open-sourcing Stable Audio Open Small, a 341M-parameter text-to-audio model optimized to run entirely on Arm CPUs. This means 99% of smartphones can now generate music-production samples in seconds, right on-device with no internet required.
Built for fast, on-the-go creation, it turns your next quick idea into up to 11 seconds of audio. Generate drum loops, foley, riffs, and textures right where you are. No cords 🔌 just chords 🎹
You can learn more here:

Stability AI
94,773 views • 1 year ago
Llama 3.2 is the latest open-source AI model from Meta, released only a few hours ago. Here is the 3B parameter model running on Akash Chat at 165 tokens/second, powered by NVIDIA A100s on Akash. Try Llama 3.2 for free, no sign-in required:

Akash Network
37,087 views • 1 year ago
The first phone where your AI never leaves your device. No cloud processing. No data harvesting. Complete AI sovereignty. Built on Galaxy S25 Edge hardware. Earn rewards through the Gaia network. 1,000 units now available. Additional releases planned.

Gaia 🌱
157,416 views • 9 months ago
Run Gemma 4 26B MoE on 8GB VRAM with 250k context at 20+ tokens/sec
If you own any 8GB VRAM graphics card, stop what you are doing. Local AI just had its absolute "Holy Shit" moment for budget hardware.
Yesterday, I benchmarked Unsloth Gemma 4 12B Q4_K_XL on an 8GB card. The community went wild but immediately demanded more: "Can we run a 25B+ model on budget GPUs?"
Today, I’m delivering exactly that. I am running a massive 26B parameter Mixture of Experts (MoE) model locally on a standard 8GB VRAM setup with 250k full native context! If you own an RTX 3060, 3070, 4060, or any budget GPU with 8GB of VRAM, the local AI paradigm has completely changed.
The performance metrics are astonishing:
- 20 tokens/sec flat decode throughput.
- Stable, flat decode speed even with massive prompts.
- I threw a 60k token prompt at it, and it still clocked in at 20 TPS without dropping a single frame.
# What about prefill?
Yes, Time To First Token (TTFT) is slightly high when swallowing massive contexts. But with a solid 200 tokens/sec prefill speed, the wait is barely noticeable and highly usable. And this is running completely without Multi Token Prediction (MTP) active.
How is this possible? It’s the magic of Google's new QAT (Quantization Aware Training) quants for Gemma 4. The model weight file (unsloth gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf) is only 13.2 GB, making it the ultimate local powerhouse.
# The Test Setup:
CPU: Intel Core i7
RAM: 16GB System RAM
GPU: NVIDIA GeForce RTX 4060 Laptop GPU (8GB VRAM)
# The Secret Sauce (The -cmoe Flag)
To make this work properly on any 8GB card, you must use the -cmoe (CPU MoE) flag in llama.cpp. This flag keeps the heavy MoE expert weights in system memory (CPU/RAM) while letting your GPU focus strictly on the attention layers and the KV cache. It prevents VRAM spillage and holds the throughput rock solid.
# The flags:
-m "gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf" -cmoe -c 248000 -v
Once running, just open the UI on localhost and toggle the new reasoning lightbulb icon in the text input box to watch the model perform multi-step thinking.
Are you still running smaller models, or are you ready to scale up your budget local setups? Let's discuss in the replies

Alok
290,161 views • 17 days ago
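
To put the prefill and decode figures from the Gemma post side by side, here is a small worked calculation using only the numbers it quotes (200 tokens/sec prefill, 20 tokens/sec decode, a 60k-token prompt, a 248,000-token context). The function names are hypothetical, and it assumes prefill speed stays flat as the prompt grows, which the post does not state.

```python
def time_to_first_token(prompt_tokens, prefill_tps):
    """Seconds spent processing the prompt before the first output token."""
    return prompt_tokens / prefill_tps

def generation_time(output_tokens, decode_tps):
    """Seconds spent producing the reply once decoding starts."""
    return output_tokens / decode_tps

ttft_60k = time_to_first_token(60_000, 200)       # the 60k-token prompt from the post
ttft_full = time_to_first_token(248_000, 200)     # a prompt filling the -c 248000 context
reply_500 = generation_time(500, 20)              # a 500-token answer (illustrative length)
```

Under that assumption, the 60k-token prompt takes about 300 seconds before the first token appears, and a prompt filling the whole context takes about 1,240 seconds, while a 500-token reply then takes about 25 seconds to generate.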
Meet Stable Audio 3.0, the open-weight model family built for artistic experimentation.
This is our open invitation to experiment with generative audio. We believe the best innovations are still waiting to be built.
The 4-1-1 on 3.0:
📣 You own your outputs, and can distribute and commercialize them under the Stability AI Community License (up to $1 million in revenue).
🎵 New and improved capabilities include variable-length generation up to six minutes, and full song composition on portable devices, no GPU required.
✅ Trained on a fully licensed dataset.
🎨 You can customize the models on your own library with support for LoRA training, which we’ve documented for the first time.
More on the models 👇

Stability AI
154,029 views • 1 month ago
Depth Anything 3 now runs as pure C++/ggml (ggml). No Python, no PyTorch, no CUDA toolkit at inference, just one self-contained GGUF. It's faster than PyTorch on CPU and ties it on GPU. The CPU win came from the last place... I'd have looked. Quantized GGUF on Hugging Face🤗 Shout out to Georgi Gerganov for ggml (we are building a ggml-world!❤️) and to ByteDance Open Source and Depth Anything 3 authors Bingyi Kang, Jun Hao Liew, and Donny Y. Chen!

Ettore Di Giacinto
33,985 views • 5 days ago
Meet #DBRX: a general-purpose LLM that sets a new standard for efficient open source models. Use the DBRX model in your RAG apps or use the DBRX design to build your own custom LLMs and improve the quality of your GenAI applications.

Databricks
327,704 views • 2 years ago