Uploaded: 2026-03-11T07:10:44.000Z
Duration: PT13.683S
Channel: Guri Singh

Holy shit... Microsoft open sourced an inference framework that runs a 100B parameter LLM on a single CPU.
It's called BitNet. And it does what was supposed to be impossible. No GPU. No cloud. No $10K hardware setup. Just your laptop running a 100-billion parameter model at human reading speed.
Here's how it works: Most LLMs store weights in 32-bit or 16-bit floats. BitNet uses 1.58 bits. Weights are ternary: just -1, 0, or +1. That's it. No float weights. No expensive multiplications in the matrix math. Mostly integer additions and subtractions your CPU was already built for.
The result:
- 100B model runs on a single CPU at 5-7 tokens/second
- 2.37x to 6.17x faster than llama.cpp on x86
- 82% lower energy consumption on x86 CPUs
- 1.37x to 5.07x speedup on ARM (your MacBook)
- Memory drops by 16-32x vs full-precision models
The wildest part: Accuracy barely moves. BitNet b1.58 2B4T, their flagship model, was trained on 4 trillion tokens and benchmarks competitively against full-precision models of the same size. The quantization isn't destroying quality. It's just removing the bloat.
What this actually means:
- Run AI completely offline. Your data never leaves your machine
- Deploy LLMs on phones, IoT devices, edge hardware
- No more cloud API bills for inference
- AI in regions with no reliable internet
The model supports ARM and x86. Works on your MacBook, your Linux box, your Windows machine.
27.4K GitHub stars. 2.2K forks. Built by Microsoft Research. 100% Open Source. MIT License.

Guri Singh

2,180,357 views • 3 months ago
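The ternary-weight idea in the BitNet post above can be illustrated with a small sketch. It assumes absmean-style scaling (divide by the mean absolute weight, round, clip to -1/0/+1) and is not the framework's actual kernels; the function names `quantize_ternary`, `ternary_dot`, and `weight_memory_gb` are made up for illustration.

```python
import math

def quantize_ternary(weights, eps=1e-8):
    """Map float weights to {-1, 0, +1} plus one float scale (absmean-style)."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    ternary = [max(-1, min(1, round(w / scale))) for w in weights]
    return ternary, scale

def ternary_dot(ternary, activations):
    """Dot product against ternary weights: only additions and subtractions."""
    total = 0
    for t, a in zip(ternary, activations):
        if t == 1:
            total += a
        elif t == -1:
            total -= a
        # t == 0 contributes nothing, so the activation is skipped entirely
    return total

# Three possible values carry log2(3) bits of information, hence "1.58 bits".
BITS_PER_TERNARY_WEIGHT = math.log2(3)

def weight_memory_gb(n_params, bits_per_weight):
    """Idealized weight storage in GB, ignoring scales and packing overhead."""
    return n_params * bits_per_weight / 8 / 1e9

ternary, scale = quantize_ternary([0.9, -0.05, -1.2, 0.4])
print(ternary, ternary_dot(ternary, [3, 5, 2, 4]))
print(weight_memory_gb(100e9, 16), weight_memory_gb(100e9, BITS_PER_TERNARY_WEIGHT))
```

The real result of the dot product would still be multiplied by `scale` once per output; the point of the sketch is only that the inner loop needs no multiplications.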

Someone just built a desktop app that generates...

How To Prompt

222,682 views • 2 months ago

Google Translate is cooked after this. A developer built a local AI translation engine that runs 40 languages entirely on your own laptop.
It's called LibreTranslate. No API key. No usage limits. No sending your documents to Google's servers. You install it once. It runs forever.
Here's what it handles:
→ Paste text. Translated instantly.
→ Drop in a file. Outputs the translated version.
→ Point it at a URL. Returns the page in your language.
→ Build it into your own app via its local REST API.
The speed is not the story. The privacy is. Google Translate reads every sentence you paste into it. Legal contracts. Medical records. Internal emails. Client documents. Every word goes to their servers and stays there. LibreTranslate runs entirely offline. Nothing leaves your machine. Ever.
The numbers:
→ 40 languages supported
→ Runs on CPU, no GPU needed
→ Self-hosted in under 5 minutes
→ REST API built in for developers
→ 10K+ stars on GitHub
100% open source. AGPL-3.0 licensed. Price: $0. Google charges nothing for Translate either, but it charges you something else.
GitHub:

Rimsha Bhardwaj

88,180 views • 9 days ago
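For the local REST API mentioned in the LibreTranslate post above, a minimal client sketch might look like this. It assumes a self-hosted instance listening on `http://localhost:5000` (an assumption; adjust to your setup) and uses the `/translate` endpoint with the `q`, `source`, `target`, and `format` fields; the helper names are made up for illustration.

```python
import json
import urllib.request

def build_translate_request(text, source, target, base_url="http://localhost:5000"):
    """Build a POST request for a LibreTranslate-style /translate endpoint."""
    payload = {"q": text, "source": source, "target": target, "format": "text"}
    return urllib.request.Request(
        f"{base_url}/translate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def translate(text, source="en", target="es", base_url="http://localhost:5000"):
    """Send the request to the local server and return the translated string."""
    req = build_translate_request(text, source, target, base_url)
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))["translatedText"]

if __name__ == "__main__":
    # Requires a running local instance; nothing is sent beyond base_url.
    print(translate("Hello, world", source="en", target="es"))
```

Because the base URL points at localhost, the text in the request never leaves the machine, which is the privacy property the post emphasizes.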

PewDiePie just hit 20K GitHub stars in under 24 hours.
The project? Odysseus. A self-hosted AI workspace that runs 100% on your machine.
• Agents with tools
• MCP built in
• Persistent memory
• File handling
• Windows, macOS, Linux
Your data never leaves your device. It supports Ollama, llama.cpp, and vLLM locally, with OpenAI and OpenRouter support if you want cloud models too.
The crazy part? A YouTuber with 110M+ subscribers just out-shipped most AI startups. And he built half of it using AI.

Charlie Hills

16,284 views • 23 days ago

Cancelled ChatGPT -> Built JARVIS -> Pays $0 -> it works offline + it's smarter than the $20/month version.
No WiFi needed, no cloud, no API keys, no rate limits, no queues, no $20/month just to ask a server in Virginia for the weather. Just a local model running directly on the laptop hardware: voice activated, system integrated, controlling apps, answering questions, doing the work.
Iron Man had JARVIS embedded in his suit. This guy has it embedded in his MacBook, and it works on a plane, in a basement, in a remote cabin with zero signal.
OpenAI is burning $700,000 a day on infrastructure to deliver something this guy runs for free. Anthropic charges $200/month for unlimited Claude access. Microsoft built Copilot into every product they sell. This guy skipped all of it, downloaded a model and made his laptop the smartest device in the room.
No subscription. No login. No internet. No data sent anywhere, ever.
The most powerful AI assistant on earth is now the one running locally on hardware you already own. ChatGPT charges you to think slower. He pays nothing and thinks alone. He made it himself.

Defileo🔮

153,466 views • 1 month ago

🚨 Alibaba just open sourced a GUI agent that lives inside your webpage and controls it with natural language.
It's called Page Agent and it's not a browser extension. It's pure JavaScript: no Python, no Puppeteer, no headless browser, no screenshots. Just one script tag and your web app understands natural language.
Here's what it actually does:
→ Embed it with a single tag or npm install
→ Control any web interface with plain English commands
→ Text-based DOM manipulation: no OCR, no vision models needed
→ Bring your own LLM (GPT, Claude, Qwen, anything)
→ Ships a built-in UI with human-in-the-loop support
→ Turn 20-click ERP/CRM workflows into one sentence
→ Optional Chrome extension for multi-tab agent tasks
→ Works on any web app: SaaS, admin panels, internal tools
Companies are charging $30/month for AI copilots built on this exact idea. This is 3 lines of code. Your users. Your interface.
The AI copilot layer for every web app just got open sourced. 1.6K stars. 100% Open Source.
(Link in the comments)

Ihtesham Ali

134,969 views • 3 months ago

this is the worst local AI will ever be...

Sudo su

106,710 views • 3 months ago

Introducing Pods
Hyperspace Pods lets a small group of people (a family, a startup, a few friends) pool their laptops and desktops into one AI cluster. Everyone installs the CLI, someone creates a pod and shares an invite link, and the machines form a mesh. Models like Qwen 3.5 32B or GLM-5 Turbo that need more memory than any single laptop has get automatically sharded across the group's devices: layers split proportionally, inference pipelined through the ring. From the outside it looks like one OpenAI-compatible API endpoint with a pk_* key that drops straight into your AI tools and products. No configuration beyond pasting the key and changing the base URL.
A team of five paying for cloud AI burns $500–2,000 a month on API calls. The same team's existing machines can serve Qwen 3.5 (competitive on SWE-bench) and GLM-5 Turbo (#1 on BrowseComp for tool-calling and web research) for free, because the hardware is already on their desks. When a query genuinely needs a frontier model nobody has locally, the pod falls back to cloud at wholesale rates from a shared treasury. But for the daily work (code reviews, refactors, research, drafting) local models handle it and nobody gets billed. And when it is idle, you can rent out your pod on the compute marketplace, with fine-grained permissions for access management.
There's no central server involved in inference. Prompts go from your machine to your pod members' machines and back, all enabled by the fully peer-to-peer Hyperspace network. Pod state (who's a member, which API keys are valid, how much treasury is left) is replicated across members with consensus, so the whole thing keeps working on a local network. Members behind home routers don't need port forwarding either.
The practical setup for most pods is three models covering different jobs: Qwen 3.5 32B for code and reasoning, GLM-5 Turbo for browsing and research, Gemma 4 for fast lightweight tasks. All running on hardware you already own.
Pods ship today in Hyperspace v5.19. Model sharding, API keys, treasury, and Raft coordinator are all live.
What Makes This Different
- No middleman. Your prompts travel from your IDE to your pod members' hardware and back. There is no server in between reading your data.
- No vendor lock-in. Pod membership, API keys, and treasury are replicated across your own machines using Raft consensus. If the internet goes down, your local network keeps working. There is no database in someone else's cloud that your pod depends on.
- Automatic sharding. You don't configure layer ranges or calculate VRAM budgets. Tell the pod which model you want. It figures out how to split it across whatever hardware is online.
- Real NAT traversal. Your friend behind a home router with a dynamic IP? Works. No VPN, no Tailscale, no port forwarding. The nodes handle it.
- Free when local. This is the part that matters most. Cloud AI bills scale with usage. Pod inference on local hardware scales with nothing. The marginal cost of your 10,000th prompt is the electricity your laptop was already using.
Coming soon:
- Pod federation: pods form alliances with other pods.
- Marketplace: pods with spare capacity can sell inference to other pods.

Varun

306,336 views • 2 months ago
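The "layers split proportionally" step in the Pods post above could work roughly like the sketch below. This is not Hyperspace's actual algorithm, which the post does not describe; it simply assigns contiguous layer ranges in proportion to each device's memory, and the function name and device names are hypothetical.

```python
def split_layers(n_layers, memory_gb):
    """Assign contiguous layer ranges to devices in proportion to their memory.

    memory_gb maps a device name to its available memory (any consistent unit).
    """
    total = sum(memory_gb.values())
    exact = {name: n_layers * mem / total for name, mem in memory_gb.items()}
    counts = {name: int(share) for name, share in exact.items()}
    leftover = n_layers - sum(counts.values())
    # Hand the remaining layers to the devices with the largest fractional share.
    by_fraction = sorted(exact, key=lambda n: exact[n] - counts[n], reverse=True)
    for name in by_fraction[:leftover]:
        counts[name] += 1
    # Turn the counts into contiguous ranges so inference can be pipelined.
    ranges, start = {}, 0
    for name, count in counts.items():
        ranges[name] = range(start, start + count)
        start += count
    return ranges

# Hypothetical pod: one desktop and two laptops sharing a 64-layer model.
for device, layers in split_layers(64, {"desktop": 32, "laptop-a": 16, "laptop-b": 16}).items():
    print(device, layers.start, layers.stop)
```

A real scheduler would also have to account for KV cache size, link speed between members, and devices joining or leaving, none of which this sketch models.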

Just dropped on HF: NeuTTS Air Next-gen on-device...

steven

72,273 views • 8 months ago

🔥 BREAKING: Open source just leveled up AI agents...

Shruti

63,497 views • 11 months ago

JENSEN HUANG UNVEILED A BOARD THAT RUNS 1 TRILLION PARAMETER AI MODELS. THE $249 NVIDIA BOX UNDER YOUR DESK KILLS A $200/MONTH AI BILL FOR $5 IN ELECTRICITY
jensen held it up on stage with one hand and called it the architecture that runs the future of ai. that same technology now ships in a $249 box smaller than your wallet
the jetson orin nano super pulls 7-25 watts and does 67 trillion ai operations per second. llama 3, mistral and deepseek run locally with no api fees and no data leaving your machine
most developers pay $2,400 a year across chatgpt, openai api, claude pro and cursor. the jetson costs $314 in year one and $60 a year after. 2 year savings hit $4,431
install ollama with one command, change one line of code to point at localhost, and every tool built for openai works identically. zero rewrites, zero rate limits
cloud subscriptions keep getting more expensive and rate limits keep getting tighter. the people who own the box in 2026 are going to look very far ahead in 2028
bookmark this and read the article below

starmex

54,309 views • 24 days ago
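The savings claim in the Jetson post above is simple arithmetic, sketched below using only the figures the post quotes ($2,400 a year for cloud tools, $314 in year one, $60 a year after). The function name is made up for illustration. With those inputs the function returns 4426, slightly below the $4,431 in the post, so the post may be relying on an input it does not state.

```python
def two_year_savings(cloud_per_year, box_year_one, box_per_year_after):
    """Savings over two years from replacing cloud subscriptions with a local box."""
    cloud_total = 2 * cloud_per_year
    box_total = box_year_one + box_per_year_after  # year one + year two
    return cloud_total - box_total

# Figures as quoted in the post.
print(two_year_savings(cloud_per_year=2400, box_year_one=314, box_per_year_after=60))
```

The comparison also assumes the local models are an acceptable substitute for every cloud tool in that $2,400, which the arithmetic alone cannot show.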

This Chinese developer launched Llama 70B locally on a MacBook on a plane and ran client projects for a full 11 hours without internet.
He was sitting by the window on a transatlantic flight with a MacBook Pro M4 with 64 GB of memory. WiFi on board cost $25 for the flight. He declined. No cloud API, no connection to Anthropic or OpenAI servers, no internet at all. Just a local Llama 3.3 70B in bf16 and his own orchestrator script.
The model runs through llama.cpp. Generation speed, 71 tokens per second. Context around 60,000 tokens. Memory usage, 48.6 GiB out of 64. Battery at takeoff, 3 hours 21 minutes.
And he gave the orchestrator this system prompt before takeoff:
"You are an offline orchestrator running on a single MacBook. There is no network. The only resources you have are local files in /Users/dev/work, the Llama 70B inference server at localhost:8080, and a battery budget of 3 hours 21 minutes. Process the queue at /Users/dev/work/queue.jsonl (one client task per line). For each task: draft → run local evals → save artefact to /Users/dev/work/done/. Save context checkpoints every 12 tasks so you can resume after a battery swap. Stop only on empty queue or when battery drops below 5%."
So the system knows exactly what resources it is running on. It knows it has no connection to the outside world for the next 11 hours. It knows it has finite memory and a finite battery. It knows the human will not intervene until the plane lands.
The system runs in 1 loop. It takes a task from the queue, runs it through inference, saves the artifact, and writes a checkpoint. Task after task, just like that. Only when the battery drops below 5% does the orchestrator automatically pause, wait for the laptop to switch to the backup power bank, and continue from the last checkpoint.
Here is what the system actually writes in his log during the flight:
"saved context checkpoint 8 of 12 (pos_min = 488, pos_max = 50118, size = 62.813 MiB)"
"restored context checkpoint (pos_min = 488, pos_max = 50118)"
"prompt processing progress: n_tokens = 50 / 60 818"
"task 37016 done | tps = 71 s tokens text → /Users/dev/work/done/proposal_westside.md"
Outside the window, clouds, blue sky, and no WiFi. On the tray, 1 MacBook, an open terminal on 2 screens, and an inference server on localhost.
From what I have observed, this is the cleanest offline AI workflow I have seen in the past year: 11 hours of flight, $0 for WiFi, and the entire client queue closed before landing.

Blaze

1,824,930 views • 1 month ago
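The queue loop described in the offline-orchestrator post above can be sketched as follows. This is not the developer's script, which the post does not show. It assumes a JSONL queue with one task per line, checkpoints only the queue position (not the inference server's context), and takes the inference call and the battery reading as caller-supplied functions; `run_task` and `battery_percent` are hypothetical names.

```python
import json
from pathlib import Path

def run_queue(queue_path, done_dir, checkpoint_path, run_task, battery_percent,
              checkpoint_every=12, min_battery=5):
    """Process a JSONL task queue, resuming from the last saved position."""
    queue_path, done_dir = Path(queue_path), Path(done_dir)
    checkpoint_path = Path(checkpoint_path)
    done_dir.mkdir(parents=True, exist_ok=True)

    # Resume from the last checkpoint if one exists.
    index = 0
    if checkpoint_path.exists():
        index = json.loads(checkpoint_path.read_text())["next_task"]

    lines = queue_path.read_text().splitlines()
    tasks = [json.loads(line) for line in lines if line.strip()]

    while index < len(tasks):
        if battery_percent() < min_battery:
            break  # stop so the run can resume later from the checkpoint
        artifact = run_task(tasks[index])  # e.g. a call to a local inference server
        (done_dir / f"task_{index}.md").write_text(artifact)
        index += 1
        if index % checkpoint_every == 0:
            checkpoint_path.write_text(json.dumps({"next_task": index}))

    # Always record where we stopped, whether the queue is empty or the battery is low.
    checkpoint_path.write_text(json.dumps({"next_task": index}))
    return index
```

Calling `run_queue` again after a battery swap picks up at the saved `next_task`, which mirrors the resume behavior the post describes.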

NVIDIA just made paying for AI feel optional. Open...

shmidt

294,954 views • 18 days ago

BlackBird now runs on 8GB RAM Macs. No GPU...

Hina Dixit

1,233,525 views • 1 year ago

Today we’re open-sourcing Stable Audio Open Small, a 341M-parameter text-to-audio model optimized to run entirely on Arm CPUs. This means 99% of smartphones can now generate music-production samples in seconds, right on-device with no internet required.
Built for fast, on-the-go creation, it turns your next quick idea into up to 11 seconds of audio. Generate drum loops, foley, riffs, and textures right where you are. No cords 🔌 just chords 🎹
You can learn more here:

Stability AI

94,773 views • 1 year ago

Llama 3.2 is the latest open-source AI model from...

Akash Network

37,087 views • 1 year ago

The first phone where your AI never leaves your...

Gaia 🌱

157,416 views • 9 months ago

Run Gemma 4 26B MoE on 8GB VRAM with 250k context at 20+ tokens/sec
If you own any 8GB VRAM graphics card, stop what you are doing. Local AI just had its absolute "Holy Shit" moment for budget hardware.
Yesterday, I benchmarked Unsloth Gemma 4 12B Q4_K_XL on an 8GB card. The community went wild but immediately demanded more: "Can we run a 25B+ model on budget GPUs?" Today, I’m delivering exactly that. I am running a massive 26B parameter Mixture of Experts (MoE) model locally on a standard 8GB VRAM setup with 250k full native context!
If you own an RTX 3060, 3070, 4060, or any budget GPU with 8GB of VRAM, the local AI paradigm has completely changed. The performance metrics are astonishing:
- 20 tokens/sec flat decode throughput.
- Stable, flat decode speed even with massive prompts.
- I threw a 60k token prompt at it, and it still clocked in at 20 TPS without slowing down.
# What about prefill?
Yes, Time To First Token (TTFT) is slightly high when swallowing massive contexts. But with a solid 200 tokens/sec prefill speed, the wait is barely noticeable and highly usable. And this is running completely without Multi Token Prediction (MTP) active.
How is this possible? It’s the magic of Google's new QAT (Quantization Aware Training) quants for Gemma 4. The model weight file (unsloth gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf) is only 13.2 GB, making it the ultimate local powerhouse.
# The Test Setup:
CPU: Intel Core i7
RAM: 16GB System RAM
GPU: NVIDIA GeForce RTX 4060 Laptop GPU (8GB VRAM)
# The Secret Sauce (The -cmoe Flag)
To make this work properly on any 8GB card, you must use the -cmoe (CPU MoE) flag in llama.cpp. This flag keeps the heavy MoE expert weights in system memory (CPU/RAM) while letting your GPU focus strictly on the Attention layers and the KV Cache. It prevents VRAM spillage and holds the throughput rock solid.
# The flags:
-m "gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf" -cmoe -c 248000 -v
Once running, just open the UI on localhost and toggle the new reasoning lightbulb icon in the text input box to watch the model perform multi-step thinking.
Are you still running smaller models, or are you ready to scale up your budget local setups? Let's discuss in the replies

Alok

290,161 views • 17 days ago

Meet Stable Audio 3.0, the open-weight model family built for artistic experimentation. This is our open invitation to experiment with generative audio. We believe the best innovations are still waiting to be built.
The 4-1-1 on 3.0:
📣 You own your outputs, and can distribute and commercialize them under the Stability AI Community License (up to $1 million in revenue).
🎵 New and improved capabilities include variable-length generation up to six minutes, and full song composition on portable devices, no GPU required.
✅ Trained on a fully licensed dataset.
🎨 You can customize the models on your own library with support for LoRA training, which we’ve documented for the first time.
More on the models 👇

Stability AI

154,029 views • 1 month ago

Depth Anything 3 now runs as pure C++/ggml (ggml). No Python, no PyTorch, no CUDA toolkit at inference, just one self-contained GGUF. It's faster than PyTorch on CPU, and ties it on speed on GPU! The CPU win came from the last place I'd have looked.
Quantized GGUF on Hugging Face🤗
Shout out to Georgi Gerganov for ggml (we are building a ggml-world!❤️) and to ByteDance Open Source and Depth Anything 3 authors Bingyi Kang, Jun Hao Liew, and Donny Y. Chen!

Ettore Di Giacinto

33,985 views • 5 days ago

Meet #DBRX: a general-purpose LLM that sets a new...

Databricks

327,704 views • 2 years ago
