Video yükleniyor...

Video Yüklenemedi

Bu video yüklenirken bir sorun oluştu. Bu geçici bir ağ sorunundan kaynaklanıyor olabilir veya video kullanılamıyor olabilir.

Ana Sayfaya Dön

Holy shit... Microsoft open sourced an inference framework that runs a 100B parameter LLM on a single CPU. It's called BitNet. And it does what was supposed to be impossible. No GPU. No cloud. No $10K hardware setup. Just your laptop running a 100-billion parameter model at human reading... speed. Here's how it works: Every other LLM stores weights in 32-bit or 16-bit floats. BitNet uses 1.58 bits. Weights are ternary just -1, 0, or +1. That's it. No floats. No expensive matrix math. Pure integer operations your CPU was already built for. The result: - 100B model runs on a single CPU at 5-7 tokens/second - 2.37x to 6.17x faster than llama.cpp on x86 - 82% lower energy consumption on x86 CPUs - 1.37x to 5.07x speedup on ARM (your MacBook) - Memory drops by 16-32x vs full-precision models The wildest part: Accuracy barely moves. BitNet b1.58 2B4T their flagship model was trained on 4 trillion tokens and benchmarks competitively against full-precision models of the same size. The quantization isn't destroying quality. It's just removing the bloat. What this actually means: - Run AI completely offline. Your data never leaves your machine - Deploy LLMs on phones, IoT devices, edge hardware - No more cloud API bills for inference - AI in regions with no reliable internet The model supports ARM and x86. Works on your MacBook, your Linux box, your Windows machine. 27.4K GitHub stars. 2.2K forks. Built by Microsoft Research. 100% Open Source. MIT License.show more

Guri Singh

59,928 subscribers

2,180,357 görüntüleme • 3 ay önce •via X (Twitter)

Eğitim Sağlık & İyilik Bilim & Teknoloji

Anya Rossi• Live Now

Private livecam show

0 Yorum

Yorum bulunmuyor

Orijinal gönderinin yorumları burada görünecek

Benzer Videolar

Someone just built a desktop app that that generates 3D models from images and runs 100% locally. It's called Modly. It runs entirely on your GPU, no cloud, no API bills. Just drop an image and get a 3D mesh. 100% Open Source.

Someone just built a desktop app that that generates 3D models from images and runs 100% locally. It's called Modly. It runs entirely on your GPU, no cloud, no API bills. Just drop an image and get a 3D mesh. 100% Open Source.

How To Prompt

222,503 görüntüleme • 2 ay önce

Google Translate is cooked after this. A developer built a local AI translation engine that runs 40 languages entirely on your own laptop. It's called LibreTranslate. No API key. No usage limits. No sending your documents to Google's servers. You install it once. It runs forever. Here's what it handles: → Paste text. Translated instantly. → Drop in a file. Outputs the translated version. → Point it at a URL. Returns the page in your language. → Build it into your own app via its local REST API. The speed is not the story. The privacy is. Google Translate reads every sentence you paste into it. Legal contracts. Medical records. Internal emails. Client documents. Every word goes to their servers and stays there. LibreTranslate runs entirely offline. Nothing leaves your machine. Ever. The numbers: → 40 languages supported → Runs on CPU -- no GPU needed → Self-hosted in under 5 minutes → REST API built in for developers → 10K+ stars on GitHub 100% open source. MIT licensed. Price: $0. Google charges nothing for Translate either but it charges you something else. GitHub:

Google Translate is cooked after this. A developer built a local AI translation engine that runs 40 languages entirely on your own laptop. It's called LibreTranslate. No API key. No usage limits. No sending your documents to Google's servers. You install it once. It runs forever. Here's what it handles: → Paste text. Translated instantly. → Drop in a file. Outputs the translated version. → Point it at a URL. Returns the page in your language. → Build it into your own app via its local REST API. The speed is not the story. The privacy is. Google Translate reads every sentence you paste into it. Legal contracts. Medical records. Internal emails. Client documents. Every word goes to their servers and stays there. LibreTranslate runs entirely offline. Nothing leaves your machine. Ever. The numbers: → 40 languages supported → Runs on CPU -- no GPU needed → Self-hosted in under 5 minutes → REST API built in for developers → 10K+ stars on GitHub 100% open source. MIT licensed. Price: $0. Google charges nothing for Translate either but it charges you something else. GitHub:

Rimsha Bhardwaj

86,519 görüntüleme • 5 gün önce

PewDiePie just hit 20K GitHub stars in under 24 hours. The project? Odysseus. A self-hosted AI workspace that runs 100% on your machine. • Agents with tools • MCP built in • Persistent memory • File handling • Windows, macOS, Linux Your data never leaves your device. It supports Ollama, llama.cpp, and vLLM locally with OpenAI and OpenRouter support if you want cloud models too. The crazy part? A YouTuber with 110M+ subscribers just out-shipped most AI startups. And he built half of it using AI.

PewDiePie just hit 20K GitHub stars in under 24 hours. The project? Odysseus. A self-hosted AI workspace that runs 100% on your machine. • Agents with tools • MCP built in • Persistent memory • File handling • Windows, macOS, Linux Your data never leaves your device. It supports Ollama, llama.cpp, and vLLM locally with OpenAI and OpenRouter support if you want cloud models too. The crazy part? A YouTuber with 110M+ subscribers just out-shipped most AI startups. And he built half of it using AI.

Charlie Hills

16,254 görüntüleme • 19 gün önce

Cancelled ChatGPT -> Built JARVIS -> Pays $0 -> it works offline + it's smarter than the $20/month version. No WiFi needed, no cloud, no API keys, no rate limits, no queues, no $20/month just to ask a server in Virginia for the weather. Just a local model running directly on the laptop hardware, voice activated, system integrated, controlling apps, answering questions, doing the work. Iron Man had JARVIS embedded in his suit, this guy has it embedded in his MacBook and it works on a plane, in a basement, on a remote cabin with zero signal. OpenAI is burning $700,000 a day on infrastructure to deliver something this guy runs for free. Anthropic charges $200/month for unlimited Claude access, microsoft built Copilot into every product they sell. This guy skipped all of it, downloaded a model and made his laptop the smartest device in the room. No subscription. No login. No internet. No data sent anywhere ever. The most powerful AI assistant on earth is now the one running locally on hardware you already own. ChatGPT charges you to think slower, he pays nothing and thinks alone, he made it himself.

Cancelled ChatGPT -> Built JARVIS -> Pays $0 -> it works offline + it's smarter than the $20/month version. No WiFi needed, no cloud, no API keys, no rate limits, no queues, no $20/month just to ask a server in Virginia for the weather. Just a local model running directly on the laptop hardware, voice activated, system integrated, controlling apps, answering questions, doing the work. Iron Man had JARVIS embedded in his suit, this guy has it embedded in his MacBook and it works on a plane, in a basement, on a remote cabin with zero signal. OpenAI is burning $700,000 a day on infrastructure to deliver something this guy runs for free. Anthropic charges $200/month for unlimited Claude access, microsoft built Copilot into every product they sell. This guy skipped all of it, downloaded a model and made his laptop the smartest device in the room. No subscription. No login. No internet. No data sent anywhere ever. The most powerful AI assistant on earth is now the one running locally on hardware you already own. ChatGPT charges you to think slower, he pays nothing and thinks alone, he made it himself.

Defileo🔮

153,466 görüntüleme • 1 ay önce

🚨 Alibaba just open sourced a GUI agent that lives inside your webpage and controls it with natural language. It's called Page Agent and it's not a browser extension. It's pure JavaScript no Python, no Puppeteer, no headless browser, no screenshots. Just one script tag and your web app understands natural language. Here's what it actually does: → Embed it with a single tag or npm install → Control any web interface with plain English commands → Text-based DOM manipulation no OCR, no vision models needed → Bring your own LLM (GPT, Claude, Qwen, anything) → Ships a built-in UI with human-in-the-loop support → Turn 20-click ERP/CRM workflows into one sentence → Optional Chrome extension for multi-tab agent tasks → Works on any web app SaaS, admin panels, internal tools Companies are charging $30/month for AI copilots built on this exact idea. This is 3 lines of code. Your users. Your interface. The AI copilot layer for every web app just got open sourced. 1.6K stars. 100% Open Source. (Link in the comments)

🚨 Alibaba just open sourced a GUI agent that lives inside your webpage and controls it with natural language. It's called Page Agent and it's not a browser extension. It's pure JavaScript no Python, no Puppeteer, no headless browser, no screenshots. Just one script tag and your web app understands natural language. Here's what it actually does: → Embed it with a single tag or npm install → Control any web interface with plain English commands → Text-based DOM manipulation no OCR, no vision models needed → Bring your own LLM (GPT, Claude, Qwen, anything) → Ships a built-in UI with human-in-the-loop support → Turn 20-click ERP/CRM workflows into one sentence → Optional Chrome extension for multi-tab agent tasks → Works on any web app SaaS, admin panels, internal tools Companies are charging $30/month for AI copilots built on this exact idea. This is 3 lines of code. Your users. Your interface. The AI copilot layer for every web app just got open sourced. 1.6K stars. 100% Open Source. (Link in the comments)

Ihtesham Ali

134,969 görüntüleme • 3 ay önce

this is the worst local AI will ever be. tomorrow it gets faster. next month the models get smarter. next year your GPU runs what a data center runs today. Qwen3.5-35B-A3B on a single 3090. told it to visualize its own expert routing. 256 experts, 8 active per token, rendered in 3D on the same GPU running inference. no API key. no subscription. no permission needed. closed AI isn't losing ground. it's losing the argument.

this is the worst local AI will ever be. tomorrow it gets faster. next month the models get smarter. next year your GPU runs what a data center runs today. Qwen3.5-35B-A3B on a single 3090. told it to visualize its own expert routing. 256 experts, 8 active per token, rendered in 3D on the same GPU running inference. no API key. no subscription. no permission needed. closed AI isn't losing ground. it's losing the argument.

Sudo su

106,710 görüntüleme • 3 ay önce

Introducing Pods Hyperspace Pods lets a small group of people - a family, a startup, a few friends, to pool their laptops and desktops into one AI cluster. Everyone installs the CLI, someone creates a pod, shares an invite link, and the machines form a mesh. Models like Qwen 3.5 32B or GLM-5 Turbo that need more memory than any single laptop has get automatically sharded across the group's devices - layers split proportionally, inference pipelined through the ring. From the outside it looks like one OpenAI-compatible API endpoint with a pk_* key that drops straight into your AI tools and products. No configuration beyond pasting the key and changing the base URL. A team of five paying for cloud AI burns $500–2,000 a month on API calls. The same team's existing machines can serve Qwen 3.5 (competitive on SWE-bench) and GLM-5 Turbo (#1 on BrowseComp for tool-calling and web research) for free - the hardware is already on their desks. When a query genuinely needs a frontier model nobody has locally, the pod falls back to cloud at wholesale rates from a shared treasury. But for the daily work - code reviews, refactors, research, drafting - local models handle it and nobody gets billed. And when it is idle, you can rent out your pod on the compute marketplace, with fine-grained permissions for access management. There's no central server involved in inference. Prompts go from your machine to your pod members' machines and back: all of this enabled by the fully peer-to-peer Hyperspace network. Pod state - who's a member, which API keys are valid, how much treasury is left - is replicated across members with consensus, so the whole thing works on a local network. Members behind home routers don't need port forwarding either. The practical setup for most pods is three models covering different jobs: Qwen 3.5 32B for code and reasoning, GLM-5 Turbo for browsing and research, Gemma 4 for fast lightweight tasks. All running on hardware you already own. Pods ship today in Hyperspace v5.19. Model sharding, API keys, treasury, and Raft coordinator are all live. What Makes This Different - No middleman. Your prompts travel from your IDE to your pod members' hardware and back. There is no server in between reading your data. - No vendor lock-in. Pod membership, API keys, and treasury are replicated across your own machines using Raft consensus. If the internet goes down, your local network keeps working. There is no database in someone else's cloud that your pod depends on. - Automatic sharding. You don't configure layer ranges or calculate VRAM budgets. Tell the pod which model you want. It figures out how to split it across whatever hardware is online. - Real NAT traversal. Your friend behind a home router with a dynamic IP? Works. No VPN, no Tailscale, no port forwarding. The nodes handle it. - Free when local. This is the part that matters most. Cloud AI bills scale with usage. Pod inference on local hardware scales with nothing. The marginal cost of your 10,000th prompt is the electricity your laptop was already using. Coming soon: - Pod federation: pods form alliances with other pods. - Marketplace: pods with spare capacity can sell inference to other pods.

Introducing Pods Hyperspace Pods lets a small group of people - a family, a startup, a few friends, to pool their laptops and desktops into one AI cluster. Everyone installs the CLI, someone creates a pod, shares an invite link, and the machines form a mesh. Models like Qwen 3.5 32B or GLM-5 Turbo that need more memory than any single laptop has get automatically sharded across the group's devices - layers split proportionally, inference pipelined through the ring. From the outside it looks like one OpenAI-compatible API endpoint with a pk_* key that drops straight into your AI tools and products. No configuration beyond pasting the key and changing the base URL. A team of five paying for cloud AI burns $500–2,000 a month on API calls. The same team's existing machines can serve Qwen 3.5 (competitive on SWE-bench) and GLM-5 Turbo (#1 on BrowseComp for tool-calling and web research) for free - the hardware is already on their desks. When a query genuinely needs a frontier model nobody has locally, the pod falls back to cloud at wholesale rates from a shared treasury. But for the daily work - code reviews, refactors, research, drafting - local models handle it and nobody gets billed. And when it is idle, you can rent out your pod on the compute marketplace, with fine-grained permissions for access management. There's no central server involved in inference. Prompts go from your machine to your pod members' machines and back: all of this enabled by the fully peer-to-peer Hyperspace network. Pod state - who's a member, which API keys are valid, how much treasury is left - is replicated across members with consensus, so the whole thing works on a local network. Members behind home routers don't need port forwarding either. The practical setup for most pods is three models covering different jobs: Qwen 3.5 32B for code and reasoning, GLM-5 Turbo for browsing and research, Gemma 4 for fast lightweight tasks. All running on hardware you already own. Pods ship today in Hyperspace v5.19. Model sharding, API keys, treasury, and Raft coordinator are all live. What Makes This Different - No middleman. Your prompts travel from your IDE to your pod members' hardware and back. There is no server in between reading your data. - No vendor lock-in. Pod membership, API keys, and treasury are replicated across your own machines using Raft consensus. If the internet goes down, your local network keeps working. There is no database in someone else's cloud that your pod depends on. - Automatic sharding. You don't configure layer ranges or calculate VRAM budgets. Tell the pod which model you want. It figures out how to split it across whatever hardware is online. - Real NAT traversal. Your friend behind a home router with a dynamic IP? Works. No VPN, no Tailscale, no port forwarding. The nodes handle it. - Free when local. This is the part that matters most. Cloud AI bills scale with usage. Pod inference on local hardware scales with nothing. The marginal cost of your 10,000th prompt is the electricity your laptop was already using. Coming soon: - Pod federation: pods form alliances with other pods. - Marketplace: pods with spare capacity can sell inference to other pods.

Varun

305,210 görüntüleme • 2 ay önce

Just dropped on HF — NeuTTS Air Next-gen on-device TTS that matches cloud-level quality while staying fully open source. > Real-time speech synthesis on CPU/GPU > 3-second voice cloning, no cloud or data upload > Compact: under 200 MB, runs on mobile and edge devices > Multilingual and expressive > Developed by Neuphonic , optimized for speed and fidelity

Just dropped on HF — NeuTTS Air Next-gen on-device TTS that matches cloud-level quality while staying fully open source. > Real-time speech synthesis on CPU/GPU > 3-second voice cloning, no cloud or data upload > Compact: under 200 MB, runs on mobile and edge devices > Multilingual and expressive > Developed by Neuphonic , optimized for speed and fidelity

steven

72,273 görüntüleme • 8 ay önce

🔥 BREAKING: Open source just leveled up AI agents Eigent gives you a fully local, customizable AI workforce....built to run on your laptop. → No vendor lock-in → No cloud dependency → 100% open source Just fast, private, parallel agents you control (Here's how):👇

🔥 BREAKING: Open source just leveled up AI agents Eigent gives you a fully local, customizable AI workforce....built to run on your laptop. → No vendor lock-in → No cloud dependency → 100% open source Just fast, private, parallel agents you control (Here's how):👇

Shruti

63,497 görüntüleme • 10 ay önce

JENSEN HUANG UNVEILED A BOARD THAT RUNS 1 TRILLION PARAMETER AI MODELS. THE $249 NVIDIA BOX UNDER YOUR DESK KILLS A $200/MONTH AI BILL FOR $5 IN ELECTRICITY jensen held it up on stage with one hand and called it the architecture that runs the future of ai. that same technology now ships in a $249 box smaller than your wallet the jetson orin nano super pulls 7-25 watts and does 67 trillion ai operations per second. llama 3, mistral and deepseek run locally with no api fees and no data leaving your machine most developers pay $2,400 a year across chatgpt, openai api, claude pro and cursor. the jetson costs $314 in year one and $60 a year after. 2 year savings hit $4,431 install ollama with one command, change one line of code to point at localhost, and every tool built for openai works identically. zero rewrites, zero rate limits cloud subscriptions keep getting more expensive and rate limits keep getting tighter. the people who own the box in 2026 are going to look very far ahead in 2028 bookmark this and read the article below

JENSEN HUANG UNVEILED A BOARD THAT RUNS 1 TRILLION PARAMETER AI MODELS. THE $249 NVIDIA BOX UNDER YOUR DESK KILLS A $200/MONTH AI BILL FOR $5 IN ELECTRICITY jensen held it up on stage with one hand and called it the architecture that runs the future of ai. that same technology now ships in a $249 box smaller than your wallet the jetson orin nano super pulls 7-25 watts and does 67 trillion ai operations per second. llama 3, mistral and deepseek run locally with no api fees and no data leaving your machine most developers pay $2,400 a year across chatgpt, openai api, claude pro and cursor. the jetson costs $314 in year one and $60 a year after. 2 year savings hit $4,431 install ollama with one command, change one line of code to point at localhost, and every tool built for openai works identically. zero rewrites, zero rate limits cloud subscriptions keep getting more expensive and rate limits keep getting tighter. the people who own the box in 2026 are going to look very far ahead in 2028 bookmark this and read the article below

starmex

54,309 görüntüleme • 20 gün önce

This Chinese developer launched Llama 70B locally on a MacBook on a plane and for a full 11 hours without internet ran client projects. He was sitting by the window on a transatlantic flight with a MacBook Pro M4 with 64 GB of memory. WiFi on board cost $25 for the flight. He declined. No cloud API, no connection to Anthropic or OpenAI servers, no internet at all. Just a local Llama 3.3 70B on bf16 and his own orchestrator script. The model runs through llama.cpp. Generation speed, 71 tokens per second. Context around 60,000 tokens. Memory usage, 48.6 GiB out of 64. Battery at takeoff, 3 hours 21 minutes. And he gave the orchestrator this system prompt before takeoff: "You are an offline orchestrator running on a single MacBook. There is no network. The only resources you have are local files in /Users/dev/work, the Llama 70B inference server at localhost:8080, and a battery budget of 3 hours 21 minutes. Process the queue at /Users/dev/work/queue.jsonl (one client task per line). For each task: draft → run local evals → save artefact to /Users/dev/work/done/. Save context checkpoints every 12 tasks so you can resume after a battery swap. Stop only on empty queue or when battery drops below 5%." So the system knows exactly what resources it is running on. It knows it has no connection to the outside world for the next 11 hours. It knows it has finite memory and a finite battery. It knows the human will not intervene until the plane lands. The system runs in 1 loop. Takes a task from the queue, runs it through inference, saves the artifact, writes a checkpoint. Task after task, just like that. And only when the battery drops below 5% does the orchestrator automatically pause, waits for the laptop to switch to the backup power bank, and continues from the last checkpoint. Here is what the system actually writes in his log during the flight: "saved context checkpoint 8 of 12 (pos_min = 488, pos_max = 50118, size = 62.813 MiB)" "restored context checkpoint (pos_min = 488, pos_max = 50118)" "prompt processing progress: n_tokens = 50 / 60 818" "task 37016 done | tps = 71 s tokens text → /Users/dev/work/done/proposal_westside.md" Outside the window, clouds, blue sky, and no WiFi. On the tray, 1 MacBook, an open terminal on 2 screens, and an inference server on localhost. From what I have observed, this is the cleanest offline AI workflow I have seen in the past year: 11 hours of flight, $0 for WiFi, and the entire client queue closed before landing.

This Chinese developer launched Llama 70B locally on a MacBook on a plane and for a full 11 hours without internet ran client projects. He was sitting by the window on a transatlantic flight with a MacBook Pro M4 with 64 GB of memory. WiFi on board cost $25 for the flight. He declined. No cloud API, no connection to Anthropic or OpenAI servers, no internet at all. Just a local Llama 3.3 70B on bf16 and his own orchestrator script. The model runs through llama.cpp. Generation speed, 71 tokens per second. Context around 60,000 tokens. Memory usage, 48.6 GiB out of 64. Battery at takeoff, 3 hours 21 minutes. And he gave the orchestrator this system prompt before takeoff: "You are an offline orchestrator running on a single MacBook. There is no network. The only resources you have are local files in /Users/dev/work, the Llama 70B inference server at localhost:8080, and a battery budget of 3 hours 21 minutes. Process the queue at /Users/dev/work/queue.jsonl (one client task per line). For each task: draft → run local evals → save artefact to /Users/dev/work/done/. Save context checkpoints every 12 tasks so you can resume after a battery swap. Stop only on empty queue or when battery drops below 5%." So the system knows exactly what resources it is running on. It knows it has no connection to the outside world for the next 11 hours. It knows it has finite memory and a finite battery. It knows the human will not intervene until the plane lands. The system runs in 1 loop. Takes a task from the queue, runs it through inference, saves the artifact, writes a checkpoint. Task after task, just like that. And only when the battery drops below 5% does the orchestrator automatically pause, waits for the laptop to switch to the backup power bank, and continues from the last checkpoint. Here is what the system actually writes in his log during the flight: "saved context checkpoint 8 of 12 (pos_min = 488, pos_max = 50118, size = 62.813 MiB)" "restored context checkpoint (pos_min = 488, pos_max = 50118)" "prompt processing progress: n_tokens = 50 / 60 818" "task 37016 done | tps = 71 s tokens text → /Users/dev/work/done/proposal_westside.md" Outside the window, clouds, blue sky, and no WiFi. On the tray, 1 MacBook, an open terminal on 2 screens, and an inference server on localhost. From what I have observed, this is the cleanest offline AI workflow I have seen in the past year: 11 hours of flight, $0 for WiFi, and the entire client queue closed before landing.

Blaze

1,824,930 görüntüleme • 1 ay önce

BlackBird now runs on 8GB RAM Macs. No GPU. No cloud. Just fast, private AI agents - right on your MacBook Air. We optimized memory, speed, and thermal performance so anyone can build with AI. Try it: Next Stop: Windows Beta Drops This Week! DM Me if you want to try it. #OnDeviceAI #BlackBird #AIforEveryone #macOS

BlackBird now runs on 8GB RAM Macs. No GPU. No cloud. Just fast, private AI agents - right on your MacBook Air. We optimized memory, speed, and thermal performance so anyone can build with AI. Try it: Next Stop: Windows Beta Drops This Week! DM Me if you want to try it. #OnDeviceAI #BlackBird #AIforEveryone #macOS

Hina Dixit

1,233,525 görüntüleme • 1 yıl önce

NVIDIA just made paying for AI feel optional. Open model, a million tokens of context, free tier with no per-token cost, runs on your own hardware. Entire codebases, whole data rooms, a year of chat logs, all swallowed in one prompt. No chunking, no RAG, no rate limit theater. The closed-AI premium has 90 days to defend itself. Bookmark this and come back. Open beat closed. Again.

NVIDIA just made paying for AI feel optional. Open model, a million tokens of context, free tier with no per-token cost, runs on your own hardware. Entire codebases, whole data rooms, a year of chat logs, all swallowed in one prompt. No chunking, no RAG, no rate limit theater. The closed-AI premium has 90 days to defend itself. Bookmark this and come back. Open beat closed. Again.

shmidt

294,327 görüntüleme • 15 gün önce

Today we’re open-sourcing Stable Audio Open Small, a 341M-parameter text-to-audio model optimized to run entirely on Arm CPUs. This means 99% of smartphones can now generate music-production samples in seconds, right on-device with no internet required. Built for fast, on-the-go creation, it turns your next quick idea into up to 11 seconds of audio. Generate drum loops, foley, riffs, and textures right where you are. No cords 🔌 just chords 🎹 You can learn more here:

Today we’re open-sourcing Stable Audio Open Small, a 341M-parameter text-to-audio model optimized to run entirely on Arm CPUs. This means 99% of smartphones can now generate music-production samples in seconds, right on-device with no internet required. Built for fast, on-the-go creation, it turns your next quick idea into up to 11 seconds of audio. Generate drum loops, foley, riffs, and textures right where you are. No cords 🔌 just chords 🎹 You can learn more here:

Stability AI

94,773 görüntüleme • 1 yıl önce

The first phone where your AI never leaves your device. No cloud processing. No data harvesting. Complete AI sovereignty. Built on Galaxy S25 Edge hardware. Earn rewards through the Gaia network. 1,000 units now available. Additional releases planned.

The first phone where your AI never leaves your device. No cloud processing. No data harvesting. Complete AI sovereignty. Built on Galaxy S25 Edge hardware. Earn rewards through the Gaia network. 1,000 units now available. Additional releases planned.

Gaia 🌱

157,416 görüntüleme • 9 ay önce

Llama 3.2 is the latest open-source AI model from Meta, released only a few hours ago. Here is the 3B parameter model running on Akash Chat at 165 tokens/second, powered by NVIDIA A100s on Akash. Try Llama 3.2 for free, no sign-in required:

Llama 3.2 is the latest open-source AI model from Meta, released only a few hours ago. Here is the 3B parameter model running on Akash Chat at 165 tokens/second, powered by NVIDIA A100s on Akash. Try Llama 3.2 for free, no sign-in required:

Akash Network

37,087 görüntüleme • 1 yıl önce

Run Gemma 4 26B MoE on 8GB VRAM with 250k context at 20+ tokens/sec If you own any 8GB VRAM graphics card, stop what you are doing. Local AI just had its absolute "Holy Shit" moment for budget hardware. Yesterday, I benchmarked Unsloth Gemma 4 12B Q4_K_XL on an 8GB card. The community went wild but immediately demanded more: "Can we run a 25B+ model on budget GPUs?" Today, I’m delivering exactly that. I am running a massive 26B parameter Mixture of Experts (MoE) model locally on a standard 8GB VRAM setup with 250k full native context!. If you own an RTX 3060, 3070, 4060, or any budget GPU with 8GB of VRAM, the local AI paradigm has completely changed. The performance metrics are astonishing: - 20 tokens/sec flat decode throughput. - Stable, flat decode speed even with massive prompts. - I threw a 60k token prompt at it, and it still clocked in at 20 TPS without dropping a single frame. # What about prefill? Yes, Time To First Token (TTFT) is slightly high when swallowing massive contexts. But with a solid 200 tokens/sec prefill speed, the wait is barely noticeable and highly usable. And this is running completely without Multi Token Prediction (MTP) active. How is this possible? It’s the magic of Google's new QAT (Quantization Aware Training) quants for Gemma 4. The model weight file (unsloth gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf) is only 13.2 GB, making it the ultimate local powerhouse. # The Test Setup: CPU: Intel Core i7 RAM: 16GB System RAM GPU: NVIDIA GeForce RTX 4060 Laptop GPU (8GB VRAM) # The Secret Sauce (The -cmoe Flag) To make this work properly on any 8GB card, you must use the -cmoe (CPU MoE) flag in llama.cpp. This flag isolates the heavy MoE expert weights directly to system memory (CPU/RAM) while letting your GPU focus strictly on the Attention layers and the KV Cache. It prevents VRAM spillage and holds the throughput rock solid. # The flags: -m "gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf" -cmoe -c 248000 -v Once running, just open the UI on localhost and toggle the new reasoning lightbulb icon in the text input box to watch the model perform multi step thinking. Are you still running smaller models, or are you ready to scale up your budget local setups? Let's discuss in the replies

Run Gemma 4 26B MoE on 8GB VRAM with 250k context at 20+ tokens/sec If you own any 8GB VRAM graphics card, stop what you are doing. Local AI just had its absolute "Holy Shit" moment for budget hardware. Yesterday, I benchmarked Unsloth Gemma 4 12B Q4_K_XL on an 8GB card. The community went wild but immediately demanded more: "Can we run a 25B+ model on budget GPUs?" Today, I’m delivering exactly that. I am running a massive 26B parameter Mixture of Experts (MoE) model locally on a standard 8GB VRAM setup with 250k full native context!. If you own an RTX 3060, 3070, 4060, or any budget GPU with 8GB of VRAM, the local AI paradigm has completely changed. The performance metrics are astonishing: - 20 tokens/sec flat decode throughput. - Stable, flat decode speed even with massive prompts. - I threw a 60k token prompt at it, and it still clocked in at 20 TPS without dropping a single frame. # What about prefill? Yes, Time To First Token (TTFT) is slightly high when swallowing massive contexts. But with a solid 200 tokens/sec prefill speed, the wait is barely noticeable and highly usable. And this is running completely without Multi Token Prediction (MTP) active. How is this possible? It’s the magic of Google's new QAT (Quantization Aware Training) quants for Gemma 4. The model weight file (unsloth gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf) is only 13.2 GB, making it the ultimate local powerhouse. # The Test Setup: CPU: Intel Core i7 RAM: 16GB System RAM GPU: NVIDIA GeForce RTX 4060 Laptop GPU (8GB VRAM) # The Secret Sauce (The -cmoe Flag) To make this work properly on any 8GB card, you must use the -cmoe (CPU MoE) flag in llama.cpp. This flag isolates the heavy MoE expert weights directly to system memory (CPU/RAM) while letting your GPU focus strictly on the Attention layers and the KV Cache. It prevents VRAM spillage and holds the throughput rock solid. # The flags: -m "gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf" -cmoe -c 248000 -v Once running, just open the UI on localhost and toggle the new reasoning lightbulb icon in the text input box to watch the model perform multi step thinking. Are you still running smaller models, or are you ready to scale up your budget local setups? Let's discuss in the replies

Alok

289,653 görüntüleme • 14 gün önce

Meet Stable Audio 3.0, the open-weight model family built for artistic experimentation. This is our open invitation to experiment with generative audio. We believe the best innovations are still waiting to be built. The 4-1-1 on 3.0: 📣 You own your outputs, and can distribute and commercialize them under the Stability AI Community License (up to $1 million in revenue). 🎵 New and improved capabilities include variable-length generation up to six minutes, and full song composition on portable devices, no GPU required. ✅ Trained on a fully licensed dataset. 🎨 You can customize the models on your own library with support for LoRa training, which we’ve documented for the first time. More on the models 👇

Meet Stable Audio 3.0, the open-weight model family built for artistic experimentation. This is our open invitation to experiment with generative audio. We believe the best innovations are still waiting to be built. The 4-1-1 on 3.0: 📣 You own your outputs, and can distribute and commercialize them under the Stability AI Community License (up to $1 million in revenue). 🎵 New and improved capabilities include variable-length generation up to six minutes, and full song composition on portable devices, no GPU required. ✅ Trained on a fully licensed dataset. 🎨 You can customize the models on your own library with support for LoRa training, which we’ve documented for the first time. More on the models 👇

Stability AI

154,029 görüntüleme • 1 ay önce

Meet #DBRX: a general-purpose LLM that sets a new standard for efficient open source models. Use the DBRX model in your RAG apps or use the DBRX design to build your own custom LLMs and improve the quality of your GenAI applications.

Meet #DBRX: a general-purpose LLM that sets a new standard for efficient open source models. Use the DBRX model in your RAG apps or use the DBRX design to build your own custom LLMs and improve the quality of your GenAI applications.

Databricks

327,704 görüntüleme • 2 yıl önce

here's how the whole thing works. claude code doesn't care what's behind the API. it just sends requests and expects responses. so i pointed it at my own machine instead of anthropic's servers. llama-server runs the model locally. LiteLLM sits in between and translates the API format. claude code thinks it's talking to claude. it's talking to qwen on localhost. the setup: 2x 3090s, 38 layers on GPU, 10 on CPU. 128K context window. generation is only 7 tok/s but the tradeoff is worth it. 128K means the agent can hold an entire project in memory without losing context midtask. claude code alone loads a 17.5K token system prompt on every request. tool definitions, safety rules, agent behavior. that's your baseline before you even say hello. pushed as far as i could tonight. what surprised me most wasn't the speed. it was the iteration quality. first prompt gave me a working particle sim. second prompt, the model read its own 564 lines, understood the architecture, and added trails, explosions, gravity wells, bloom effects. no handholding. 4bit quantized. 45GB on two consumer cards. running a full coding agent autonomously. detailed article coming. full benchmarks, hardware breakdowns, engine debugging, code quality. everything from setup to what broke and why.

here's how the whole thing works. claude code doesn't care what's behind the API. it just sends requests and expects responses. so i pointed it at my own machine instead of anthropic's servers. llama-server runs the model locally. LiteLLM sits in between and translates the API format. claude code thinks it's talking to claude. it's talking to qwen on localhost. the setup: 2x 3090s, 38 layers on GPU, 10 on CPU. 128K context window. generation is only 7 tok/s but the tradeoff is worth it. 128K means the agent can hold an entire project in memory without losing context midtask. claude code alone loads a 17.5K token system prompt on every request. tool definitions, safety rules, agent behavior. that's your baseline before you even say hello. pushed as far as i could tonight. what surprised me most wasn't the speed. it was the iteration quality. first prompt gave me a working particle sim. second prompt, the model read its own 564 lines, understood the architecture, and added trails, explosions, gravity wells, bloom effects. no handholding. 4bit quantized. 45GB on two consumer cards. running a full coding agent autonomously. detailed article coming. full benchmarks, hardware breakdowns, engine debugging, code quality. everything from setup to what broke and why.

Sudo su

37,580 görüntüleme • 4 ay önce