Microsoft researchers release bitnet.cpp, the official inference framework for 1-bit LLMs like BitNet b1.58. It has optimized kernels for fast, lossless inference on CPUs, achieving impressive speedups on ARM and x86 CPUs and significant energy reductions.
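
A minimal sketch of why ternary weights suit CPU inference, assuming weights restricted to -1, 0, and +1 and integer activations. It is not bitnet.cpp's kernel code, and the function names and sample values are hypothetical. With weights in that set, each dot product reduces to additions and subtractions, and the result matches an ordinary multiply-accumulate exactly, which is one reading of "lossless" here.

```python
def ternary_matvec(weights, activations):
    """Multiply a ternary weight matrix by a vector using only adds and subtracts."""
    out = []
    for row in weights:
        acc = 0
        for w, a in zip(row, activations):
            if w == 1:
                acc += a
            elif w == -1:
                acc -= a
            # w == 0 contributes nothing and is skipped
        out.append(acc)
    return out


def reference_matvec(weights, activations):
    """Ordinary multiply-accumulate, used only to check the result."""
    return [sum(w * a for w, a in zip(row, activations)) for row in weights]


# Hypothetical ternary weights and integer (e.g. int8-quantized) activations.
W = [[1, 0, -1, 1],
     [0, -1, -1, 0],
     [1, 1, 1, -1]]
x = [12, -7, 3, 5]

fast = ternary_matvec(W, x)
exact = reference_matvec(W, x)
print(fast, exact, fast == exact)
```

Real kernels pack several weights per byte and use SIMD instructions; this sketch only shows the arithmetic identity they rely on.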

Microsoft Research

554,490 subscribers

75,161 views • 1 year ago • via X (Twitter)

9 Comments

BensenHsu • 1 year ago

This study introduces bitnet.cpp, a software stack designed to enable fast and efficient inference of 1-bit large language models (LLMs), such as BitNet b1.58, on CPUs. The researchers aim to unlock the potential of 1-bit LLMs by developing optimized kernels that achieve significant speedups and reduce energy consumption compared to existing solutions. The results show that bitnet.cpp significantly outperforms the existing llama.cpp framework in both inference speed and energy consumption:
• On the Apple M2 Ultra, bitnet.cpp achieves speedups ranging from 1.37x to 5.07x, with larger models experiencing greater performance gains.
• On the Intel i7-13700H, bitnet.cpp achieves speedups ranging from 2.37x to 6.17x, with significant improvements for larger models.
• bitnet.cpp reduces energy consumption by 55.4% to 70.0% on the Apple M2 Ultra and 71.9% to 82.2% on the Intel i7-13700H, depending on the model size.
full paper:

Emily • 1 year ago

I hope we see real-world results like Llama 3.2 and other LLMs.

Sandro Hanea • 1 year ago

Cool work! Played a bit with it and indeed it is degrading the quality a bit, but there are definitely use cases for it. Also, worth mentioning that this builds on top of @ggerganov's llama.cpp and most of the inference is still using ggml.

Paul Calcraft • 1 year ago

Will you release your own 1.58 bitnet models?

Srini Gundelli • 1 year ago

🫶🏼🤯

ΜΛΛNΙ • 1 year ago

so fast, but sadly there's still degradation of quality... but it's so much better than, say, a 4-bit model of the same size, which is a huge leap. if there's a way to mitigate the degradation, you're golden.

تطوير الالعاب - Ludology • 1 year ago

Can this be ported to SNES? jk

🙂🙏 Özv. Dízelné Hadházy Aranka, 1.8T • 1 year ago

Ecosystem services include water, air, soil, energy, and biodiversity. Excellent essay!

Desmond • 1 year ago

In the implementation of bitnet.cpp’s TL2 Kernel, which compresses every three weights into a 5-bit index with a 1-bit sign, how does the LUT method handle potential collisions or overlapping index values during the computation phase, especially in scenarios involving high-dimensional matrices?
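
On the collision question, a minimal sketch under stated assumptions; it is not bitnet.cpp's actual kernel or memory layout, and every name in it is hypothetical. Three ternary weights have 3^3 = 27 combinations. If each triple and its negation are assumed to share one lookup-table entry, 14 entries remain (13 mirror pairs plus the all-zero triple), which fit in a 4-bit index, and one extra sign bit selects between the pair, for 5 bits in total. Under that scheme every triple maps to a distinct (sign, index) code, so there is nothing to collide, whatever the matrix dimensions.

```python
from itertools import product


def canonical(triple):
    """Return (sign, triple') where triple' has +1 as its first nonzero weight."""
    for w in triple:
        if w != 0:
            if w < 0:
                return -1, tuple(-v for v in triple)
            return 1, tuple(triple)
    return 1, tuple(triple)  # all-zero triple


# The 14 canonical triples; a list position is the 4-bit index.
CANONICAL = sorted({canonical(t)[1] for t in product((-1, 0, 1), repeat=3)})
INDEX_OF = {t: i for i, t in enumerate(CANONICAL)}


def encode(triple):
    """Encode three ternary weights as (sign_bit, index)."""
    sign, canon = canonical(tuple(triple))
    return (0 if sign > 0 else 1), INDEX_OF[canon]


def build_lut(acts):
    """Precompute the partial sum for each canonical triple and three activations."""
    a0, a1, a2 = acts
    return [w0 * a0 + w1 * a1 + w2 * a2 for (w0, w1, w2) in CANONICAL]


def lut_dot(weights, acts):
    """Dot product via table lookups; vector length must be a multiple of 3."""
    total = 0
    for i in range(0, len(weights), 3):
        # A real kernel would build this table once and reuse it across weight rows.
        lut = build_lut(acts[i:i + 3])
        sign_bit, idx = encode(weights[i:i + 3])
        total += -lut[idx] if sign_bit else lut[idx]
    return total


codes = {encode(t) for t in product((-1, 0, 1), repeat=3)}
print(len(CANONICAL), len(codes))
```

The printed counts show how many table entries the scheme needs and how many distinct codes the 27 triples receive.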

Related Videos

Today, AMD and Microsoft are announcing an expanded strategic partnership to deliver a full stack of AMD AI solutions for Microsoft Azure, spanning GPUs, CPUs, networking and software. At the center of this expansion, Microsoft will deploy AMD Helios to power frontier model AI inference for Microsoft, its AI customers and support Azure AI services. Read more on the news ahead of #AdvancingAI:

AMD

151,054 views • 6 days ago

✨ Introducing Gemma 3n, available in early preview today. The model uses a cutting-edge architecture optimized for mobile on-device usage. It brings multimodality, super fast inference, and more.

Google AI Developers

125,226 views • 1 year ago

Dylan Patel says GPUs are no longer the biggest bottleneck. According to Dylan Patel, now CPUs are the constraint. In the early AI era, CPUs were the laggers. You used them for storage, checkpointing, pre-processing, etc. (pretty light workloads). The models weren't agentic and couldn't go step by step. Just string in and string out (simple inference). Then OpenAI launched O1 preview in September '24, and RL training loops have since tightened every month:
- initially it was checking model output with regex
- then running classifiers
- followed by code unit tests + compilation
- and finally agentic flows calling databases & scientific simulations
The model outputs to an environment, gets verified, and trains on it. Coding agent revenue went from a couple billion to north of $10B in roughly 6 months. Something like Codex 5.4 can work agentically on its own for 6-7 hrs straight, doing all sorts of calls (databases, cron servers, scraping). That requires insane CPU capabilities. And over the last two quarters, the entire cloud market ran out of CPUs:
- GitHub has been really unstable lately
- Amazon's CPU server installations 3x'd year over year
- Microsoft sold all of its spare CPUs to Anthropic & OpenAI
Earlier, it was 100 megawatts of GPUs served by 1 megawatt of CPUs. Now that ratio is getting much closer for both RL training and agentic inference. There's simply no capacity anywhere, and it's causing massive instability.

Ivan Burazin

303,474 views • 3 months ago

Jesse Pollak on the ecosystem forming around tokenized inference "There's this really cool tokenized inference market that Venice has innovated, where they've taken inference and turned it into a token called DIEM where you can buy it and you get $1 of inference per day" "That innovation of tokenizing inference has unlocked a ton of innovation around it, where people are using that tokenized inference to do things" "There's other projects that let people sell their tokenized inference at a discount so other people can get cheaper inference and the whole system works more efficiently" "I feel like that's the first time we're really seeing the intersection of AI plus markets plus crypto and it's thanks to Venice and it's happening on Base"

Market Bubble

25,780 views • 6 days ago

Dylan Patel: “CPUs are sold out.” $INTC earnings and conference call confirmed the looming CPU shortage. There are three main drivers:
- Agentic systems, as CPUs do the planning and orchestration.
- Increasing complexity of and demand for reinforcement learning environments that run on CPUs.
- Increasing deployment of AI instilled apps. Models run on GPUs but apps themselves run on CPUs.
Demand for all these will grow secularly as AI systems will only get more agentic, RL environments will only get more complex, and deployment will only accelerate. As a result, GPU-to-CPU ratio in AI workloads will move from 1:8 to 1:2 or maybe even 1:1 in the future. CPU supercycle is about to start. $AMD $INTC $ARM

Oguz Erkan

93,419 views • 3 months ago

PowerInfer - a high-speed inference engine for deploying LLMs locally. Just came across this super interesting project on speeding up inference. It's not MoE but it's a simple approach that exploits the high locality in LLM inference to design a GPU-CPU hybrid inference engine. Hot-activated neurons are preloaded onto the GPU for fast access, while cold-activated neurons (the majority) are computed on the CPU. This approach significantly reduces GPU memory demands and CPU-GPU data transfer. It achieves an average token generation rate of 13.20 tokens/s, with a peak of 29.08 tokens/s, across various LLMs on a single NVIDIA RTX 4090 GPU. That is only 18% lower than the rate achieved by a top-tier server-grade A100 GPU. It also significantly outperforms llama.cpp by up to 11.69x while retaining model accuracy. There is a lot more innovation around inference that's coming fast. Really encouraged by the study on sparse computation to enhance the computational efficiency of LLMs. It's now possible to use PowerInfer with Llama 2 and Falcon 40B. Mistral-7B support is coming soon!

elvis

261,622 views • 2 years ago

Who owns/uses something like this currently? What do you use it for? I’m bullish on local inference and localmaxxing but prefer the laptop format

Sean Geng

72,216 views • 1 month ago

in the lectures below, i hold your hand through low-level LLM systems engineering. it includes everything up to TODAY!
1) pytorch tensors
2) large matmul on cpu vs gpu
3) JAX (and why xAI uses it instead of pytorch)
4) raw cuda kernels and global threading indexing
5) triton design philosophy and softmax example
6) HIP kernels
7) mapping out the ENTIRE ecosystem + differences between CUDA and ROCm/HIP (BLAS, FFT, DNN)
8) cutlass and cute-dsl
9) pretraining, finetuning, rl, unsloth, axolotl, megatron-lm, deepspeed, nanogpt, nanochat
10) training vs inference, inference serving problems, throughput vs latency vs concurrency scaling, vllm, sglang, tensorrt-llm, tensorrt, llama.cpp, exllamav2, exllamav3, benchmark comparisons
11) projects/companies using llms to generate SOTA cuda/triton kernels
12) luminal inference
13) mojo/modular/max

Elliot Arledge

57,855 views • 9 months ago

🚀 How should LLMs sample on hard reasoning problems during post-training and inference where direct rollouts rarely produce a correct answer? Best-of-N (e.g., GRPO) and tree search share two limitations: 🔻 Verification signals are sparse 🔻 Candidates stay within the model's own distribution We introduce BES: Bidirectional Evolutionary Search — a search framework that couples forward candidate evolution with backward goal decomposition. ✅ Works for both post-training and inference.

Guowei Xu

244,684 views • 1 month ago

Before Fable 5 was shut down, it pushed Gemma 4 to 255 tok/s on WebGPU. Some didn't believe it was real. Today we're releasing the demo and kernels it wrote for you to see yourself. Run it locally in your browser. Agentic kernel optimization is the future of on-device inference

Xenova

482,622 views • 1 month ago

"Inference, if you look at it as a market, will be much, much bigger than cloud computing was pre-ChatGPT." Lightspeed’s Bucky Moore says inference is an underrated investment category in AI, and expects the market to break up into large, specialized platforms for each modality: "The GPU supply crunch that we're seeing right now is largely, as Dylan Patel has said on the show before, due to the fact that not only these consumer products, but also B2B products like Claude Code and Codex are just really taking off and creating insane demand for inference." "We're talking hundreds of billions in spend every year. And if that's true, I think there will be very, very large inference platforms built in each modality." "So there will be an inference platform for real-time video models, there will be an inference platform for open-source and custom language models, there will be an inference platform built specifically for long-running agents." "So I think we're just going to see that industry, which today looks like one industry, break up into many because of how big it is and how much room for specialization there is."

"Inference, if you look at it as a market, will be much, much bigger than cloud computing was pre-ChatGPT." Lightspeed’s Bucky Moore says inference is an underrated investment category in AI, and expects the market to break up into large, specialized platforms for each modality: "The GPU supply crunch that we're seeing right now is largely, as Dylan Patel has said on the show before, due to the fact that not only these consumer products, but also B2B products like Claude Code and Codex are just really taking off and creating insane demand for inference." "We're talking hundreds of billions in spend every year. And if that's true, I think there will be very, very large inference platforms built in each modality." "So there will be an inference platform for real-time video models, there will be an inference platform for open-source and custom language models, there will be an inference platform built specifically for long-running agents." "So I think we're just going to see that industry, which today looks like one industry, break up into many because of how big it is and how much room for specialization there is."

TBPN

33,278 views • 4 months ago

Microsoft just made expensive GPUs useless. And it runs on your 5-year-old laptop. BitNet B1.58 crushes models 10x bigger using 96% less energy. Here's what changed:
→ Uses ternary weights (only -1, 0, +1) instead of full precision floats
→ 0.4GB memory usage (runs on a phone, no GPU needed)
→ 29 milliseconds per response (real-time fast)
→ Beats Llama 3 and Gemma on benchmarks while using almost no power
→ Free on HuggingFace, MIT license (download and run today)
No API costs. No cloud bills. No data leaving your machine. ChatGPT charges per token. This is free forever. Save this video. Want the SOP? DM me. 💬

Julian Goldie SEO

76,615 views • 6 months ago

We previously shared our research on Layer Skip, an end-to-end solution for accelerating LLMs from researchers at Meta FAIR. It achieves this by executing a subset of an LLM’s layers and utilizing subsequent layers for verification and correction. We’re now releasing inference code and fine-tuned checkpoints for this work. Model weights on Hugging Face ➡️ More details ➡️ We hope that releasing this work will open up new areas of experimentation and innovative new research in optimization and interpretability.

AI at Meta

156,598 views • 1 year ago

🚀 Excited to release LongLive 2.0! 🎬 An end-to-end infrastructure for long video generation, with FP4 and parallelism at the core of both training and inference. ⚡45.7 FPS generation speed on 5B model⚡ ✨ LongLive 2.0 supports real-video training, few-step distillation, multi-shot training/inference, sequence-parallel acceleration, NVFP4 KV cache, and async VAE decoding deployment. 🧩 To our knowledge, this is the first open-source 4-bit long video generation infra that covers both training and inference. 🙌 Welcome to check it out, try it, and share feedback! 🔗 Code: 📰 Paper: 🎥 Demo: #LongVideoGeneration #VideoGeneration #Realtime #AIInfra #EfficientAI #FP4 #Parallel #NVIDIA

Yukang Chen

59,030 views • 2 months ago

$NVDA CEO: “Demand for inference will go up by a billion times.” $AMD will be one of the biggest winners as it has the upper hand in inference thanks to larger memory bandwidth and capacity of its chips. Stock is still at 22x 2027 earnings. How can’t you be bullish on $AMD?

Oguz Erkan

282,229 views • 3 months ago

Microsoft sold every spare CPU it had to Anthropic and OpenAI. Amazon tripled its CPU buys year over year and still can't keep up. Two of AWS's biggest customers asked Andy Jassy if they could buy the entire 2026 production run of Graviton chips. He said no. The ratio inside an AI datacenter used to be 100 megawatts of GPUs to 1 megawatt of CPUs. CPUs handled storage, checkpointing, pre-processing. Light work. GPUs did the actual training and inference. Then OpenAI shipped o1-preview in September 2024. RL post-training went from "check the model output with a regex" to "run classifiers" to "compile the code and run the unit tests" to "spin up a sandbox, call three databases, run a physics simulation, verify the answer." Every rollout now needs a CPU-backed environment to verify against. Codex 5.4 runs agentically for 6-7 hours at a time. Each database call, each cron job, each scraped URL is CPU work. Coding agent revenue went from a couple billion to north of $10B in six months. That compute is sitting on CPUs. The CPU to GPU ratio is now approaching 1:1. The entire global cloud was built for 1:8. That's why GitHub has been unstable for weeks. Nvidia and Arm both announced they're entering the server CPU market in March. TSMC will only meet 80% of server CPU wafer demand this year. High-end server CPU prices are already up 50%. When the GPU king and the IP licensor both pivot to CPUs in the same month, the boring chip isn't boring anymore.

Aakash Gupta

290,767 views • 3 months ago

For the first time, the latest LLMs run on the Apple Neural Engine — and NexaSDK is the only framework that makes it possible, powered by the NexaML engine. Last year, our two co-founders were invited by Apple DMLI team (Data & Machine Learning Innovation) to share their research about on-device multimodal model for local AI agents. One of the big questions in the room was: “Can the newest LLMs actually run on ANE?” At the time, nobody had a clear path. Today, that path exists. NexaSDK now runs Granite-4.0 (IBM), Qwen3 (Qwen), Gemma3 (Google), and Parakeet-v3 (NVIDIA) fully on Apple’s NPU — unlocking low-power, always-on, fast inference across Mac and iPhone. A new wave of NPU-first local AI apps is coming to Apple devices. Start with one line of code on Mac. iOS SDK coming soon.

NEXA AI

30,213 views • 8 months ago

The era of Stable Intelligence is here 🤖 Tether’s QVAC Fabric just released the world’s first cross-platform 1-bit LLM LoRA fine-tuning framework. QVAC Fabric extends Microsoft's ultra-efficient BitNet architecture, allowing fine-tuning and inference of LLMs directly on your smartphone, with no NVIDIA GPU/CUDA lock-in or expensive server required. The Breakthrough:
- Total Sovereignty: LoRA fine-tune ultra-efficient models locally on any smartphone, including iPhones, Pixel phones, Samsung Galaxy phones and any desktop/laptop operating systems using Vulkan and Metal backends.
- Extreme Efficiency: 1-bit architecture uses up to 90% less memory and runs up to 11x faster than traditional models.
- Universal Access: What used to require a data center now runs on the chip in your pocket.
Own your intelligence. The era of stable, local AI is here. 📱🧠 Read the full details on Hugging Face and grab the binaries to build on your own hardware.

QVAC

108,569 views • 4 months ago