Загрузка видео...

Не удалось загрузить видео

На главную

Microsoft researchers release bitnet.cpp, the official inference framework for 1-bit LLMs like BitNet b1.58. It has optimized kernels for fast, lossless inference on CPUs, achieving impressive speedups on ARM and x86 CPUs and significant energy reductions.

75,133 просмотров • 1 год назад •via X (Twitter)

Комментарии: 9

Фото профиля BensenHsu
BensenHsu1 год назад

This study introduces bitnet.cpp, a software stack designed to enable fast and efficient inference of 1-bit large language models (LLMs), such as BitNet b1.58, on CPUs. The researchers aim to unlock the potential of 1-bit LLMs by developing optimized kernels that can achieve significant speedups and reduce energy consumption compared to existing solutions. The results show that bitnet.cpp significantly outperforms the existing llama.cpp framework in terms of both inference speed and energy consumption: • On the Apple M2 Ultra, bitnet.cpp achieves speedups ranging from 1.37x to 5.07x, with larger models experiencing greater performance gains. • On the Intel i7-13700H, bitnet.cpp achieves speedups ranging from 2.37x to 6.17%, with significant improvements for larger models. • bitnet.cpp reduces energy consumption by 55.4% to 70.0% on the Apple M2 Ultra and 71.9% to 82.2% on the Intel i7-13700H, depending on the model size. full paper:

Фото профиля Emily
Emily1 год назад

I hope we see actual world results like LIama 3.2 and other LLMs.

Фото профиля Sandro Hanea
Sandro Hanea1 год назад

Cool work! Played a bit with it and indeed it is degrading the quality a bit, but there are definitely usecases for it. Also, worth mentioning that this builds on top of @ggerganov 's llama.cpp and the most of the inference is still using ggml.

Фото профиля Paul Calcraft
Paul Calcraft1 год назад

Will you release your own 1.58 bitnet models?

Фото профиля Srini Gundelli
Srini Gundelli1 год назад

🫶🏼🤯

Фото профиля ΜΛΛNΙ
ΜΛΛNΙ1 год назад

so fast, but sadly there's still degradation of quality...but it's so much better than a say 4bit model of the same size, which is a huge leap. if there's a way to mediate the degradation, you're golden.

Фото профиля تطوير الالعاب - Ludology
تطوير الالعاب - Ludology1 год назад

Can this ported to snes ? , jk

Фото профиля 🙂🙏 Özv. Dízelné Hadházy Aranka, 1.8T
🙂🙏 Özv. Dízelné Hadházy Aranka, 1.8T1 год назад

Ecosystem services include water , air, soil, energy, and biodiversity. Ecosystem services also include water, air, soil, energy, and biodiversity. Ecosystem services also include water, air, soil, energy, and biodiversity. Excellent essay!

Фото профиля Desmond
Desmond1 год назад

In the implementation of bitnet.cpp’s TL2 Kernel, which compresses every three weights into a 5-bit index with a 1-bit sign, how does the LUT method handle potential collisions or overlapping index values during the computation phase, especially in scenarios involving high-dimensional matrices?

Похожие видео

Dylan Patel says GPUs are no longer the biggest bottleneck. According to Dylan Patel, now CPUs are the constraint. In the early AI era, CPUs were the laggers. You used them for storage, checkpointing, pre-processing, etc. (pretty light workloads) The models weren't agentic and couldn't go step by step. Just string in and string out (simple inference) Then OpenAI launched O1 preview in September '24, and RL training loops have since tightened every month. - initially it was checking model output with regex - then running classifiers - followed by code unit tests + compilation - and finally agentic flows calling databases & scientific simulations The model outputs to an environment, gets verified, and trains on it. Coding agent revenue went from a couple billion to north of $10B in roughly 6 months. Something like Codex 5.4 can work agentically on its own for 6-7 hrs straight - doing all sorts of calls (databases, cron servers, scraping) That requires insane CPU capabilities. And over the last two quarters, the entire cloud market ran out of CPUs. - GitHub has been really unstable lately - Amazon's CPU server installations 3x'd year over year - Microsoft sold all of its spare CPUs to Anthropic & OpenAI Earlier, it was 100 megawatts of GPUs served by 1 megawatt of CPUs. Now that ratio is getting much closer for both RL training and agentic inference. There's simply no capacity anywhere, and it's causing massive instability.

Ivan Burazin

303,023 просмотров • 1 месяц назад