Microsoft Research Releases BitNet b1.58 2B4T: A 1‑Bit Native Large Language Model with Ultra‑Low Memory and Energy Consumption
Microsoft Research introduced BitNet b1.58 2B4T, a native 1‑bit large language model with 2 billion parameters trained on 4 trillion tokens, achieving only 0.4 GB non‑embedding memory, 0.028 J decoding energy, and 29 ms CPU latency while matching full‑precision performance.
Microsoft Research announced an open‑source, native 1‑bit large language model (LLM) called BitNet b1.58 2B4T. The model has 2 billion parameters, was trained on a 4 trillion‑token corpus, and is released with open‑source inference implementations for both GPU and CPU.
Memory usage: The non‑embedding layers occupy only 0.4 GB, far less than full‑precision models such as Qwen2.5 1.5B (2.6 GB) or even its INT4‑quantized version (0.7 GB).
Energy consumption: Estimated decoding energy is 0.028 J, significantly lower than comparable models.
Decoding latency: Average decoding latency on a CPU is 29 ms, again well below other models.
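The 0.4 GB figure follows almost directly from the parameter count and bit width. A back-of-envelope check (a sketch; it assumes non-embedding weights dominate and are stored at roughly 1.58 bits each, while the released kernels may pack them differently):

```python
def ternary_weight_gb(num_params: int, bits_per_weight: float = 1.58) -> float:
    """Approximate storage for num_params ternary weights, in gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9

# ~2 billion non-embedding parameters at 1.58 bits per weight:
approx_gb = ternary_weight_gb(2_000_000_000)  # ~0.395 GB, matching the ~0.4 GB reported
```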
The model has 2 billion parameters and was trained on a dataset of 4 trillion tokens. In benchmark tests it delivers performance comparable to leading full‑precision models of similar scale, such as LLaMA 3.2 1B, Qwen2.5 1.5B, and Gemma‑3 1B.
Unlike existing 1‑bit models that are either post‑training quantized from full‑precision models (with large performance loss) or small native 1‑bit models, BitNet b1.58 2B4T is trained from scratch. Its core innovation is the replacement of standard full‑precision linear layers with custom BitLinear layers.
The BitLinear layers include:
Weight quantization: Weights are quantized to 1.58 bits using an absolute‑mean (absmean) scheme, mapping them to three values {‑1, 0, +1}.
Activation quantization: Activations in linear projections are quantized to 8‑bit integers using an absolute‑max (absmax) strategy applied per token.
Normalization: Sub‑layer normalization (subln) is introduced to improve training stability.
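The two quantization schemes above can be sketched in a few lines of NumPy. This is an illustrative sketch of the absmean and per-token absmax rules as described, not the model's actual kernel code; the epsilon terms are an assumption added to avoid division by zero:

```python
import numpy as np

def absmean_ternary(w: np.ndarray):
    """Quantize weights to {-1, 0, +1} with an absolute-mean (absmean) scale."""
    gamma = np.abs(w).mean() + 1e-8          # absmean scale
    w_q = np.clip(np.round(w / gamma), -1, 1)
    return w_q.astype(np.int8), gamma        # dequantize as w_q * gamma

def absmax_int8(x: np.ndarray):
    """Quantize activations to 8-bit integers with a per-token (per-row) absmax scale."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0 + 1e-8
    x_q = np.clip(np.round(x / scale), -128, 127)
    return x_q.astype(np.int8), scale        # dequantize as x_q * scale
```

With ternary weights, the matrix multiply inside a BitLinear layer reduces to additions and subtractions of int8 activations, which is what enables the low-energy CPU kernels.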
Additional techniques integrated into the model include ReLU2 activation in the feed‑forward network, RoPE positional encoding, and the removal of bias terms from all linear and normalization layers.
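Of these, the squared-ReLU activation is simple to illustrate. A minimal sketch of a bias-free feed-forward block using it (the `ffn` helper and its weight names are hypothetical; the real model additionally applies BitLinear quantization to both projections):

```python
import numpy as np

def relu2(x: np.ndarray) -> np.ndarray:
    """Squared ReLU: max(x, 0) ** 2."""
    return np.square(np.maximum(x, 0.0))

def ffn(x: np.ndarray, w_up: np.ndarray, w_down: np.ndarray) -> np.ndarray:
    """Bias-free feed-forward block with squared-ReLU activation (sketch only)."""
    return relu2(x @ w_up) @ w_down
```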
The training process consists of three stages:
Pre‑training: A two‑stage learning‑rate schedule with weight decay on a mixed corpus of public text and code, providing broad world knowledge and basic language abilities.
Supervised fine‑tuning (SFT): Instruction‑following and dialogue datasets are used to enhance the model’s ability to follow commands and engage in conversational formats.
Direct Preference Optimization (DPO): Aligns the model with human preferences for usefulness and safety by directly optimizing on preference data, avoiding a separate reward model.
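The DPO objective in the final stage can be written per preference pair. A minimal sketch (illustrative only; `beta` is the standard DPO temperature hyperparameter, and the inputs are sequence log-probabilities under the policy and a frozen reference model, which is why no separate reward model is needed):

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) preference pair:
    -log sigmoid(beta * (policy log-ratio margin over the reference model))."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy matches the reference model the margin is zero and the loss is log 2; pushing probability mass toward chosen responses relative to the reference drives the loss down.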
Model weights have been publicly released on Hugging Face, quickly reaching the top spot on the platform’s trending list. The following links provide access to the technical report and model repository:
https://arxiv.org/pdf/2504.12285
https://hf-mirror.com/microsoft/bitnet-b1.58-2B-4T
BitNet b1.58 2B4T Technical Report