Parameter-Efficient Fine-Tuning (PEFT) Methods for Large Language Models: LoRA, QLoRA, AdaLoRA, SoRA, and Training Acceleration with Unsloth
This article systematically analyzes popular parameter‑efficient fine‑tuning (PEFT) techniques for large language models—including Adapter Tuning, Prefix Tuning, LoRA, QLoRA, AdaLoRA, and SoRA—detailing their principles, implementation code, experimental results on NLU tasks, and practical acceleration using the Unsloth library.
Introduction
In 2023, large language models (LLMs) proliferated rapidly. 58.com TEG‑AI Lab built a domain‑specific LLM called ChatLing for real‑estate, recruitment, automotive, and yellow‑page services, achieving better performance than both open‑source and commercial general LLMs.
Parameter‑Efficient Fine‑Tuning (PEFT)
Fine‑tuning a pretrained model is cost‑effective compared to training from scratch, but full‑parameter fine‑tuning of massive LLMs is prohibitively expensive. PEFT methods keep the backbone frozen and only train a small additional module, dramatically reducing memory and compute requirements.
Adapter Tuning
Adapter Tuning inserts a lightweight Adapter module after each Transformer layer. During fine‑tuning only the Adapter parameters are updated while the original weights remain frozen. This reduces trainable parameters but adds extra depth, slightly increasing inference latency.
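As a concrete illustration, a bottleneck adapter can be sketched as follows. This is a minimal PyTorch sketch in the spirit of Houlsby et al.; the bottleneck size, GELU activation, and zero initialization of the up-projection are illustrative choices, not taken from the article.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual."""

    def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.GELU()
        # Zero-init the up-projection so the adapter starts as the identity
        # and the frozen backbone's behavior is preserved at step 0.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))
```

During fine-tuning, only these adapter parameters receive gradients; the surrounding Transformer weights stay frozen.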
Prefix Tuning
Prefix Tuning prepends a learnable continuous vector sequence (the prefix ) to the input. The prefix is optimized while the backbone stays frozen, allowing task‑specific adaptation with minimal additional storage. However, it increases the input length and may affect inference speed.
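The input-level idea can be sketched as follows. Note this is a simplification: the original method injects prefix key/value vectors into every attention layer, whereas this toy module only prepends trainable vectors to the token embeddings; all names are illustrative.

```python
import torch
import torch.nn as nn

class PrefixEmbedding(nn.Module):
    """Prepends a trainable continuous prefix to the token embeddings."""

    def __init__(self, prefix_len: int, hidden_dim: int):
        super().__init__()
        # The only trainable parameters: one vector per prefix position.
        self.prefix = nn.Parameter(torch.randn(prefix_len, hidden_dim) * 0.02)

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # token_embeds: (batch, seq_len, hidden_dim)
        batch = token_embeds.size(0)
        prefix = self.prefix.unsqueeze(0).expand(batch, -1, -1)
        # Output length grows by prefix_len, which is the latency cost noted above.
        return torch.cat([prefix, token_embeds], dim=1)
```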
Low‑Rank Adaptation (LoRA)
Principle
LoRA adds two low‑rank matrices A (r×d) and B (d×r) alongside the frozen weight W. The effective weight becomes W + BA, reducing trainable parameters from d² to 2·d·r (r ≪ d). A uses Gaussian initialization and B zero initialization, so ΔW = BA starts at zero and training begins exactly from the pretrained model. After training, BA can be merged into W, leaving the model architecture unchanged at inference.
Key advantages:
Trainable parameters drop from d² to 2·d·r.
No extra computation at inference time.
Multiple LoRA adapters can be stored and loaded independently.
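A minimal LoRA wrapper around a frozen nn.Linear, following the initialization described above (class name and hyperparameter defaults are illustrative, not from the article):

```python
import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W + (alpha/r) * B @ A."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # backbone stays frozen
        d_out, d_in = base.weight.shape
        # A: Gaussian (Kaiming) init, B: zeros, so BA = 0 at the start.
        self.lora_A = nn.Parameter(torch.empty(r, d_in))
        self.lora_B = nn.Parameter(torch.zeros(d_out, r))
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
```

Because only lora_A and lora_B are trainable, one backbone can serve many tasks by swapping these small matrices.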
Experiments
Fine‑tuning on two NLU business datasets showed that LoRA (r=2 or r=8) achieved performance comparable to full‑parameter fine‑tuning and significantly better than GPT‑4 zero‑shot.
| Method | Precision | Recall | F1‑Score |
|---|---|---|---|
| Full‑parameter | 89.96% | 85.53% | 87.68% |
| LoRA (r=2) | 89.42% | 85.85% | 86.23% |
| LoRA (r=8) | 89.54% | 86.32% | 86.44% |
Quantized LoRA (QLoRA)
Principle
QLoRA combines LoRA with 4‑bit NormalFloat (NF4) quantization and double quantization of the quantization constants, shrinking the stored weights several‑fold. It also uses block‑wise quantization so that outliers in one block do not distort the scale of the others.
```python
from scipy.stats import norm
import torch

def create_normal_map(offset=0.9677083, use_extra_value=True):
    if use_extra_value:
        # Asymmetric variant: one extra positive level (15 non-zero values).
        v1 = norm.ppf(torch.linspace(offset, 0.5, 9)[:-1]).tolist()
        v2 = [0] * (256 - 15)  # pad with zeros up to 256 entries
        v3 = (-norm.ppf(torch.linspace(offset, 0.5, 8)[:-1])).tolist()
    else:
        # Symmetric variant (14 non-zero values).
        v1 = norm.ppf(torch.linspace(offset, 0.5, 8)[:-1]).tolist()
        v2 = [0] * (256 - 14)
        v3 = (-norm.ppf(torch.linspace(offset, 0.5, 8)[:-1])).tolist()
    v = v1 + v2 + v3

    values = torch.Tensor(v).sort().values
    values /= values.max()  # normalize levels into [-1, 1]
    assert values.numel() == 256
    return values

Q = create_normal_map()
```

The code (from bitsandbytes) builds the quantization map: the offset keeps norm.ppf away from the infinite quantiles at 0 and 1, and use_extra_value selects the asymmetric variant (15 non‑zero levels) versus the symmetric one (14 non‑zero levels); zeros pad the table to 256 entries.
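To see how such a map is used, here is a sketch of block-wise absmax quantization against a codebook. This is an assumed simplification of QLoRA's storage scheme, not its actual implementation; the function names and the 64-element block size are illustrative, and the tensor length is assumed divisible by the block size.

```python
import torch

def quantize_blockwise(w: torch.Tensor, code: torch.Tensor, block_size: int = 64):
    """Block-wise absmax quantization against a codebook of levels in [-1, 1].

    Each block is scaled by its absolute maximum, then every value is
    snapped to the nearest codebook entry. Returns indices + per-block scales.
    """
    blocks = w.flatten().view(-1, block_size)
    scales = blocks.abs().amax(dim=1, keepdim=True)  # one quantization constant per block
    normed = blocks / scales                          # now within [-1, 1]
    # Nearest codebook entry for every element.
    idx = (normed.unsqueeze(-1) - code).abs().argmin(dim=-1)
    return idx, scales

def dequantize_blockwise(idx, scales, code):
    # Look up each level and undo the per-block scaling.
    return (code[idx] * scales).flatten()
```

Double quantization then applies the same trick again to the per-block scales themselves, which is where the extra memory savings come from.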
Experiments
On the same NLU datasets, QLoRA achieved memory usage of 39.1 GB (vs. 55.7 GB for LoRA) with only a slight increase in training time.
| Method | Training Memory | Training Time |
|---|---|---|
| QLoRA | 39.1 GB | 8760 s |
| LoRA | 55.7 GB | 7192 s |
Adaptive Low‑Rank Adapter (AdaLoRA)
Principle
AdaLoRA replaces LoRA's fixed rank r with per‑module adaptive ranks via an SVD‑style parameterization: ΔW = PΛQ, where Λ is a diagonal matrix of singular values and P and Q play the role of the left and right singular vectors. An orthogonality regularizer on P and Q is added to the training loss to keep them approximately orthogonal without computing an exact SVD.
```python
self.lora_A[adapter_name] = nn.Parameter(torch.randn(r, self.in_features))
self.lora_E[adapter_name] = nn.Parameter(torch.randn(r, 1))  # singular values Λ
self.lora_B[adapter_name] = nn.Parameter(torch.randn(self.out_features, r))
```

During training, importance scores (sensitivity × uncertainty) are computed for each {P_i, λ_i, Q_i} triplet, and a budget‑driven pruning schedule masks low‑importance components.
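Given per-dimension importance scores, the budget-driven masking step can be sketched as follows. This is an illustrative simplification of AdaLoRA's scheduler; the function name and interface are assumptions, not the library's API.

```python
import torch

def mask_to_budget(importance: torch.Tensor, lora_E: torch.Tensor, budget: int):
    """Keep only the `budget` highest-importance singular values; zero the rest.

    importance: one score per rank dimension, shape (r,).
    lora_E:     learned singular values, shape (r, 1).
    """
    r = importance.numel()
    if budget >= r:
        return lora_E
    # Threshold at the budget-th largest score.
    kth = torch.topk(importance, budget).values.min()
    mask = (importance >= kth).float().view(-1, 1)
    return lora_E * mask  # pruned dimensions contribute nothing to ΔW = PΛQ
```

Because only Λ is masked, pruned dimensions can be revived later if their importance recovers, which is what makes the rank allocation adaptive.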
```python
def update_ipt(self, model):
    # Exponential moving averages of sensitivity (ipt) and its uncertainty.
    for n, p in model.named_parameters():
        if "lora_" in n and self.adapter_name in n:
            self.ipt[n] = (p * p.grad).abs().detach()
            self.exp_avg_ipt[n] = self.beta1 * self.exp_avg_ipt[n] + (1 - self.beta1) * self.ipt[n]
            self.exp_avg_unc[n] = self.beta2 * self.exp_avg_unc[n] + (1 - self.beta2) * (self.ipt[n] - self.exp_avg_ipt[n]).abs()
```

Experiments
AdaLoRA was evaluated on the same NLU datasets. While it reduced parameter count, its performance lagged behind LoRA (e.g., F1‑Score 82.37% vs. 86.44% for LoRA with r=8).
Sparse Low‑Rank Adaptation (SoRA)
Principle
SoRA adds a learnable gate vector g to the LoRA branch, enabling dynamic sparsification of rank dimensions via a proximal‑gradient (soft‑threshold) update:
g_{t+1} = T_{η·λ}(g_t − η ∇_g L(g_t)), where η is the step size and T_λ(x) = sign(x)·max(|x| − λ, 0) is the element‑wise soft‑threshold operator.
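As a concrete reference, the element-wise soft-threshold operator can be written in a few lines (a standard definition, not code from the SoRA implementation):

```python
import torch

def soft_threshold(g: torch.Tensor, lam: float) -> torch.Tensor:
    """T_lam: shrink every entry toward zero by lam; entries in [-lam, lam] become exactly 0."""
    return torch.sign(g) * torch.clamp(g.abs() - lam, min=0.0)
```

Entries driven to exactly zero switch off their rank dimension, which is how SoRA sparsifies the effective rank during training.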
```python
def __init__(self, ...):
    ...
    if r > 0:
        self.lora_A = nn.Parameter(weight.new_zeros((r, in_features)))
        self.lora_B = nn.Parameter(weight.new_zeros((out_features, r)))
        self.gate = nn.Parameter(torch.randn(1, r))  # learnable gate vector g
        self.scaling = self.lora_alpha / self.r
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        nn.init.zeros_(self.lora_B)

def forward(self, x):
    # The gate scales each rank dimension between the down- and up-projections.
    return ((self.lora_dropout(x) @ self.lora_A.T).mul(self.gate) @ self.lora_B.T) * self.scaling
```

The gate is updated outside the standard optimizer step using a sparsity‑inducing soft‑threshold rule:

```python
if self.sparse_lambda > 0:
    # Evaluate all three conditions on the original values: record which
    # entries fall inside [-λ, λ] before shrinking, otherwise values that
    # were just above λ would be shrunk and then wrongly zeroed as well.
    zero_mask = p.data.abs() < self.sparse_lambda
    p.data[p.data > self.sparse_lambda] -= self.sparse_lambda
    p.data[p.data < -self.sparse_lambda] += self.sparse_lambda
    p.data[zero_mask] = 0.0
```

Experiments
SoRA achieved the best F1‑Score among the three methods on the second NLU business dataset (89.81% precision, 86.08% recall, 86.52% F1), demonstrating the benefit of adaptive sparsity.
Training Acceleration with Unsloth
Unsloth rewrites key Transformer kernels (RoPE, MLP, LayerNorm) in Triton to reduce memory usage and speed up back‑propagation. It works with LoRA/QLoRA and causes no accuracy degradation.
```python
from peft import LoraConfig, TaskType, get_peft_model
from unsloth import FastLanguageModel

if model_args.use_unsloth:
    unsloth_peft_kwargs = {
        "model": model,
        "max_seq_length": model_args.model_max_length,
        "use_gradient_checkpointing": "unsloth",
    }
    model = FastLanguageModel.get_peft_model(**peft_kwargs, **unsloth_peft_kwargs)
else:
    lora_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        inference_mode=False,
        use_dora=finetuning_args.use_dora,
        **peft_kwargs,
    )
    model = get_peft_model(model, lora_config)
```

Benchmarks on A800 GPUs showed roughly a 30% reduction in training time and over 40% memory savings for Qwen‑1.5‑7B and Llama‑3 models, with further gains when combined with FlashAttention.
| Model (batch size) | Training Time (Flash‑Attn) | Training Time (Unsloth) | Speed‑up |
|---|---|---|---|
| qwen‑7b (4) | 13.67 h | 9.67 h | −29.3% (1.41×) |
| qwen‑7b (16) | 12 h | 7.75 h | −35.4% (1.55×) |
Conclusion
The article presented a comprehensive overview of PEFT methods—Adapter, Prefix, LoRA, QLoRA, AdaLoRA, and SoRA—along with practical acceleration using Unsloth. Experiments demonstrate that combining efficient low‑parameter fine‑tuning with kernel‑level speedups enables high‑quality LLM adaptation with minimal resource consumption.
Author
Liuhui, AI Lab Algorithm Engineer at 58.com, responsible for SFT of the ChatLing LLM.
References
Houlsby et al., "Parameter‑efficient transfer learning for NLP", ICML 2019.
Li & Liang, "Prefix‑tuning: Optimizing continuous prompts for generation", arXiv 2021.
Hu et al., "LoRA: Low‑rank adaptation of large language models", arXiv 2021.
Dettmers et al., "QLoRA: Efficient finetuning of quantized LLMs", NeurIPS 2023.
Zhang et al., "Adaptive budget allocation for parameter‑efficient fine‑tuning", ICLR 2023.
Ding et al., "Sparse low‑rank adaptation of pre‑trained language models", arXiv 2023.
https://github.com/huggingface/peft
58 Tech
Official tech channel of 58, a platform for tech innovation, sharing, and communication.