Parameter-Efficient Fine-Tuning (PEFT) Methods for Large Language Models: LoRA, QLoRA, AdaLoRA, SoRA, and Training Acceleration with Unsloth
This article systematically analyzes popular parameter‑efficient fine‑tuning (PEFT) techniques for large language models—including Adapter Tuning, Prefix Tuning, LoRA, QLoRA, AdaLoRA, and SoRA—detailing their principles, implementation code, experimental results on NLU tasks, and practical acceleration using the Unsloth library.
Introduction
In 2023, large language models (LLMs) proliferated rapidly. 58.com TEG‑AI Lab built a domain‑specific LLM called ChatLing for real‑estate, recruitment, automotive, and yellow‑page services, achieving better performance than both open‑source and commercial general LLMs.
Parameter‑Efficient Fine‑Tuning (PEFT)
Fine‑tuning a pretrained model is cost‑effective compared to training from scratch, but full‑parameter fine‑tuning of massive LLMs is prohibitively expensive. PEFT methods keep the backbone frozen and only train a small additional module, dramatically reducing memory and compute requirements.
Adapter Tuning
Adapter Tuning inserts a lightweight Adapter module after each Transformer layer. During fine‑tuning only the Adapter parameters are updated while the original weights remain frozen. This reduces trainable parameters but adds extra depth, slightly increasing inference latency.
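As a concrete illustration, a bottleneck adapter can be sketched as follows. This is a minimal PyTorch sketch in the spirit of Houlsby et al.; the bottleneck size, GELU activation, and zero initialization of the up-projection are illustrative choices, not taken from the article.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual."""

    def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.GELU()
        # Zero-init the up-projection so the adapter starts as the identity
        # and the frozen backbone's behavior is preserved at step 0.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))
```

During fine-tuning, only these adapter parameters receive gradients; the surrounding Transformer weights stay frozen.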
Prefix Tuning
Prefix Tuning prepends a learnable continuous vector sequence (the prefix ) to the input. The prefix is optimized while the backbone stays frozen, allowing task‑specific adaptation with minimal additional storage. However, it increases the input length and may affect inference speed.
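The input-level idea can be sketched as follows. Note this is a simplification: the original method injects prefix key/value vectors into every attention layer, whereas this toy module only prepends trainable vectors to the token embeddings; all names are illustrative.

```python
import torch
import torch.nn as nn

class PrefixEmbedding(nn.Module):
    """Prepends a trainable continuous prefix to the token embeddings."""

    def __init__(self, prefix_len: int, hidden_dim: int):
        super().__init__()
        # The only trainable parameters: one vector per prefix position.
        self.prefix = nn.Parameter(torch.randn(prefix_len, hidden_dim) * 0.02)

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # token_embeds: (batch, seq_len, hidden_dim)
        batch = token_embeds.size(0)
        prefix = self.prefix.unsqueeze(0).expand(batch, -1, -1)
        # Output length grows by prefix_len, which is the latency cost noted above.
        return torch.cat([prefix, token_embeds], dim=1)
```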
Low‑Rank Adaptation (LoRA)
Principle
LoRA adds two low‑rank matrices A (r×d) and B (d×r) alongside the frozen weight W. The effective weight becomes W + BA, reducing trainable parameters from d² to 2·d·r (r ≪ d). A uses Gaussian initialization and B zero initialization, so ΔW = BA starts at zero and training begins exactly from the pretrained model. After training, BA can be merged into W, leaving the model architecture unchanged at inference.
Key advantages:
Trainable parameters drop from d² to 2·d·r.
No extra computation at inference time.
Multiple LoRA adapters can be stored and loaded independently.
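A minimal LoRA wrapper around a frozen nn.Linear, following the initialization described above (class name and hyperparameter defaults are illustrative, not from the article):

```python
import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W + (alpha/r) * B @ A."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # backbone stays frozen
        d_out, d_in = base.weight.shape
        # A: Gaussian (Kaiming) init, B: zeros, so BA = 0 at the start.
        self.lora_A = nn.Parameter(torch.empty(r, d_in))
        self.lora_B = nn.Parameter(torch.zeros(d_out, r))
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
```

Because only lora_A and lora_B are trainable, one backbone can serve many tasks by swapping these small matrices.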
Experiments
Fine‑tuning on two NLU business datasets showed that LoRA (r=2 or r=8) achieved performance comparable to full‑parameter fine‑tuning and significantly better than GPT‑4 zero‑shot.
| Method | Precision | Recall | F1‑Score |
|---|---|---|---|
| Full‑parameter | 89.96% | 85.53% | 87.68% |
| LoRA (r=2) | 89.42% | 85.85% | 86.23% |
| LoRA (r=8) | 89.54% | 86.32% | 86.44% |
Quantized LoRA (QLoRA)
Principle
QLoRA combines LoRA with 4‑bit NormalFloat (NF4) quantization and double quantization of the quantization constants, shrinking the stored weights several‑fold. It also uses block‑wise quantization so that outliers in one block do not distort the scale of the others.
```python
from scipy.stats import norm
import torch

def create_normal_map(offset=0.9677083, use_extra_value=True):
    if use_extra_value:
        # Asymmetric variant: one extra positive level (15 non-zero values).
        v1 = norm.ppf(torch.linspace(offset, 0.5, 9)[:-1]).tolist()
        v2 = [0] * (256 - 15)  # pad with zeros up to 256 entries
        v3 = (-norm.ppf(torch.linspace(offset, 0.5, 8)[:-1])).tolist()
    else:
        # Symmetric variant (14 non-zero values).
        v1 = norm.ppf(torch.linspace(offset, 0.5, 8)[:-1]).tolist()
        v2 = [0] * (256 - 14)
        v3 = (-norm.ppf(torch.linspace(offset, 0.5, 8)[:-1])).tolist()
    v = v1 + v2 + v3

    values = torch.Tensor(v).sort().values
    values /= values.max()  # normalize levels into [-1, 1]
    assert values.numel() == 256
    return values

Q = create_normal_map()
```

The code (from bitsandbytes) builds the quantization map: the offset keeps norm.ppf away from the infinite quantiles at 0 and 1, and use_extra_value selects the asymmetric variant (15 non‑zero levels) versus the symmetric one (14 non‑zero levels); zeros pad the table to 256 entries.
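To see how such a map is used, here is a sketch of block-wise absmax quantization against a codebook. This is an assumed simplification of QLoRA's storage scheme, not its actual implementation; the function names and the 64-element block size are illustrative, and the tensor length is assumed divisible by the block size.

```python
import torch

def quantize_blockwise(w: torch.Tensor, code: torch.Tensor, block_size: int = 64):
    """Block-wise absmax quantization against a codebook of levels in [-1, 1].

    Each block is scaled by its absolute maximum, then every value is
    snapped to the nearest codebook entry. Returns indices + per-block scales.
    """
    blocks = w.flatten().view(-1, block_size)
    scales = blocks.abs().amax(dim=1, keepdim=True)  # one quantization constant per block
    normed = blocks / scales                          # now within [-1, 1]
    # Nearest codebook entry for every element.
    idx = (normed.unsqueeze(-1) - code).abs().argmin(dim=-1)
    return idx, scales

def dequantize_blockwise(idx, scales, code):
    # Look up each level and undo the per-block scaling.
    return (code[idx] * scales).flatten()
```

Double quantization then applies the same trick again to the per-block scales themselves, which is where the extra memory savings come from.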
Experiments
On the same NLU datasets, QLoRA achieved memory usage of 39.1 GB (vs. 55.7 GB for LoRA) with only a slight increase in training time.
| Method | Training Memory | Training Time |
|---|---|---|
| QLoRA | 39.1 GB | 8760 s |
| LoRA | 55.7 GB | 7192 s |
Adaptive Low‑Rank Adapter (AdaLoRA)
Principle
AdaLoRA replaces LoRA's fixed rank r with per‑module adaptive ranks via an SVD‑style parameterization: ΔW = PΛQ, where Λ is a diagonal matrix of singular values and P and Q play the role of the left and right singular vectors. An orthogonality regularizer on P and Q is added to the training loss to keep them approximately orthogonal without computing an exact SVD.
```python
self.lora_A[adapter_name] = nn.Parameter(torch.randn(r, self.in_features))
self.lora_E[adapter_name] = nn.Parameter(torch.randn(r, 1))  # singular values Λ
self.lora_B[adapter_name] = nn.Parameter(torch.randn(self.out_features, r))
```

During training, importance scores (sensitivity × uncertainty) are computed for each {P_i, λ_i, Q_i} triplet, and a budget‑driven pruning schedule masks low‑importance components.
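Given per-dimension importance scores, the budget-driven masking step can be sketched as follows. This is an illustrative simplification of AdaLoRA's scheduler; the function name and interface are assumptions, not the library's API.

```python
import torch

def mask_to_budget(importance: torch.Tensor, lora_E: torch.Tensor, budget: int):
    """Keep only the `budget` highest-importance singular values; zero the rest.

    importance: one score per rank dimension, shape (r,).
    lora_E:     learned singular values, shape (r, 1).
    """
    r = importance.numel()
    if budget >= r:
        return lora_E
    # Threshold at the budget-th largest score.
    kth = torch.topk(importance, budget).values.min()
    mask = (importance >= kth).float().view(-1, 1)
    return lora_E * mask  # pruned dimensions contribute nothing to ΔW = PΛQ
```

Because only Λ is masked, pruned dimensions can be revived later if their importance recovers, which is what makes the rank allocation adaptive.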
```python
def update_ipt(self, model):
    # Exponential moving averages of sensitivity (ipt) and its uncertainty.
    for n, p in model.named_parameters():
        if "lora_" in n and self.adapter_name in n:
            self.ipt[n] = (p * p.grad).abs().detach()
            self.exp_avg_ipt[n] = self.beta1 * self.exp_avg_ipt[n] + (1 - self.beta1) * self.ipt[n]
            self.exp_avg_unc[n] = self.beta2 * self.exp_avg_unc[n] + (1 - self.beta2) * (self.ipt[n] - self.exp_avg_ipt[n]).abs()
```

Experiments
AdaLoRA was evaluated on the same NLU datasets. While it reduced parameter count, its performance lagged behind LoRA (e.g., F1‑Score 82.37% vs. 86.44% for LoRA with r=8).
Sparse Low‑Rank Adaptation (SoRA)
Principle
SoRA adds a learnable gate vector g to the LoRA branch, enabling dynamic sparsification of rank dimensions via a proximal‑gradient (soft‑threshold) update:
g_{t+1} = T_{η·λ}(g_t − η ∇_g L(g_t)), where η is the step size and T_λ(x) = sign(x)·max(|x| − λ, 0) is the element‑wise soft‑threshold operator.
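As a concrete reference, the element-wise soft-threshold operator can be written in a few lines (a standard definition, not code from the SoRA implementation):

```python
import torch

def soft_threshold(g: torch.Tensor, lam: float) -> torch.Tensor:
    """T_lam: shrink every entry toward zero by lam; entries in [-lam, lam] become exactly 0."""
    return torch.sign(g) * torch.clamp(g.abs() - lam, min=0.0)
```

Entries driven to exactly zero switch off their rank dimension, which is how SoRA sparsifies the effective rank during training.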
```python
def __init__(self, ...):
    ...
    if r > 0:
        self.lora_A = nn.Parameter(weight.new_zeros((r, in_features)))
        self.lora_B = nn.Parameter(weight.new_zeros((out_features, r)))
        self.gate = nn.Parameter(torch.randn(1, r))  # learnable gate vector g
        self.scaling = self.lora_alpha / self.r
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        nn.init.zeros_(self.lora_B)

def forward(self, x):
    # The gate scales each rank dimension between the down- and up-projections.
    return ((self.lora_dropout(x) @ self.lora_A.T).mul(self.gate) @ self.lora_B.T) * self.scaling
```

The gate is updated outside the standard optimizer step using a sparsity‑inducing soft‑threshold rule:

```python
if self.sparse_lambda > 0:
    # Evaluate all three conditions on the original values: record which
    # entries fall inside [-λ, λ] before shrinking, otherwise values that
    # were just above λ would be shrunk and then wrongly zeroed as well.
    zero_mask = p.data.abs() < self.sparse_lambda
    p.data[p.data > self.sparse_lambda] -= self.sparse_lambda
    p.data[p.data < -self.sparse_lambda] += self.sparse_lambda
    p.data[zero_mask] = 0.0
```

Experiments
SoRA achieved the best F1‑Score among the three methods on the second NLU business dataset (89.81% precision, 86.08% recall, 86.52% F1), demonstrating the benefit of adaptive sparsity.
Training Acceleration with Unsloth
Unsloth rewrites key Transformer kernels (RoPE, MLP, LayerNorm) in Triton to reduce memory usage and speed up back‑propagation. It works with LoRA/QLoRA and causes no accuracy degradation.
```python
from peft import LoraConfig, TaskType, get_peft_model
from unsloth import FastLanguageModel

if model_args.use_unsloth:
    unsloth_peft_kwargs = {
        "model": model,
        "max_seq_length": model_args.model_max_length,
        "use_gradient_checkpointing": "unsloth",
    }
    model = FastLanguageModel.get_peft_model(**peft_kwargs, **unsloth_peft_kwargs)
else:
    lora_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        inference_mode=False,
        use_dora=finetuning_args.use_dora,
        **peft_kwargs,
    )
    model = get_peft_model(model, lora_config)
```

Benchmarks on A800 GPUs showed roughly a 30% reduction in training time and over 40% memory savings for Qwen‑1.5‑7B and Llama‑3 models, with further gains when combined with FlashAttention.
| Model (batch size) | Training Time (Flash‑Attn) | Training Time (Unsloth) | Speed‑up |
|---|---|---|---|
| qwen‑7b (4) | 13.67 h | 9.67 h | −29.3% (1.41×) |
| qwen‑7b (16) | 12 h | 7.75 h | −35.4% (1.55×) |
Conclusion
The article presented a comprehensive overview of PEFT methods—Adapter, Prefix, LoRA, QLoRA, AdaLoRA, and SoRA—along with practical acceleration using Unsloth. Experiments demonstrate that combining efficient low‑parameter fine‑tuning with kernel‑level speedups enables high‑quality LLM adaptation with minimal resource consumption.
Author
Liuhui, AI Lab Algorithm Engineer at 58.com, responsible for SFT of the ChatLing LLM.
References
Houlsby et al., "Parameter‑efficient transfer learning for NLP", ICML 2019.
Li & Liang, "Prefix‑tuning: Optimizing continuous prompts for generation", arXiv 2021.
Hu et al., "LoRA: Low‑rank adaptation of large language models", arXiv 2021.
Dettmers et al., "QLoRA: Efficient finetuning of quantized LLMs", NeurIPS 2023.
Zhang et al., "Adaptive budget allocation for parameter‑efficient fine‑tuning", ICLR 2023.
Ding et al., "Sparse low‑rank adaptation of pre‑trained language models", arXiv 2023.
https://github.com/huggingface/peft
58 Tech
Official tech channel of 58, a platform for tech innovation, sharing, and communication.