
Understanding Large Language Model Architecture, Parameters, Memory, Storage, and Fine‑Tuning Techniques

This article provides a comprehensive overview of large language models (LLMs), covering their transformer architecture, parameter counts, GPU memory and storage requirements, and detailed fine‑tuning methods such as prompt engineering, data construction, LoRA, PEFT, RLHF, and DPO, along with practical deployment and inference acceleration strategies.

DaTaobao Tech

The article introduces the structure of large language models (LLMs), explaining that modern LLMs are built on the Transformer decoder stack, where each layer consists of a multi‑head self‑attention block and an MLP block. It derives the per‑layer parameter count as 12h² + 13h and the total model size as L(12h² + 13h) + Vh, where L is the number of layers, h the hidden dimension, and V the vocabulary size.
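The formula is easy to sanity‑check numerically. A minimal sketch, using illustrative LLaMA‑7B‑like dimensions (L = 32, h = 4096, V = 32000, which are assumptions, not figures from the article):

```python
# Parameter-count sanity check for the formula L * (12h^2 + 13h) + V*h.
# L, h, V are illustrative LLaMA-7B-like values (assumed, not from the article).
L, h, V = 32, 4096, 32000

per_layer = 12 * h**2 + 13 * h   # attention + MLP + layer-norm parameters
total = L * per_layer + V * h    # add the embedding matrix

print(f"per layer: {per_layer:,}")   # ~201M parameters per layer
print(f"total:     {total:,}")       # ~6.6B, in the right range for LLaMA-7B
```

The result lands close to LLaMA‑7B's actual 6.7B parameters, which is what one would expect from an approximation that ignores minor terms.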

Memory consumption is estimated by multiplying the number of parameters by the byte size of the chosen precision (e.g., 4 bytes for fp32, 2 bytes for fp16). For a 70B‑parameter model such as LLaMA‑70B, the required GPU memory in 16‑bit precision is roughly 168 GB (about 140 GB of raw weights plus a ~20% allowance for runtime overhead such as activations and the KV cache).
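The same arithmetic as a short sketch; the ~20% overhead factor is a common rule of thumb assumed here, not a figure derived in the article:

```python
# GPU memory estimate: parameters * bytes-per-parameter, plus overhead.
# The 1.2 overhead factor (activations, KV cache, buffers) is an assumed
# rule of thumb, not a figure from the article.
params = 70e9           # 70B-parameter model
bytes_per_param = 2     # fp16/bf16
overhead = 1.2          # assumed ~20% runtime overhead

weights_gb = params * bytes_per_param / 1e9
total_gb = weights_gb * overhead
print(f"weights: {weights_gb:.0f} GB, with overhead: {total_gb:.0f} GB")
# weights: 140 GB, with overhead: 168 GB
```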

Storage requirements are discussed, noting that fp16 weights halve the size compared to fp32, with typical LLaMA‑7B storage around 13.5 GB after accounting for layer‑norm parameters.

Fine‑tuning is divided into three major components:

Prompt engineering: crafting structured, specific prompts; best practices are linked to OpenAI’s guide.

Data construction: generating high‑quality instruction data (e.g., via self‑instruct) and ensuring diverse, balanced datasets.

Parameter‑efficient fine‑tuning (PEFT): techniques like LoRA that add low‑rank adapters to the attention matrices (weights Wq, Wk, Wv, Wo) while keeping the base model frozen, drastically reducing the number of trainable parameters.
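The "drastic reduction" is easy to quantify: LoRA replaces a full h × h weight update with a rank‑r product B·A, so each adapted matrix trains 2rh parameters instead of h². A minimal sketch with assumed illustrative dimensions (h = 4096, r = 8):

```python
# LoRA trainable-parameter sketch: instead of updating a full h x h attention
# weight, LoRA learns a low-rank update delta_W = B @ A, with B of shape
# (h, r) and A of shape (r, h). Dimensions below are illustrative assumptions.
h, r = 4096, 8                      # hidden size, LoRA rank
n_layers, mats_per_layer = 32, 4    # Wq, Wk, Wv, Wo per layer

full = h * h        # parameters in one full attention matrix
lora = 2 * r * h    # parameters in B and A combined

ratio = lora / full  # fraction of parameters that remain trainable per matrix
print(f"full matrix: {full:,}  lora adapter: {lora:,}  fraction: {ratio:.4%}")
```

At rank 8 the adapter trains well under 1% of each matrix's parameters, which is why the base model can stay frozen in cheap memory formats while only the adapters receive gradients.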

The article also describes reinforcement learning from human feedback (RLHF) and the newer Direct Preference Optimization (DPO) method, which requires only a reference model and a policy model, simplifying training compared to PPO.
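DPO's simplification comes from its loss, which compares the policy's and the frozen reference's log‑probabilities on a chosen/rejected answer pair directly, with no reward model or rollouts. A minimal sketch of the per‑pair loss (the log‑probability values are made‑up numbers for illustration):

```python
import math

# DPO loss for one (chosen, rejected) preference pair:
#   loss = -log sigmoid( beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)] )
# Only the trainable policy and a frozen reference model are needed.
def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    margin = (policy_logp_chosen - ref_logp_chosen) - \
             (policy_logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy prefers the chosen answer more strongly than the reference -> low loss.
print(dpo_loss(-12.0, -20.0, -14.0, -15.0))
```

When the policy matches the reference exactly the margin is zero and the loss is log 2; the loss falls as the policy shifts probability mass toward the preferred answers.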

Practical training pipelines are presented using internal platforms (Nebula, TuningFactory, Whale) and external tools such as LLaMA‑Factory. Example command‑line snippets illustrate how to launch LoRA or DPO training with DeepSpeed configuration:

<code># Example launch configuration for DPO training (the full argument list
# is elided in the original).
WORLD_SIZE=8    # number of GPUs / processes
LR=1e-5         # learning rate
LORA_CKPT="digital_live_chat.sft_model_whale/version=v20.26/ckpt_id=checkpoint-210"
args="--stage dpo \
    --model_name_or_path=$MODEL_NAME \
    --do_train \
    --do_eval \
    ..."
</code>

Deployment steps cover model registration on the Whale platform, inference acceleration options (fp16/int8 precision, system‑prompt caching, speculative decoding), and SDK usage:

<code>from whale import TextGeneration, VipServerLocator

response = TextGeneration.chat(
    model="Qwen-72B-Chat-Pro",
    messages=msgs,
    stream=True,
    temperature=1.0,
    max_tokens=2000,
    timeout=Timeout(60, 20),
    top_p=0.8,
    extend_fields=extend_fields,
)
</code>
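Of the acceleration options listed, int8 precision is the most self‑contained to illustrate. A pure‑Python sketch of symmetric per‑tensor int8 weight quantization, standing in for what a serving stack does with real tensors (the weight values are made up):

```python
# Symmetric per-tensor int8 quantization sketch: map floats into [-127, 127]
# with a single scale, then reconstruct and measure the rounding error.
def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.32, -1.27, 0.05, 0.88]       # illustrative weight values
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
print(q, f"max reconstruction error: {max_err:.4f}")
```

Halving storage again versus fp16 costs at most half a quantization step of error per weight, which is why int8 (often with per‑channel scales in practice) is a common first lever for inference speedups.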

Evaluation strategies emphasize building a representative test set, defining clear metrics (human or model‑based), and iterating on data quality, model size, and training hyper‑parameters to achieve the desired performance before production rollout.
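One concrete "clear metric" of the kind described is a pairwise win rate of the fine‑tuned model against a baseline, judged by humans or a judge model. A minimal sketch with made‑up judgments:

```python
# Win-rate over a pairwise evaluation set: each entry is a human or
# model-based judgment of candidate vs. baseline. Data is illustrative.
judgments = ["win", "win", "tie", "loss", "win", "tie", "win", "loss"]

wins = judgments.count("win")
ties = judgments.count("tie")
win_rate = (wins + 0.5 * ties) / len(judgments)  # count ties as half a win
print(f"win rate vs. baseline: {win_rate:.1%}")
```

Tracking this number across data, model-size, and hyper‑parameter iterations gives the go/no‑go signal for production rollout that the article describes.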

Tags: LLM, prompt engineering, Model Deployment, Fine-tuning, LoRA, RLHF, DPO
Written by DaTaobao Tech, the official account of DaTaobao Technology.