Prompt Engineering, LLM Supervised Fine‑Tuning, and Mobile Tmall AI Assistant Application
This article introduces prompt design, large language model (LLM) supervised fine‑tuning (SFT), and their practical deployment in the Mobile Tmall AI shopping assistant project.
ChatGPT basics: Generation proceeds in five steps – preprocessing and tokenizing the input text, mapping tokens to embeddings, passing them through a stack of Transformer layers, predicting the next token from a softmax distribution over the vocabulary, and appending that token and repeating until a stop token or the maximum length is reached.
Algorithm core – Transformer: The model is built from an encoder stack and a decoder stack, as illustrated in the accompanying diagram.
Prompt design covers four essential techniques:
Clarity – use explicit, unambiguous language.
Delimiters – separate instructions and content with symbols such as ###, """, <>, or '''.
Output format – specify the desired structure (e.g., JSON).
Role‑play – instruct the model to assume a specific persona (e.g., a professional sales assistant).
Examples of good vs. bad prompts are shown in tables, demonstrating how precise wording and formatting improve model responses.
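A single prompt can combine all four techniques. The sketch below builds such a prompt; the product review and the JSON field names are invented for illustration.

```python
# Illustrative review text (not from the article's tables).
product_review = "The jacket runs small but the fabric quality is excellent."

prompt = (
    "You are a professional e-commerce sales assistant. "            # role-play
    "Summarize the customer review delimited by triple quotes, "     # clarity
    'and reply ONLY with JSON: {"sentiment": "...", "sizing": "..."}.'  # output format
    f'\n"""{product_review}"""'                                      # delimiters
)
print(prompt)
```

Keeping the instruction outside the delimiters and the untrusted content inside them also makes prompt-injection attempts inside the review easier for the model to ignore.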
Advanced prompting includes few‑shot learning, chain‑of‑thought (CoT) reasoning, and in‑context learning. Sample one‑shot CoT prompts illustrate how step‑by‑step reasoning yields correct arithmetic answers.
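A one-shot chain-of-thought prompt in the spirit of those samples might look as follows; the arithmetic example is invented for illustration.

```python
# One solved example (the "shot") demonstrates step-by-step reasoning,
# then the trailing cue invites the model to reason the same way.
cot_prompt = """Q: A store sells 3 shirts at 20 yuan each. What is the total?
A: Each shirt costs 20 yuan and there are 3 shirts, so 3 * 20 = 60. The total is 60 yuan.

Q: A customer buys 4 mugs at 15 yuan each. What is the total?
A: Let's think step by step."""
print(cot_prompt)
```

The worked example shows the model the reasoning format, and the "Let's think step by step" cue reliably elicits intermediate steps before the final answer.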
Supervised fine‑tuning (SFT) vs. pre‑training:
Pre‑training learns next‑token prediction from massive unlabeled data, giving the model general language understanding.
Instruction fine‑tuning (SFT) uses labeled instruction‑response pairs to align the model with human intents, especially for domain‑specific tasks.
Related techniques such as P‑tuning, P‑tuning V2, and LoRA are described. LoRA adds low‑rank adapters to keep most model parameters frozen, enabling efficient adaptation with limited resources.
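The LoRA idea can be shown with plain numpy: the frozen weight W is augmented with a trainable low-rank product B·A, so only r·(d_in + d_out) parameters are trained instead of d_in·d_out. The dimensions and scaling factor below are illustrative, not the article's training settings.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 1024, 1024, 8            # r << d_in, d_out

W = rng.normal(size=(d_out, d_in))        # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01     # trainable down-projection
B = np.zeros((d_out, r))                  # trainable up-projection, zero-initialized

def lora_forward(x, alpha=16):
    # y = W x + (alpha / r) * B A x  -- base path frozen, adapter trainable
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=(d_in,))
# With B initialized to zero, the adapter starts as an exact no-op:
assert np.allclose(lora_forward(x), W @ x)

full_params = d_in * d_out
adapter_params = r * (d_in + d_out)
print(f"trainable: {adapter_params} vs full: {full_params}")  # → trainable: 16384 vs full: 1048576
```

Zero-initializing B guarantees training starts from the pretrained model's behavior; after fine-tuning, B·A can be merged back into W so inference costs nothing extra.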
C‑Eval is a comprehensive Chinese benchmark covering humanities, social sciences, and STEM subjects, containing 13,948 questions across 52 disciplines.
Data collection for the AI assistant involves:
Gathering seed e‑commerce queries from conversation logs.
Generalizing questions via prompt‑driven generation.
Human annotation of high‑quality data.
Self‑instruction to expand the dataset using LLMs.
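The generalization and self-instruction steps both amount to formatting seed queries into a generation prompt for an LLM. A hedged sketch, with invented seed questions and wording, since the article does not show its actual templates:

```python
# Seed e-commerce queries (illustrative, translated: "Does this come in
# other colors?" / "How soon does it ship?").
seed_queries = ["Does this jacket come in other colors?", "How soon does it ship?"]

def build_expansion_prompt(seeds, n=5):
    """Format seed questions into a self-instruction expansion prompt."""
    examples = "\n".join(f"- {q}" for q in seeds)
    return (
        "You are generating training data for an e-commerce shopping assistant.\n"
        f"Given these seed questions:\n{examples}\n"
        f"Write {n} new, diverse questions a shopper might realistically ask."
    )

print(build_expansion_prompt(seed_queries))
```

The LLM's outputs would then be deduplicated and passed to the human-annotation step before entering the SFT dataset.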
Model training uses the Qwen‑14B base model with the following command‑line configuration:
params="--stage sft \
  --model_name_or_path /data/oss_bucket_0/Qwen_14B_Chat_ms_v100/ \
  --do_train \
  --dataset_dir data \
  --dataset xuanji \
  --template chatml \
  --finetuning_type full \
  --output_dir file_path \
  --overwrite_cache \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 4 \
  --lr_scheduler_type cosine \
  --logging_steps 5 \
  --save_strategy epoch \
  --save_steps 10000 \
  --learning_rate 2e-6 \
  --num_train_epochs 3.0 \
  --warmup_ratio 0.15 \
  --warmup_steps 0 \
  --weight_decay 0.1 \
  --fp16 ${fp16} \
  --bf16 ${bf16} \
  --deepspeed ds_config.json \
  --max_source_length 4096 \
  --max_target_length 4096 \
  --use_fast_tokenizer False \
  --is_shuffle True \
  --val_size 0.0"
The training job is submitted on PAI with:
pai -name pytorch112z \
  -project algo_platform_dev \
  -Dscript='${job_path}' \
  -DentryFile='-m torch.distributed.launch --nnodes=${workerCount} --nproc_per_node=${node} ${entry_file}' \
  -DuserDefinedParameters="${params}" \
  -DworkerCount=${workerCount} \
  -Dcluster=${resource_param_config} \
  -Dbuckets=${oss_info}${end_point}
Model deployment & inference examples:
DashScope Python SDK:
import dashscope
from dashscope import Generation
from http import HTTPStatus

dashscope.api_key = 'your-dashscope-api-key'
response_generator = Generation.call(
    model='model_name',
    prompt=build_prompt([
        {'role': 'system', 'content': 'content_info'},
        {'role': 'user', 'content': 'query'}
    ]),
    stream=True,
    use_raw_prompt=True,
    seed=random_num
)
for resp in response_generator:
    if resp.status_code == HTTPStatus.OK:
        print(resp.output)
    else:
        print('Failed request_id: %s, status_code: %s, code: %s, message: %s'
              % (resp.request_id, resp.status_code, resp.code, resp.message))
Whale private‑cloud SDK:
from whale import TextGeneration
import json

TextGeneration.set_api_key("api_key", base_url="api_url")
config = {
    "pad_token_id": 0, "bos_token_id": 1, "eos_token_id": 2,
    "user_token_id": 0, "assistant_token_id": 0,
    "max_new_tokens": 2048, "temperature": 0.95,
    "top_k": 5, "top_p": 0.7, "repetition_penalty": 1.1,
    "do_sample": False, "transformers_version": "4.29.2"
}
prompt = [{"role": "user", "content": "content_info"}]
response = TextGeneration.call(
    model="model_name",
    prompt=json.dumps(prompt),
    timeout=120,
    streaming=True,
    generate_config=config
)
for event in response:
    if event.status_code == 200:
        if not event.finished:
            print(event.output['response'], end="")
    else:
        print('error_code: [%d], error_message: [%s]'
              % (event.status_code, event.status_message))
Evaluation combines public benchmarks (knowledge, reasoning, multilingual) and internal business tests (150 questions per task). Model performance is monitored via logging and periodic reviews.
References include seminal works such as "Attention Is All You Need" and recent open‑source LLM resources (Qwen‑14B, ChatGLM‑6B, Baichuan2, Stanford Alpaca, etc.).
DaTaobao Tech (official account of DaTaobao Technology)