
Guide to Using SPTM (Simple Pre-trained Model) with qa_match for an AI Competition

This article provides a step‑by‑step tutorial on preparing data, pre‑training the SPTM language model, fine‑tuning a text‑classification model, generating predictions, and creating a submission file for the 58.com AI algorithm competition using the open‑source qa_match toolkit.

58 Tech

Intelligent customer service built on AI relies on text matching and text classification as core NLP techniques. The open-source qa_match toolkit, which supports question answering over one- and two-layer knowledge bases, implements both and is the basis of this guide for the 58.com AI algorithm competition.

Background

The first 58.com AI algorithm competition has been announced, with 159 teams registered. This guide explains how to use the SPTM (Simple Pre-trained Model) component of qa_match for the contest.

Model and Data Preparation

1. Download SPTM source code

git clone https://github.com/wuba/qa_match.git

2. Enter the qa_match directory

cd qa_match

3. Download the competition data and unzip it into qa_match/data_demo, creating a data folder.

SPTM Model Pre‑training

Navigate to the sptm folder:

cd sptm

Create a directory for the pre‑trained model:

mkdir -p model/pretrain

Run the pre‑training script (TensorFlow 1.8–2.0, Python 3, GPU environment):

nohup python run_pretraining.py \
  --train_file="../data_demo/data/pre_train_data" \
  --vocab_file="../data_demo/data/vocab" \
  --model_save_dir="./model/pretrain" \
  --batch_size=512 \
  --print_step=100 \
  --weight_decay=0 \
  --embedding_dim=1000 \
  --lstm_dim=500 \
  --layer_num=1 \
  --train_step=100000 \
  --warmup_step=10000 \
  --learning_rate=5e-5 \
  --dropout_rate=0.1 \
  --max_predictions_per_seq=10 \
  --clip_norm=1.0 \
  --max_seq_len=100 \
  --use_queue=0 > pretrain.log 2>&1 &

The key parameters map to the flags above: vocab_file (vocabulary), train_file (pre-training corpus), lstm_dim (LSTM hidden size), embedding_dim (embedding size), dropout_rate, layer_num (number of LSTM layers), weight_decay, max_predictions_per_seq (masked positions per sequence), clip_norm (gradient clipping), and use_queue (whether to feed data through an input queue). After training completes successfully, the model checkpoint is saved under model/pretrain.
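The warmup_step and learning_rate flags describe a warmup schedule of the kind commonly used in pre-training: the learning rate ramps up linearly for the first warmup_step steps, then decays. The sketch below illustrates the idea with the values from the command above; lr_at_step is a hypothetical helper, and the exact schedule inside run_pretraining.py may differ.

```python
def lr_at_step(step, base_lr=5e-5, warmup_steps=10000, total_steps=100000):
    """Linear warmup to base_lr, then linear decay to zero.

    Illustrative only; run_pretraining.py may implement a different schedule.
    """
    if step < warmup_steps:
        # Ramp up: small learning rate early stabilizes pre-training.
        return base_lr * step / warmup_steps
    # Decay linearly over the remaining steps.
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

print(lr_at_step(5000))    # mid-warmup: half of base_lr
print(lr_at_step(10000))   # peak: base_lr
```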

Training the Classification Model

Split the provided train_data into training and validation sets:

shuf ../data_demo/data/train_data | tr -d "\r" > ../data_demo/data/train_data_shuf
head -n1000 ../data_demo/data/train_data_shuf > ../data_demo/data/valid_data_final
tail -n+1001 ../data_demo/data/train_data_shuf > ../data_demo/data/train_data_final
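The same shuffle-and-split can be done in Python with a fixed seed for reproducibility. This is a sketch equivalent to the shuf/tr/head/tail pipeline above; split_train_valid is a hypothetical helper name.

```python
import random

def split_train_valid(lines, valid_size=1000, seed=1):
    """Shuffle labeled examples and carve off a validation set.

    Mirrors the shell pipeline above: strips carriage returns
    (like `tr -d "\r"`), shuffles, and takes the first valid_size
    lines as validation and the rest as training.
    """
    lines = [l.rstrip("\r\n") for l in lines]
    random.Random(seed).shuffle(lines)
    return lines[valid_size:], lines[:valid_size]  # (train, valid)

# Usage sketch; swap in the real file paths from the commands above:
# with open("../data_demo/data/train_data") as f:
#     train, valid = split_train_valid(f.readlines())
```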

Train the classifier using the pre‑trained checkpoint:

python run_classifier.py \
  --output_id2label_file="model/id2label.has_init" \
  --vocab_file="../data_demo/data/vocab" \
  --train_file="../data_demo/data/train_data_final" \
  --dev_file="../data_demo/data/valid_data_final" \
  --model_save_dir="model/finetune" \
  --lstm_dim=500 \
  --embedding_dim=1000 \
  --opt_type=adam \
  --batch_size=256 \
  --epoch=20 \
  --learning_rate=1e-4 \
  --seed=1 \
  --max_len=100 \
  --print_step=10 \
  --dropout_rate=0.1 \
  --layer_num=1 \
  --init_checkpoint="model/pretrain/lm_pretrain.ckpt-500000"

If you prefer not to use the pre‑trained model, omit the --init_checkpoint argument.

SPTM Model Prediction

Score the competition test set with the fine‑tuned classifier:

python run_prediction.py \
  --input_file="../data_demo/data/test_data" \
  --vocab_file="../data_demo/data/vocab" \
  --id2label_file="model/id2label.has_init" \
  --model_dir="model/finetune" > ../data_demo/data/result_test_raw

The output contains lines like __label__xx, where xx is the predicted standard question ID.
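Each output line therefore carries a __label__ prefix in its first field; a small helper can pull out the ID. This is a sketch: parse_label is a hypothetical name, and it assumes a line may carry extra comma- or pipe-separated fields after the label, which is why the separators mirror the awk pipeline used for the submission file below.

```python
def parse_label(line):
    """Extract the standard-question ID from a prediction line.

    A line such as '__label__42,0.9731' (label plus a score) yields '42'.
    """
    field = line.split(",")[0].split("|")[0]   # keep only the label field
    return field.split("__")[2]                # ['', 'label', '<id>'] -> '<id>'

print(parse_label("__label__42,0.9731"))
```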

Generating the Competition Submission File

Extract the extended question IDs and predicted standard IDs (run these commands from the data_demo/data directory, where test_data and result_test_raw live), then combine them:

awk '{print $2}' test_data > ext_id
awk -F',' '{print $1}' result_test_raw | \
  awk -F'|' '{print $1}' | \
  awk -F'__' '{print $3}' > std_id
echo ext_id,std_id > 58cop.csv
paste -d"," ext_id std_id >> 58cop.csv
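The awk/paste pipeline can also be expressed in Python. This is a sketch under the same assumptions as the shell commands: build_submission is a hypothetical helper, the extended-question ID is the second whitespace-separated field of each test_data line (awk's $2), and predictions start with __label__.

```python
import csv
import io

def build_submission(test_lines, result_lines):
    """Pair each extended-question ID with its predicted standard ID as CSV."""
    ext_ids = [line.split()[1] for line in test_lines]            # awk '{print $2}'
    std_ids = [line.split(",")[0].split("|")[0].split("__")[2]    # strip __label__
               for line in result_lines]
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["ext_id", "std_id"])
    writer.writerows(zip(ext_ids, std_ids))
    return out.getvalue()

# Usage sketch with the files produced above:
# with open("58cop.csv", "w", newline="") as f:
#     f.write(build_submission(open("test_data"), open("result_test_raw")))
```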

Upload the 58cop.csv file; the competition score achieved was 0.6424.

Author

Wang Yong, an AI Lab algorithm architect at 58.com, holds a master’s degree from Beijing Institute of Technology. He previously worked on video recommendation at Youku and now focuses on NLP algorithm research.

58 Tech is the official tech channel of 58.com, a platform for tech innovation, sharing, and communication.