
Guide to Using SPTM (Simple Pre-trained Model) with qa_match for an AI Competition

This article provides a step‑by‑step tutorial on preparing data, pre‑training the SPTM language model, fine‑tuning a text‑classification model, generating predictions, and creating a submission file for the 58.com AI algorithm competition using the open‑source qa_match toolkit.

58 Tech

Intelligent customer service built on AI relies on text matching and text classification as core NLP techniques. The open-source qa_match toolkit, which supports question answering over one- and two-layer knowledge bases, implements both and is the basis of this guide for the 58.com AI algorithm competition.

Background

The first 58.com AI algorithm competition has been announced, with 159 teams registered. This guide explains how to use the SPTM (Simple Pre-trained Model) component of qa_match for the contest.

Model and Data Preparation

1. Download SPTM source code

git clone https://github.com/wuba/qa_match.git

2. Enter the qa_match directory

cd qa_match

3. Download the competition data and unzip it into qa_match/data_demo, creating a data folder.

SPTM Model Pre‑training

Navigate to the sptm folder:

cd sptm

Create a directory for the pre‑trained model:

mkdir -p model/pretrain

Run the pre‑training script (TensorFlow 1.8–2.0, Python 3, GPU environment):

nohup python run_pretraining.py \
  --train_file="../data_demo/data/pre_train_data" \
  --vocab_file="../data_demo/data/vocab" \
  --model_save_dir="./model/pretrain" \
  --batch_size=512 \
  --print_step=100 \
  --weight_decay=0 \
  --embedding_dim=1000 \
  --lstm_dim=500 \
  --layer_num=1 \
  --train_step=100000 \
  --warmup_step=10000 \
  --learning_rate=5e-5 \
  --dropout_rate=0.1 \
  --max_predictions_per_seq=10 \
  --clip_norm=1.0 \
  --max_seq_len=100 \
  --use_queue=0 > pretrain.log 2>&1 &

The key parameters map to the flags above: vocab_file (vocabulary), train_file (pre-training corpus), lstm_dim (LSTM hidden size), embedding_dim (embedding size), dropout_rate, layer_num (number of LSTM layers), weight_decay, max_predictions_per_seq (masked positions per sequence), clip_norm (gradient clipping), and use_queue (whether to feed data through an input queue). After training completes successfully, the model checkpoint is saved under model/pretrain.
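The warmup_step and learning_rate flags describe a warmup schedule of the kind commonly used in pre-training: the learning rate ramps up linearly for the first warmup_step steps, then decays. The sketch below illustrates the idea with the values from the command above; lr_at_step is a hypothetical helper, and the exact schedule inside run_pretraining.py may differ.

```python
def lr_at_step(step, base_lr=5e-5, warmup_steps=10000, total_steps=100000):
    """Linear warmup to base_lr, then linear decay to zero.

    Illustrative only; run_pretraining.py may implement a different schedule.
    """
    if step < warmup_steps:
        # Ramp up: small learning rate early stabilizes pre-training.
        return base_lr * step / warmup_steps
    # Decay linearly over the remaining steps.
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

print(lr_at_step(5000))    # mid-warmup: half of base_lr
print(lr_at_step(10000))   # peak: base_lr
```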

Training the Classification Model

Split the provided train_data into training and validation sets:

shuf ../data_demo/data/train_data | tr -d "\r" > ../data_demo/data/train_data_shuf
head -n1000 ../data_demo/data/train_data_shuf > ../data_demo/data/valid_data_final
tail -n+1001 ../data_demo/data/train_data_shuf > ../data_demo/data/train_data_final
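The same shuffle-and-split can be done in Python with a fixed seed for reproducibility. This is a sketch equivalent to the shuf/tr/head/tail pipeline above; split_train_valid is a hypothetical helper name.

```python
import random

def split_train_valid(lines, valid_size=1000, seed=1):
    """Shuffle labeled examples and carve off a validation set.

    Mirrors the shell pipeline above: strips carriage returns
    (like `tr -d "\r"`), shuffles, and takes the first valid_size
    lines as validation and the rest as training.
    """
    lines = [l.rstrip("\r\n") for l in lines]
    random.Random(seed).shuffle(lines)
    return lines[valid_size:], lines[:valid_size]  # (train, valid)

# Usage sketch; swap in the real file paths from the commands above:
# with open("../data_demo/data/train_data") as f:
#     train, valid = split_train_valid(f.readlines())
```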

Train the classifier using the pre‑trained checkpoint:

python run_classifier.py \
  --output_id2label_file="model/id2label.has_init" \
  --vocab_file="../data_demo/data/vocab" \
  --train_file="../data_demo/data/train_data_final" \
  --dev_file="../data_demo/data/valid_data_final" \
  --model_save_dir="model/finetune" \
  --lstm_dim=500 \
  --embedding_dim=1000 \
  --opt_type=adam \
  --batch_size=256 \
  --epoch=20 \
  --learning_rate=1e-4 \
  --seed=1 \
  --max_len=100 \
  --print_step=10 \
  --dropout_rate=0.1 \
  --layer_num=1 \
  --init_checkpoint="model/pretrain/lm_pretrain.ckpt-500000"

If you prefer not to use the pre‑trained model, omit the --init_checkpoint argument.

SPTM Model Prediction

Score the competition test set with the fine‑tuned classifier:

python run_prediction.py \
  --input_file="../data_demo/data/test_data" \
  --vocab_file="../data_demo/data/vocab" \
  --id2label_file="model/id2label.has_init" \
  --model_dir="model/finetune" > ../data_demo/data/result_test_raw

The output contains lines like __label__xx, where xx is the predicted standard question ID.
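Each output line therefore carries a __label__ prefix in its first field; a small helper can pull out the ID. This is a sketch: parse_label is a hypothetical name, and it assumes a line may carry extra comma- or pipe-separated fields after the label, which is why the separators mirror the awk pipeline used for the submission file below.

```python
def parse_label(line):
    """Extract the standard-question ID from a prediction line.

    A line such as '__label__42,0.9731' (label plus a score) yields '42'.
    """
    field = line.split(",")[0].split("|")[0]   # keep only the label field
    return field.split("__")[2]                # ['', 'label', '<id>'] -> '<id>'

print(parse_label("__label__42,0.9731"))
```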

Generating the Competition Submission File

Extract the extended question IDs and predicted standard IDs (run these commands from the data_demo/data directory, where test_data and result_test_raw live), then combine them:

awk '{print $2}' test_data > ext_id
awk -F',' '{print $1}' result_test_raw | \
  awk -F'|' '{print $1}' | \
  awk -F'__' '{print $3}' > std_id
echo ext_id,std_id > 58cop.csv
paste -d"," ext_id std_id >> 58cop.csv
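The awk/paste pipeline can also be expressed in Python. This is a sketch under the same assumptions as the shell commands: build_submission is a hypothetical helper, the extended-question ID is the second whitespace-separated field of each test_data line (awk's $2), and predictions start with __label__.

```python
import csv
import io

def build_submission(test_lines, result_lines):
    """Pair each extended-question ID with its predicted standard ID as CSV."""
    ext_ids = [line.split()[1] for line in test_lines]            # awk '{print $2}'
    std_ids = [line.split(",")[0].split("|")[0].split("__")[2]    # strip __label__
               for line in result_lines]
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["ext_id", "std_id"])
    writer.writerows(zip(ext_ids, std_ids))
    return out.getvalue()

# Usage sketch with the files produced above:
# with open("58cop.csv", "w", newline="") as f:
#     f.write(build_submission(open("test_data"), open("result_test_raw")))
```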

Upload the 58cop.csv file; the competition score achieved was 0.6424.

Author

Wang Yong, an AI Lab algorithm architect at 58.com, holds a master’s degree from Beijing Institute of Technology. He previously worked on video recommendation at Youku and now focuses on NLP algorithm research.

58 Tech is the official tech channel of 58.com, a platform for tech innovation, sharing, and communication.