
How to Use spaCy to Auto‑Split and Classify Multi‑Platform Error Logs

This article explains how to preprocess diverse error messages from apps, mini‑programs, and browsers, then fine‑tune a small spaCy NLP model with data generated via ChatGPT to automatically split and categorize errors for alerting and workflow handling.

Goodme Frontend Team

Background

Previously I shared a post about how GuMing built a frontend data center, which included an error monitoring module.

In that module we collect errors from APPs, mini‑programs (WeChat, Alipay, DingTalk, Douyin, etc.), browsers, webviews, and other platforms.

Our mini‑programs are written with Taro, but the error Taro reports is a single string that merges the message and the stack trace.

Consequently we receive many different error formats, for example:

<code>MiniProgramError
Module build failed (from xxx.js):
SyntaxError: xxx.tsx: Unexpected token (180:12)

 178 |               { ...
 179 |                 isReady && indexFallback?.enableActivity &&
> 180 |               }
   |               ^
 181 |
 182 |                 <TaskSection position="index"/>
 183 |
    at toParseError (xxx.ts:74:19)
    at TypeScriptParserMixin.raise (xxx.ts:1490:19)
</code>
<code>USER_PAGE_NOT_FOUND_ERROR: Page[xxx] not found. May be caused by: 1. Forgot to add page route in app.json. 2. Invoking Page() in async task. construct@[native code]
 t@file:xxx.js:2:43094
 t@file:xxx.js:2:55180
 t@file:xxx.js:2:55373
 @file:xxx.js:2:524958
 OC@file:xxx.js:2:525081
</code>

Trying to match these with regular expressions is impractical because the formats are too varied.

How to Split

For alerting, ignoring, and workflow processing we need to consolidate errors of the same type instead of letting the message field scatter them.

For common errors such as TypeError or generic Error, we split them with regular expressions directly in the SDK.
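The regex path can be sketched as follows. This is a minimal illustration, not the SDK's actual rule set: conventional JS errors follow a `Name: message` first line followed by stack frames, so one pattern can peel the pieces apart, and anything that doesn't match falls through to the AI-based path.

```python
import re

# Hypothetical pattern for conventional "Name: message\n    at ..." errors
COMMON_ERROR_RE = re.compile(
    r"^(?P<name>[A-Za-z]*Error):\s*(?P<message>[^\n]*)\n?(?P<stack>[\s\S]*)$"
)

def split_common_error(raw: str):
    """Split a conventional JS error string into (name, message, stack).

    Returns None when the string doesn't match, so the caller can hand
    it to the model-based classifier instead.
    """
    m = COMMON_ERROR_RE.match(raw)
    if not m:
        return None
    return m.group("name"), m.group("message"), m.group("stack").strip()
```

For example, `split_common_error("TypeError: x is not a function\n    at foo (a.js:1:2)")` yields the name, message, and stack separately, while a Taro-style `MiniProgramError` string without the `Name:` prefix returns `None` and is routed to the model.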

For custom mini‑program errors and other platform‑specific messages we employ an AI‑based approach to recognize and categorize them.

Technical Solution

Many open‑source models can be fine‑tuned for this task, e.g., Hugging Face models, spaCy, etc.

In this case we use spaCy for model fine‑tuning.

spaCy provides a small pre‑trained English model, en_core_web_sm, which is lightweight, fast to load, and low on resource consumption.

Fine‑Tuning the Model

Preparing the Training Dataset

The most important step is preparing the training data.

Because I was a bit lazy, I let ChatGPT help generate the dataset and performed manual verification.

First, I exported tens of thousands of previously reported error logs from the database as a CSV file.

Then I fed the file to ChatGPT, instructing it how to transform the data.

After reviewing the data, I told ChatGPT how to classify any entries that could not be correctly categorized.

Finally I asked ChatGPT to export the prepared training dataset.
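The exported dataset is in spaCy's entity-annotation format: a list of `(text, {"entities": [(start, end, label), ...]})` tuples with character offsets. A hypothetical sample (the error text and labels here are illustrative, not taken from our real dataset) looks like this; computing the offsets from the parts avoids off-by-one mistakes:

```python
# One training sample in spaCy's entity-annotation format.
msg = "USER_PAGE_NOT_FOUND_ERROR: Page[home] not found."
stack = "t@file:app.js:2:43094"
text = msg + "\n " + stack

train_sample = (
    text,
    {
        "entities": [
            (0, len(msg), "MESSAGE"),              # human-readable message span
            (text.index(stack), len(text), "STACK"),  # stack-trace span
        ]
    },
)
```

The whole training file is then just a JSON array of such tuples, which is exactly what the training script below loads.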

Training the Model

Run the training code (simplified):

<code>import spacy
import random
from spacy.training.example import Example
import json

# Load spaCy‑formatted training data
with open("xxx.json", "r", encoding="utf-8") as f:
    train_data = json.load(f)

# Load or create a blank spaCy model
nlp = spacy.blank("en")
if "ner" not in nlp.pipe_names:
    ner = nlp.add_pipe("ner", last=True)
else:
    ner = nlp.get_pipe("ner")

# Add entity labels (e.g., "MESSAGE" or "STACK")
for _, annotations in train_data:
    for ent in annotations["entities"]:
        ner.add_label(ent[2])

# Prepare training examples
train_examples = []
for text, annotations in train_data:
    example = Example.from_dict(nlp.make_doc(text), annotations)
    train_examples.append(example)

optimizer = nlp.initialize()

n_iter = 10  # number of training epochs
for epoch in range(n_iter):
    random.shuffle(train_examples)
    losses = {}
    for batch in spacy.util.minibatch(train_examples, size=8):
        nlp.update(batch, sgd=optimizer, drop=0.35, losses=losses)
    print(f"Epoch {epoch+1}/{n_iter} - Losses: {losses}")

# Save the model
nlp.to_disk("/mnt/data/error_log_model")
print("Model saved to '/mnt/data/error_log_model'")
</code>

When the training log shows the loss decreasing steadily, the training is proceeding correctly. Continue training until the loss plateaus or reaches an acceptable range; a low loss indicates the model has learned sufficient features.

Testing the Model

<code>import spacy

# Load the trained model
MODEL_PATH = "error_log_model"  # replace with actual path
nlp = spacy.load(MODEL_PATH)

# Run the model on a held-out error string and inspect the predicted spans
text = "USER_PAGE_NOT_FOUND_ERROR: Page[xxx] not found.\n t@file:xxx.js:2:43094"
doc = nlp(text)
for ent in doc.ents:
    print(ent.label_, "->", ent.text)
</code>
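Beyond eyeballing individual predictions, it helps to score the model on a held-out set before deployment. A minimal span-level precision/recall check, written in plain Python rather than spaCy's built-in Scorer, might look like this (the `gold`/`pred` span format matches the `(start, end, label)` tuples used in training):

```python
def span_prf(gold, pred):
    """Precision/recall/F1 over exact (start, end, label) span matches."""
    gold_set, pred_set = set(gold), set(pred)
    tp = len(gold_set & pred_set)  # spans the model got exactly right
    p = tp / len(pred_set) if pred_set else 0.0
    r = tp / len(gold_set) if gold_set else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

Predicted spans come from the loaded model as `[(e.start_char, e.end_char, e.label_) for e in nlp(text).ents]`; comparing them against the gold annotations over the whole test set gives a deployment go/no-go number.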

Future Work

After deployment, we can improve throughput under concurrent load by caching classification results and applying other optimizations.
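The caching idea exploits the fact that identical raw error strings recur constantly in production, so memoizing the split result skips the NER pipeline on repeats. A sketch with the standard library (the function body is a stand-in; in the real service it would be the spaCy call, e.g. `tuple((e.label_, e.text) for e in nlp(raw).ents)`):

```python
from functools import lru_cache

@lru_cache(maxsize=10_000)
def split_cached(raw: str) -> tuple:
    """Memoized split: repeated identical errors hit the cache, not the model."""
    # Placeholder logic standing in for the spaCy pipeline call.
    first_line = raw.splitlines()[0] if raw else ""
    return (("MESSAGE", first_line),)
```

`lru_cache` requires the return value to be hashable, which is why the result is a tuple rather than a list; `split_cached.cache_info()` exposes hit/miss counts for monitoring.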

When new errors are identified, we can manually label them through the platform, feed the labeled data back into the training set, and continuously fine‑tune the model.

Python · AI · model fine-tuning · error monitoring · log classification · spaCy
Written by

Goodme Frontend Team

Regularly sharing the team's insights and expertise in the frontend field
