
Text Anti‑Spam Detection with TextCNN: From Traditional Methods to Online Deployment

This article introduces the challenges of text‑based spam on the Huajiao platform, reviews traditional rule‑based and machine‑learning classification methods, explains the TextCNN architecture for robust character‑level detection, and details its TensorFlow Serving deployment for real‑time anti‑spam services.

Huajiao Technology

1. Background

As the number of Huajiao users and streamers grows, malicious users exploit the platform's high traffic to spread spam through text, images, audio, and short video in user profiles and live chat. This harms the user experience and drives churn, and severely illegal content also exposes the platform to operational risk.

2. Problem Analysis

This article focuses on textual spam, which falls into four categories: advertising, pornographic content, violent or politically sensitive words, and competitor or other unwanted information. Early on, low‑volume, simple spam could be handled by manual review alone, but at scale rule‑based strategies and algorithmic models are needed to assist human moderators.

Simple spam can be filtered with keyword rules and regular expressions, yet spammers obfuscate text using pinyin, synonyms, homographs, emojis, or shuffled characters, making rule‑based filtering insufficient; thus precise algorithmic models are needed.
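As a sketch of why rules fall short, the hypothetical filter below (the pattern names and rules are illustrative, not Huajiao's actual rule set) catches literal keyword matches but is blind to homograph substitutions:

```python
import re

# Hypothetical rules for illustration only; production rule sets are far larger.
SPAM_PATTERNS = [
    re.compile(r"加\s*微\s*信"),      # "add me on WeChat", tolerating inserted spaces
    re.compile(r"(?i)free\s+gift"),
]

def rule_based_filter(text):
    """Return True if any rule matches, i.e. the text is flagged as spam."""
    return any(p.search(text) for p in SPAM_PATTERNS)
```

The literal pattern catches "加 微 信", but a homograph such as "加徽信" slips straight through; closing that gap is exactly what the model is for.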

Spam detection is a binary text classification task evaluated by accuracy, precision, recall, and F1‑score.
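All four metrics follow directly from the confusion‑matrix counts; a minimal pure‑Python sketch (label 1 = spam):

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for a binary task."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1
```

In anti‑spam, precision matters because false positives block legitimate users, while recall measures how much spam actually gets caught.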

3. Text Classification Algorithms

3.1 Traditional Methods

Traditional pipelines involve preprocessing (tokenization, stop‑word removal, disambiguation), feature extraction (Bag‑of‑Words, TF‑IDF, Word2Vec), and classifiers such as LR, SVM, MLP, GBDT.
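As an illustration of the feature‑extraction stage, here is a toy TF‑IDF computed by hand (real pipelines would use a library implementation and feed the resulting vectors into LR/SVM/GBDT):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Toy TF-IDF over pre-tokenized documents (lists of tokens).
    Returns one sparse dict {token: weight} per document."""
    n = len(docs)
    df = Counter()                          # document frequency per token
    for doc in docs:
        df.update(set(doc))
    idf = {w: math.log(n / df[w]) for w in df}
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({w: (tf[w] / len(doc)) * idf[w] for w in tf})
    return vecs
```

Rare, discriminative tokens receive higher weights than tokens shared across many documents, which is what makes the vectors useful to a downstream classifier.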

3.2 CNN‑Based Text Classification

Because users employ “fire‑star” characters (火星文, obfuscated “Martian language” text), emojis, and other non‑standard symbols, tokenization often fails. CNNs can instead operate directly on character‑level embeddings, with no tokenizer required, while still capturing local n‑gram features.

Bad‑case Example

Problems of Traditional Methods

Tokenization fails on unconventional characters.

Even when text can be tokenized, large‑scale in‑domain corpora for training word vectors are scarce.

Rule‑based filters cannot capture fire‑star features.

Therefore a model that treats each character as an atomic unit and learns sequential and semantic information is required.
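One way to realize “each character as an atomic unit” is a simple character‑to‑id vocabulary; the sketch below (function names and reserved ids are illustrative) pads with id 0 and maps unseen symbols to an unknown id:

```python
from collections import Counter

def build_char_vocab(texts, max_size=10000):
    """Map each character (including emoji and '火星文' symbols) to an id.
    Id 0 is reserved for padding, id 1 for unknown characters."""
    counts = Counter(ch for t in texts for ch in t)
    vocab = {"<pad>": 0, "<unk>": 1}
    for ch, _ in counts.most_common(max_size - 2):
        vocab[ch] = len(vocab)
    return vocab

def encode(text, vocab, sequence_length=50):
    """Truncate or pad a string into a fixed-length id sequence."""
    ids = [vocab.get(ch, 1) for ch in text[:sequence_length]]
    return ids + [0] * (sequence_length - len(ids))
```

The fixed‑length id sequences produced here are exactly the `input_x` tensor the TextCNN graph consumes.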

TextCNN Principle

TextCNN applies convolutional neural networks to NLP, avoiding tokenization by using character‑level embeddings, capturing local order and semantics, employing multiple filter sizes to extract n‑gram features, and achieving fast inference (<50 ms).

Model Structure

TextCNN stacks an embedding layer, convolutions with several filter sizes, max‑pooling, and a softmax output, and is trained with cross‑entropy loss for binary classification.

Convolution Layer

For a sentence of length n with embedding dimension k, the input forms an n × k matrix. A filter of size h × k slides over the rows of this matrix, producing a feature map of n − h + 1 values per filter.

Pooling Layer

Max‑pooling selects the maximum value within each feature map, reducing dimensionality while preserving salient information.
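The n − h + 1 output size and 1‑max pooling can be checked with a few lines of NumPy (a toy sketch: a single filter, no batch dimension):

```python
import numpy as np

n, k, h = 10, 8, 3            # sentence length, embedding dim, filter height
x = np.random.rand(n, k)      # character-embedding matrix for one sentence
w = np.random.rand(h, k)      # one convolution filter of size h x k

# "VALID" convolution along the sentence axis: one value per window position.
feature_map = np.array([np.sum(x[i:i + h] * w) for i in range(n - h + 1)])
assert feature_map.shape == (n - h + 1,)   # 10 - 3 + 1 = 8 values

# 1-max pooling keeps only the strongest n-gram response of this filter.
pooled = feature_map.max()
```

Because every filter is reduced to a single scalar, sentences of different lengths yield a fixed‑size feature vector after pooling.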

Softmax Output

The pooled features from all filter sizes are concatenated, passed through dropout and a fully connected layer, and a softmax turns the resulting scores into class probabilities.

TextCNN Implementation (TensorFlow)

#coding:utf-8
import tensorflow as tf
import numpy as np

class TextCNN(object):
    def __init__(self, sequence_length, num_classes, vocab_size, embedding_size,
                 filter_sizes, num_filters, l2_reg_lambda=0.0):
        self.input_x = tf.placeholder(tf.int32, [None, sequence_length], name="input_x")
        self.input_y = tf.placeholder(tf.float32, [None, num_classes], name="input_y")
        self.dropout_keep_prob = tf.placeholder(tf.float32, name="dropout_keep_prob")
        l2_loss = tf.constant(0.0)
        # Embedding: row 0 of the lookup table is zeroed out so that
        # padding id 0 contributes nothing to the convolution.
        with tf.device('/cpu:0'), tf.name_scope("embedding"):
            self.W = tf.get_variable('lookup_table',
                     dtype=tf.float32,
                     shape=[vocab_size, embedding_size],
                     initializer=tf.random_uniform_initializer())
            self.W = tf.concat((tf.zeros(shape=[1, embedding_size]), self.W[1:, :]), 0)
            self.embedded_chars = tf.nn.embedding_lookup(self.W, self.input_x)
            # Add a channel dimension so conv2d sees [batch, height, width, 1].
            self.embedded_chars_expanded = tf.expand_dims(self.embedded_chars, -1)
        # Convolution
        pooled_outputs = []
        for i, filter_size in enumerate(filter_sizes):
            with tf.name_scope("conv-maxpool-%s" % filter_size):
                filter_shape = [filter_size, embedding_size, 1, num_filters]
                W = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.1), name="W")
                b = tf.Variable(tf.constant(0.1, shape=[num_filters]), name="b")
                conv = tf.nn.conv2d(self.embedded_chars_expanded, W, strides=[1,1,1,1],
                                    padding="VALID", name="conv")
                h = tf.nn.relu(tf.nn.bias_add(conv, b), name="relu")
                pooled = tf.nn.max_pool(h, ksize=[1, sequence_length - filter_size + 1, 1, 1],
                                        strides=[1,1,1,1], padding='VALID', name="pool")
                pooled_outputs.append(pooled)
        num_filters_total = num_filters * len(filter_sizes)
        self.h_pool = tf.concat(pooled_outputs, 3)
        self.h_pool_flat = tf.reshape(self.h_pool, [-1, num_filters_total])
        # Dropout
        with tf.name_scope("dropout"):
            self.h_drop = tf.nn.dropout(self.h_pool_flat, self.dropout_keep_prob)
        # Output
        with tf.name_scope("output"):
            W = tf.get_variable("W", shape=[num_filters_total, num_classes],
                                 initializer=tf.contrib.layers.xavier_initializer())
            b = tf.Variable(tf.constant(0.1, shape=[num_classes]), name="b")
            l2_loss += tf.nn.l2_loss(W)
            l2_loss += tf.nn.l2_loss(b)
            self.scores = tf.nn.xw_plus_b(self.h_drop, W, b, name="scores")
            self.predictions = tf.argmax(self.scores, 1, name="predictions")
        # Loss
        with tf.name_scope("loss"):
            losses = tf.nn.softmax_cross_entropy_with_logits(logits=self.scores, labels=self.input_y)
            self.loss = tf.reduce_mean(losses) + l2_reg_lambda * l2_loss
        # Accuracy
        with tf.name_scope("accuracy"):
            correct_predictions = tf.equal(self.predictions, tf.argmax(self.input_y, 1))
            self.accuracy = tf.reduce_mean(tf.cast(correct_predictions, "float"), name="accuracy")

Training Results

Conclusion

In summary: traditional text‑classification pipelines are limited by tokenization and feature engineering, while deep‑learning approaches, especially TextCNN's character‑level convolution, are robust to obfuscated “fire‑star” text. Compared with word2vec‑based classifiers or RNNs, CNNs capture local order, parallelize well, and deliver the low latency (<3 ms per inference) required for real‑time spam detection.

4. Online Deployment Process

4.1 Service Architecture

The anti‑spam service consists of an online layer for millisecond‑level inference and an offline layer for model retraining and updates.

4.2 TensorFlow Serving Deployment

TensorFlow Serving provides a flexible, high‑performance system for serving trained models in production. It exposes a gRPC interface and supports hot model updates: new model versions dropped into the model directory are loaded automatically, without restarting the service.
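A minimal deployment sketch, assuming the trained graph has been exported as a SavedModel and using the official Docker image (all paths and the model name `textcnn` are assumptions, not the production setup):

```shell
# The trained graph is exported as a SavedModel under a numeric version dir:
#   /models/textcnn/1/saved_model.pb
#   /models/textcnn/1/variables/
# Then TensorFlow Serving is started from the official Docker image,
# exposing gRPC on 8500 and REST on 8501:
docker run -p 8500:8500 -p 8501:8501 \
    --mount type=bind,source=/models/textcnn,target=/models/textcnn \
    -e MODEL_NAME=textcnn -t tensorflow/serving
# Copying a new version dir (e.g. /models/textcnn/2/) triggers a hot update:
# the server loads version 2 and retires version 1 without a restart.
```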

4.3 Client Invocation

Clients call the service through gRPC, using the automatically generated stubs from the provided protobuf definitions (model.proto, predict.proto, prediction_service.proto).
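A client‑side sketch of such a call, assuming the `grpcio` and `tensorflow-serving-api` packages and that the names `textcnn`, `serving_default`, `input_x`, and `scores` match what was used at export time (they are assumptions here):

```python
import grpc
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

# Connect to the serving endpoint and build a stub from the generated protos.
channel = grpc.insecure_channel("localhost:8500")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

request = predict_pb2.PredictRequest()
request.model_spec.name = "textcnn"                  # assumed model name
request.model_spec.signature_name = "serving_default"

# One already-encoded sentence: character ids padded to sequence_length.
char_ids = np.zeros((1, 50), dtype=np.int32)
request.inputs["input_x"].CopyFrom(tf.make_tensor_proto(char_ids))

# Millisecond-level budget, so keep the deadline tight.
response = stub.Predict(request, timeout=1.0)        # seconds
scores = tf.make_ndarray(response.outputs["scores"])
```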

References

Kim Y. Convolutional neural networks for sentence classification[J]. arXiv preprint arXiv:1408.5882, 2014.

http://web.stanford.edu/class/cs224n/slides/cs224n-2019-lecture11-convnets.pdf

https://www.cnblogs.com/ljhdo/p/10578047.html

https://tensorflow.google.cn/tfx/serving/architecture

https://baike.baidu.com/item/火星文/608814

Tags: CNN, deep learning, TensorFlow, anti-spam, text classification
Written by

Huajiao Technology

The Huajiao Technology channel periodically shares the technology behind the Huajiao app, offering a place for tech enthusiasts to learn and exchange ideas.
