Text Anti‑Spam Detection with TextCNN: From Traditional Methods to Online Deployment
This article introduces the challenges of text‑based spam on the Huajiao platform, reviews traditional rule‑based and machine‑learning classification methods, explains the TextCNN architecture for robust character‑level detection, and details its TensorFlow Serving deployment for real‑time anti‑spam services.
1. Background
As the number of Huajiao users and streamers grows, malicious users exploit the platform's high traffic to spread spam through text, images, audio, and short video in user profiles and live-room chat. This harms the user experience, drives churn, and, where the content is illegal, exposes the platform to operational risk.
2. Problem Analysis
This article focuses on textual spam, which falls into four broad categories: advertising, pornographic content, violent or politically sensitive material, and competitor promotion or other unwanted information. Early on, low-volume, simple spam could be handled by manual review alone, but at scale, rule-based strategies and algorithmic models are needed to assist human moderators.
Simple spam can be filtered with keyword rules and regular expressions, yet spammers obfuscate text using pinyin, synonyms, homoglyphs, emojis, or shuffled characters, making rule-based filtering insufficient on its own; more precise algorithmic models are therefore needed.
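As a minimal sketch of why rule-based filtering breaks down, consider a hypothetical blocklist of spam keywords: the plain keyword is caught, but a trivially obfuscated variant slips through.

```python
import re

# A naive rule-based filter; the blocklist entries here are illustrative placeholders.
SPAM_PATTERNS = [re.compile(p) for p in [r"加微信", r"free\s*money"]]

def rule_based_is_spam(text: str) -> bool:
    """Return True if any blocklisted pattern matches the text."""
    return any(p.search(text) for p in SPAM_PATTERNS)

print(rule_based_is_spam("free money here"))      # the plain keyword matches
print(rule_based_is_spam("f r e e m0ney here"))   # obfuscated variant evades the rule
```

Each new obfuscation would require another hand-written rule, which is exactly the arms race that motivates a learned model.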
Spam detection is a binary text classification task evaluated by accuracy, precision, recall, and F1‑score.
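These four metrics follow directly from the confusion-matrix counts; a small self-contained helper (toy labels only, with 1 = spam) makes the definitions concrete:

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for a binary task (1 = spam)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Toy example: 3 true spam messages, the model catches 2 and raises 1 false alarm.
acc, p, r, f1 = binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

In an anti-spam setting, precision matters because false positives block legitimate users, while recall measures how much spam actually gets caught.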
3. Text Classification Algorithms
3.1 Traditional Methods
Traditional pipelines involve preprocessing (tokenization, stop‑word removal, disambiguation), feature extraction (Bag‑of‑Words, TF‑IDF, Word2Vec), and classifiers such as LR, SVM, MLP, GBDT.
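As a sketch of the feature-extraction step, TF-IDF can be computed in a few lines of plain Python over pre-tokenized documents; the resulting sparse vectors would then feed a classifier such as LR or SVM (the toy documents below are illustrative):

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute TF-IDF weight dicts for a list of tokenized documents."""
    df = Counter()                      # document frequency per term
    for doc in docs:
        df.update(set(doc))
    n = len(docs)
    vectors = []
    for doc in docs:
        tf = Counter(doc)               # raw term counts in this document
        vectors.append({w: (c / len(doc)) * math.log(n / df[w])
                        for w, c in tf.items()})
    return vectors

docs = [["buy", "cheap", "phone"], ["live", "stream", "chat"], ["buy", "now"]]
vecs = tfidf(docs)
# Terms shared across documents are down-weighted; rarer terms score higher.
```

Note the pipeline's dependence on tokenization: if the tokenizer cannot segment the input, no features are produced at all, which is the failure mode discussed next.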
3.2 CNN‑Based Text Classification
Because users employ “fire‑star” characters, emojis, and other non‑standard symbols, tokenization often fails. CNNs can operate on character‑level embeddings without tokenization, capturing local n‑gram features.
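Character-level input preparation is straightforward: every character, including emojis and "fire-star" symbols, maps to an integer id with no tokenizer in the loop. A minimal sketch (the vocabulary and the `<UNK>` convention here are illustrative, with id 0 reserved for padding):

```python
def chars_to_ids(text, vocab, max_len):
    """Map each character (emoji and 'fire-star' symbols included) to an integer id.
    Id 0 is reserved for padding; characters not in the vocabulary share an <UNK> id."""
    unk = vocab.get("<UNK>", 1)
    ids = [vocab.get(ch, unk) for ch in text[:max_len]]
    ids += [0] * (max_len - len(ids))   # pad to a fixed sequence length
    return ids

vocab = {"<PAD>": 0, "<UNK>": 1, "加": 2, "微": 3, "信": 4}
print(chars_to_ids("加V信😀", vocab, 6))   # [2, 1, 4, 1, 0, 0]
```

Even when obfuscated characters fall back to `<UNK>`, the surrounding character sequence is preserved, so the convolution filters can still pick up local patterns.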
Badcase Example
Problems of Traditional Methods
- Tokenization fails on unconventional characters.
- Even when the text can be tokenized, large-scale in-domain corpora for training word vectors are scarce.
- Rule-based filters cannot capture "fire-star" obfuscation features.
Therefore a model is needed that treats each character as an atomic unit and learns sequential and semantic information directly.
TextCNN Principle
TextCNN applies convolutional neural networks to NLP, avoiding tokenization by using character‑level embeddings, capturing local order and semantics, employing multiple filter sizes to extract n‑gram features, and achieving fast inference (<50 ms).
Model Structure
TextCNN uses cross‑entropy loss for binary classification, embedding layers, convolution with various filter sizes, max‑pooling, and a softmax output.
Convolution Layer
For a sentence of length n with embedding dimension k, the input forms an n × k matrix. A filter of size h × k slides over the matrix, and each filter produces a feature map of n − h + 1 values.
Pooling Layer
Max‑pooling selects the maximum value within each feature map, reducing dimensionality while preserving salient information.
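The n − h + 1 output size and the max-pooling step can be illustrated with a scalar toy example (in the real model each filter spans the full embedding dimension k; here one number stands in for each character):

```python
def conv1d_valid(x, w):
    """1-D 'VALID' convolution: a length-n input and size-h filter
    yield a feature map of n - h + 1 values."""
    h = len(w)
    return [sum(x[i + j] * w[j] for j in range(h)) for i in range(len(x) - h + 1)]

x = [0.1, 0.9, 0.4, 0.7, 0.2]   # n = 5 input positions
w = [1.0, -1.0, 1.0]            # h = 3 filter
feature_map = conv1d_valid(x, w)
print(len(feature_map))          # 5 - 3 + 1 = 3
print(max(feature_map))          # max-pooling keeps only the strongest response
```

Because max-pooling reduces each feature map to a single value, sentences of different lengths all collapse to a fixed-size vector, which is what allows the final fully connected layer to have a fixed shape.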
Softmax Output
The pooled features from all filter sizes are concatenated, passed through dropout and a fully connected layer, and a softmax over the two classes yields the spam probability.
TextCNN Implementation (TensorFlow)
#coding:utf-8
import tensorflow as tf
import numpy as np


class TextCNN(object):
    def __init__(self, sequence_length, num_classes, vocab_size, embedding_size,
                 filter_sizes, num_filters, l2_reg_lambda=0.0):
        self.input_x = tf.placeholder(tf.int32, [None, sequence_length], name="input_x")
        self.input_y = tf.placeholder(tf.float32, [None, num_classes], name="input_y")
        self.dropout_keep_prob = tf.placeholder(tf.float32, name="dropout_keep_prob")
        l2_loss = tf.constant(0.0)

        # Embedding layer; row 0 is zeroed so padding id 0 contributes nothing
        with tf.device('/cpu:0'), tf.name_scope("embedding"):
            self.W = tf.get_variable('lookup_table',
                                     dtype=tf.float32,
                                     shape=[vocab_size, embedding_size],
                                     initializer=tf.random_uniform_initializer())
            self.W = tf.concat((tf.zeros(shape=[1, embedding_size]), self.W[1:, :]), 0)
            self.embedded_chars = tf.nn.embedding_lookup(self.W, self.input_x)
            self.embedded_chars_expanded = tf.expand_dims(self.embedded_chars, -1)

        # Convolution + max-pooling for each filter size
        pooled_outputs = []
        for i, filter_size in enumerate(filter_sizes):
            with tf.name_scope("conv-maxpool-%s" % filter_size):
                filter_shape = [filter_size, embedding_size, 1, num_filters]
                W = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.1), name="W")
                b = tf.Variable(tf.constant(0.1, shape=[num_filters]), name="b")
                conv = tf.nn.conv2d(self.embedded_chars_expanded, W, strides=[1, 1, 1, 1],
                                    padding="VALID", name="conv")
                h = tf.nn.relu(tf.nn.bias_add(conv, b), name="relu")
                pooled = tf.nn.max_pool(h, ksize=[1, sequence_length - filter_size + 1, 1, 1],
                                        strides=[1, 1, 1, 1], padding='VALID', name="pool")
                pooled_outputs.append(pooled)
        num_filters_total = num_filters * len(filter_sizes)
        self.h_pool = tf.concat(pooled_outputs, 3)
        self.h_pool_flat = tf.reshape(self.h_pool, [-1, num_filters_total])

        # Dropout
        with tf.name_scope("dropout"):
            self.h_drop = tf.nn.dropout(self.h_pool_flat, self.dropout_keep_prob)

        # Output
        with tf.name_scope("output"):
            W = tf.get_variable("W", shape=[num_filters_total, num_classes],
                                initializer=tf.contrib.layers.xavier_initializer())
            b = tf.Variable(tf.constant(0.1, shape=[num_classes]), name="b")
            l2_loss += tf.nn.l2_loss(W)
            l2_loss += tf.nn.l2_loss(b)
            self.scores = tf.nn.xw_plus_b(self.h_drop, W, b, name="scores")
            self.predictions = tf.argmax(self.scores, 1, name="predictions")

        # Loss
        with tf.name_scope("loss"):
            losses = tf.nn.softmax_cross_entropy_with_logits(logits=self.scores, labels=self.input_y)
            self.loss = tf.reduce_mean(losses) + l2_reg_lambda * l2_loss

        # Accuracy
        with tf.name_scope("accuracy"):
            correct_predictions = tf.equal(self.predictions, tf.argmax(self.input_y, 1))
            self.accuracy = tf.reduce_mean(tf.cast(correct_predictions, "float"), name="accuracy")

Training Results
Conclusion
This section summarized traditional text classification pipelines, their limitations, and the advantages of deep-learning approaches, especially TextCNN's character-level convolution, which is robust to obfuscated "fire-star" text. CNNs are preferred over word2vec-based traditional pipelines and over RNNs because they capture local order, parallelize well, and achieve low latency (<3 ms), which suits real-time spam detection.
4. Online Deployment Process
4.1 Service Architecture
The anti‑spam service consists of an online layer for millisecond‑level inference and an offline layer for model retraining and updates.
4.2 TensorFlow Serving Deployment
TensorFlow Serving provides a flexible, high‑performance system for serving trained models in production, supporting hot updates via gRPC.
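As a sketch, one common way to stand up TensorFlow Serving is the official Docker image; the model directory and the model name `textcnn_antispam` below are placeholders for the exported SavedModel:

```shell
# Pull the official serving image and mount an exported SavedModel directory.
docker pull tensorflow/serving
docker run -p 8500:8500 \
  --mount type=bind,source=/models/textcnn_antispam,target=/models/textcnn_antispam \
  -e MODEL_NAME=textcnn_antispam -t tensorflow/serving
# Port 8500 exposes the gRPC endpoint; Serving polls the mounted directory for
# new version subfolders, enabling hot model updates without a service restart.
```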
4.3 Client Invocation
Clients call the service through gRPC, using the automatically generated stubs from the provided protobuf definitions (model.proto, predict.proto, prediction_service.proto).
References
Kim, Y. Convolutional Neural Networks for Sentence Classification. arXiv preprint arXiv:1408.5882, 2014.
http://web.stanford.edu/class/cs224n/slides/cs224n-2019-lecture11-convnets.pdf
https://www.cnblogs.com/ljhdo/p/10578047.html
https://tensorflow.google.cn/tfx/serving/architecture
https://baike.baidu.com/item/火星文/608814
Huajiao Technology
The Huajiao Technology channel shares the latest Huajiao app tech on an irregular basis, offering a learning and exchange platform for tech enthusiasts.