Naive Bayes Text Classification: Theory, Implementation, and Evaluation
This article explains the principles of Naive Bayes text classification, detailing feature representation, model selection, training and testing procedures, probability calculations, code implementation in Python, and evaluation metrics such as accuracy, precision, recall, PR and ROC curves.
In the field of machine learning, many algorithms exist for tasks such as recommendation, classification, regression, and clustering. Classification aims to assign an object X to a predefined category Y, for example labeling news articles as military, finance, or lifestyle.
The typical workflow for news classification includes: (1) feature representation (e.g., X = {yesterday, investment, market}); (2) feature selection (e.g., X = {domestic, investment, market}); (3) model selection (Naive Bayes classifier); (4) preparing training data; (5) training the model; (6) predicting the class of new instances; and (7) evaluating the results.
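The feature-representation step can be sketched as a simple bag-of-words conversion. This is a minimal illustration, not the article's code; the tokenizer and stop-word list are assumptions chosen for clarity:

```python
def to_features(text, stopwords={"the", "a", "of"}):
    """Represent a document as a bag-of-words feature set X."""
    words = text.lower().replace("\n", " ").split(" ")
    return {w for w in words if w and w not in stopwords}

X = to_features("Yesterday the investment market rose")
# X = {"yesterday", "investment", "market", "rose"}
```

Feature selection would then prune this set further, keeping only the most discriminative words.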
Machine‑learning experiments use separate training and test sets to assess a model’s generalization ability. The training set is used to learn the classifier, while the test set provides an unbiased estimate of performance by measuring error on unseen data.
Naive Bayes classification relies on the formula p(y_i|X) = p(X|y_i)·p(y_i) / p(X). Here p(X) is the probability of the input object (often ignored because it is constant), p(y_i) is the prior probability of class y_i, and p(X|y_i) is the likelihood of observing X given class y_i. Under the independence assumption, the likelihood factorises into the product of conditional probabilities for each feature.
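The classification rule above can be sketched in a few lines. The priors and conditional probabilities below are made-up illustrative values; log-probabilities are used because multiplying many small probabilities underflows:

```python
import math

# Illustrative (made-up) model parameters.
prior = {"military": 0.30, "technology": 0.24}
cond = {
    "military":   {"google": 0.05, "investment": 0.03},
    "technology": {"google": 0.20, "investment": 0.10},
}

def log_posterior(words, y):
    """log p(y) + sum_j log p(x_j|y): the naive independence assumption.
    p(X) is omitted because it is constant across classes."""
    s = math.log(prior[y])
    for w in words:
        s += math.log(cond[y][w])
    return s

doc = ["google", "investment"]
best = max(prior, key=lambda y: log_posterior(doc, y))
# best == "technology", since 0.24*0.20*0.10 > 0.30*0.05*0.03
```

Taking the argmax over classes gives the prediction without ever computing p(X).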
Example calculations illustrate how to obtain prior probabilities (e.g., p(y=military)=0.30, p(y=technology)=0.24) and conditional probabilities such as p(google|military)=0.05, p(investment|military)=0.03, etc., based on word frequencies in each class. Two common ways to estimate these conditional probabilities are (a) document‑frequency (number of documents containing the word) and (b) term‑frequency (total occurrences of the word in the class).
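The two estimation strategies can be contrasted on a tiny hypothetical corpus (the documents below are invented for illustration):

```python
# Hypothetical corpus for one class: each document is a list of tokens.
military_docs = [
    ["google", "army", "tank"],
    ["army", "army", "border"],
    ["google", "border", "tank"],
]

def p_word_docfreq(word, docs):
    """(a) document frequency: fraction of documents containing the word."""
    return sum(word in d for d in docs) / len(docs)

def p_word_termfreq(word, docs):
    """(b) term frequency: occurrences of the word over all tokens in the class."""
    total = sum(len(d) for d in docs)
    return sum(d.count(word) for d in docs) / total

p_word_docfreq("army", military_docs)   # 2/3: appears in 2 of 3 documents
p_word_termfreq("army", military_docs)  # 1/3: 3 occurrences out of 9 tokens
```

The two estimates differ whenever a word repeats within a document, as "army" does here.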
Naive Bayes is simple, effective, and yields probabilistic outputs suitable for both binary and multi‑class problems, but its strong independence assumption may be unrealistic for many real‑world datasets.
The following Python script demonstrates data preprocessing, token‑to‑ID conversion, and generation of training and test files. It reads files from a specified directory, assigns numeric tags to three categories (business=1, auto=2, sport=3), builds a vocabulary, and writes token IDs together with the original filename.
import sys
import os
import random

WordList = []
WordIDDic = {}
# fraction of files routed to the training split
TrainingPercent = 0.8
# input folder
inpath = sys.argv[1]
# output file base name
OutFileName = sys.argv[2]
trainOutFile = open(OutFileName + ".train", "w")
testOutFile = open(OutFileName + ".test", "w")

def ConvertData():
    i = 0
    for filename in os.listdir(inpath):
        # only process the three target categories
        if filename.find("business") != -1:
            tag = 1
        elif filename.find("auto") != -1:
            tag = 2
        elif filename.find("sport") != -1:
            tag = 3
        else:
            continue
        i += 1
        # randomly route each file to the training or test split
        outfile = trainOutFile if random.random() < TrainingPercent else testOutFile
        outfile.write(str(tag) + " ")
        with open(os.path.join(inpath, filename), "r",
                  encoding="utf-8", errors="ignore") as infile:
            content = infile.read().strip()
        words = content.replace('\n', ' ').split(' ')
        for word in words:
            if len(word.strip()) < 1:
                continue
            # assign the next numeric ID to each unseen word (IDs start at 1)
            if word not in WordIDDic:
                WordList.append(word)
                WordIDDic[word] = len(WordList)
            outfile.write(str(WordIDDic[word]) + " ")
        outfile.write("#" + filename + "\n")
    print(i, "files loaded!")
    print(len(WordList), "unique words found!")
ConvertData()
trainOutFile.close()
testOutFile.close()

A second script implements the Naive Bayes model: loading data, computing prior and conditional probabilities with Laplace smoothing, saving and loading the model, predicting classes for test instances, and evaluating performance.
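The Laplace smoothing used when computing conditional probabilities can be sketched as follows. This is a generic add-alpha estimator, not the article's exact `ComputeModel` code; the counts in the usage lines are illustrative:

```python
def p_laplace(word_count, class_total, vocab_size, alpha=1.0):
    """Laplace (add-alpha) smoothed estimate of p(word | class).

    Unseen words get a small non-zero probability instead of 0,
    so a single unknown word cannot zero out the whole posterior.
    """
    return (word_count + alpha) / (class_total + alpha * vocab_size)

p_laplace(0, 1000, 5000)   # unseen word: 1/6000, not 0
p_laplace(30, 1000, 5000)  # seen word: 31/6000
```

Without smoothing, any test document containing a word never seen in a class would drive that class's likelihood to exactly zero.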
# Usage: NB.py 1 TrainingDataFile ModelFile (train)
# Usage: NB.py 0 TestDataFile ModelFile OutFile (test)
import sys, os, math

DefaultFreq = 0.1
... (functions LoadData, ComputeModel, SaveModel, LoadModel, Predict, Evaluate, CalPreRec) ...

if len(sys.argv) < 4:
    print("Usage incorrect!")
elif sys.argv[1] == '1':
    # training flow
    LoadData()
    ComputeModel()
    SaveModel()
elif sys.argv[1] == '0':
    # testing flow
    LoadModel()
    TList, PList = Predict()
    Evaluate(TList, PList)
else:
    print("Usage incorrect!")

Model performance is assessed with a confusion matrix, from which accuracy, precision, and recall are derived. The article also discusses PR curves (precision‑recall trade‑off) and ROC curves (true‑positive vs. false‑positive rates), explaining how the area under the ROC curve (AUC) quantifies overall discriminative ability. Example calculations illustrate how different thresholds affect these metrics.
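The confusion-matrix metrics can be computed directly from true and predicted labels. This is a minimal binary-case sketch (the label vectors are illustrative, not from the article's dataset):

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, and recall from a 2x2 confusion matrix."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

acc, prec, rec = binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
# tp=2, fp=1, fn=1, tn=1 -> accuracy 0.6, precision 2/3, recall 2/3
```

Sweeping the decision threshold and recomputing precision/recall (or TPR/FPR) at each point is exactly what traces out the PR and ROC curves discussed above.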