Naive Bayes Text Classification: Theory, Implementation, and Evaluation
This article explains the principles of Naive Bayes text classification, detailing feature representation, model selection, training and testing procedures, probability calculations, code implementation in Python, and evaluation metrics such as accuracy, precision, recall, PR and ROC curves.
In the field of machine learning, many algorithms exist for tasks such as recommendation, classification, regression, and clustering. Classification aims to assign an object X to a predefined category Y, for example labeling news articles as military, finance, or lifestyle.
The typical workflow for news classification includes: (1) feature representation (e.g., X = {yesterday, investment, market}); (2) feature selection (e.g., X = {domestic, investment, market}); (3) model selection (Naive Bayes classifier); (4) preparing training data; (5) training the model; (6) predicting the class of new instances; and (7) evaluating the results.
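The feature-representation step can be sketched as a simple bag-of-words conversion. This is a minimal illustration, not the article's code; the tokenizer and stop-word list are assumptions chosen for clarity:

```python
def to_features(text, stopwords={"the", "a", "of"}):
    """Represent a document as a bag-of-words feature set X."""
    words = text.lower().replace("\n", " ").split(" ")
    return {w for w in words if w and w not in stopwords}

X = to_features("Yesterday the investment market rose")
# X = {"yesterday", "investment", "market", "rose"}
```

Feature selection would then prune this set further, keeping only the most discriminative words.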
Machine‑learning experiments use separate training and test sets to assess a model’s generalization ability. The training set is used to learn the classifier, while the test set provides an unbiased estimate of performance by measuring error on unseen data.
Naive Bayes classification relies on the formula p(y_i|X) = p(X|y_i)·p(y_i) / p(X). Here p(X) is the probability of the input object (often ignored because it is constant), p(y_i) is the prior probability of class y_i, and p(X|y_i) is the likelihood of observing X given class y_i. Under the independence assumption, the likelihood factorises into the product of conditional probabilities for each feature.
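The classification rule above can be sketched in a few lines. The priors and conditional probabilities below are made-up illustrative values; log-probabilities are used because multiplying many small probabilities underflows:

```python
import math

# Illustrative (made-up) model parameters.
prior = {"military": 0.30, "technology": 0.24}
cond = {
    "military":   {"google": 0.05, "investment": 0.03},
    "technology": {"google": 0.20, "investment": 0.10},
}

def log_posterior(words, y):
    """log p(y) + sum_j log p(x_j|y): the naive independence assumption.
    p(X) is omitted because it is constant across classes."""
    s = math.log(prior[y])
    for w in words:
        s += math.log(cond[y][w])
    return s

doc = ["google", "investment"]
best = max(prior, key=lambda y: log_posterior(doc, y))
# best == "technology", since 0.24*0.20*0.10 > 0.30*0.05*0.03
```

Taking the argmax over classes gives the prediction without ever computing p(X).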
Example calculations illustrate how to obtain prior probabilities (e.g., p(y=military)=0.30, p(y=technology)=0.24) and conditional probabilities such as p(google|military)=0.05, p(investment|military)=0.03, etc., based on word frequencies in each class. Two common ways to estimate these conditional probabilities are (a) document‑frequency (number of documents containing the word) and (b) term‑frequency (total occurrences of the word in the class).
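The two estimation strategies can be contrasted on a tiny hypothetical corpus (the documents below are invented for illustration):

```python
# Hypothetical corpus for one class: each document is a list of tokens.
military_docs = [
    ["google", "army", "tank"],
    ["army", "army", "border"],
    ["google", "border", "tank"],
]

def p_word_docfreq(word, docs):
    """(a) document frequency: fraction of documents containing the word."""
    return sum(word in d for d in docs) / len(docs)

def p_word_termfreq(word, docs):
    """(b) term frequency: occurrences of the word over all tokens in the class."""
    total = sum(len(d) for d in docs)
    return sum(d.count(word) for d in docs) / total

p_word_docfreq("army", military_docs)   # 2/3: appears in 2 of 3 documents
p_word_termfreq("army", military_docs)  # 1/3: 3 occurrences out of 9 tokens
```

The two estimates differ whenever a word repeats within a document, as "army" does here.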
Naive Bayes is simple, effective, and yields probabilistic outputs suitable for both binary and multi‑class problems, but its strong independence assumption may be unrealistic for many real‑world datasets.
The following Python script demonstrates data preprocessing, token‑to‑ID conversion, and generation of training and test files. It reads files from a specified directory, assigns numeric tags to three categories (business=1, auto=2, sport=3), builds a vocabulary, and writes token IDs together with the original filename.
import sys
import os
import random

WordList = []
WordIDDic = {}
# fraction of files routed to the training split
TrainingPercent = 0.8
# input folder
inpath = sys.argv[1]
# output file base name
OutFileName = sys.argv[2]
trainOutFile = open(OutFileName + ".train", "w")
testOutFile = open(OutFileName + ".test", "w")

def ConvertData():
    i = 0
    for filename in os.listdir(inpath):
        # only process the three target categories
        if filename.find("business") != -1:
            tag = 1
        elif filename.find("auto") != -1:
            tag = 2
        elif filename.find("sport") != -1:
            tag = 3
        else:
            continue
        i += 1
        # randomly route each file to the training or test split
        outfile = trainOutFile if random.random() < TrainingPercent else testOutFile
        outfile.write(str(tag) + " ")
        with open(os.path.join(inpath, filename), "r",
                  encoding="utf-8", errors="ignore") as infile:
            content = infile.read().strip()
        words = content.replace('\n', ' ').split(' ')
        for word in words:
            if len(word.strip()) < 1:
                continue
            # assign the next numeric ID to each unseen word (IDs start at 1)
            if word not in WordIDDic:
                WordList.append(word)
                WordIDDic[word] = len(WordList)
            outfile.write(str(WordIDDic[word]) + " ")
        outfile.write("#" + filename + "\n")
    print(i, "files loaded!")
    print(len(WordList), "unique words found!")
ConvertData()
trainOutFile.close()
testOutFile.close()

A second script implements the Naive Bayes model: loading data, computing prior and conditional probabilities with Laplace smoothing, saving and loading the model, predicting classes for test instances, and evaluating performance.
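The Laplace smoothing used when computing conditional probabilities can be sketched as follows. This is a generic add-alpha estimator, not the article's exact `ComputeModel` code; the counts in the usage lines are illustrative:

```python
def p_laplace(word_count, class_total, vocab_size, alpha=1.0):
    """Laplace (add-alpha) smoothed estimate of p(word | class).

    Unseen words get a small non-zero probability instead of 0,
    so a single unknown word cannot zero out the whole posterior.
    """
    return (word_count + alpha) / (class_total + alpha * vocab_size)

p_laplace(0, 1000, 5000)   # unseen word: 1/6000, not 0
p_laplace(30, 1000, 5000)  # seen word: 31/6000
```

Without smoothing, any test document containing a word never seen in a class would drive that class's likelihood to exactly zero.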
# Usage: NB.py 1 TrainingDataFile ModelFile (train)
# Usage: NB.py 0 TestDataFile ModelFile OutFile (test)
import sys, os, math

DefaultFreq = 0.1
... (functions LoadData, ComputeModel, SaveModel, LoadModel, Predict, Evaluate, CalPreRec) ...

if len(sys.argv) < 4:
    print("Usage incorrect!")
elif sys.argv[1] == '1':
    # training flow
    LoadData()
    ComputeModel()
    SaveModel()
elif sys.argv[1] == '0':
    # testing flow
    LoadModel()
    TList, PList = Predict()
    Evaluate(TList, PList)
else:
    print("Usage incorrect!")

Model performance is assessed with a confusion matrix, from which accuracy, precision, and recall are derived. The article also discusses PR curves (precision‑recall trade‑off) and ROC curves (true‑positive vs. false‑positive rates), explaining how the area under the ROC curve (AUC) quantifies overall discriminative ability. Example calculations illustrate how different thresholds affect these metrics.
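The confusion-matrix metrics can be computed directly from true and predicted labels. This is a minimal binary-case sketch (the label vectors are illustrative, not from the article's dataset):

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, and recall from a 2x2 confusion matrix."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

acc, prec, rec = binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
# tp=2, fp=1, fn=1, tn=1 -> accuracy 0.6, precision 2/3, recall 2/3
```

Sweeping the decision threshold and recomputing precision/recall (or TPR/FPR) at each point is exactly what traces out the PR and ROC curves discussed above.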