Master Naive Bayes: From Theory to Python Text Classification
This article introduces the Naive Bayes classifier, derives the probability formulas behind it (conditional probability, the law of total probability, and Bayes’ theorem), explains the feature independence assumption and Laplace smoothing, and demonstrates both a manual and a scikit‑learn implementation of email and text classification in Python.
Introduction
Recently the author started learning machine‑learning concepts and chose Naive Bayes as the first topic.
What is Naive Bayes?
Naive Bayes is a classification method based on Bayes’ theorem and the assumption that features are conditionally independent. Given a training set, it learns the joint probability distribution of inputs and outputs; for a new input x, it selects the class y with the highest posterior probability.
Bayes’ Biography
Thomas Bayes (1701‑1761) was an English mathematician and Presbyterian minister whose work on inverse probability laid the foundation for what is now known as Bayes’ theorem.
Algorithm Principles
Conditional probability formula
<code>P(A|B) = P(A∧B) / P(B)
P(B|A) = P(A∧B) / P(A)
P(A|B) = P(B|A) * P(A) / P(B)</code>

Total probability formula

<code>P(B) = Σ P(A_i ∧ B) = Σ P(B|A_i) * P(A_i)</code>

Bayes formula

<code>P(A_k|B) = [P(A_k) * P(B|A_k)] / Σ P(B|A_i) * P(A_i)</code>

Since the denominator is the same for every class, it can be omitted, yielding the simplified decision rule:

<code>P(A_k|B) ∝ P(A_k) * P(B|A_k)</code>

Here P(A_k) is the prior, P(B|A_k) the likelihood, and P(A_k|B) the posterior.
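As a quick numerical check, the decision rule can be applied directly. The spam-filtering numbers below (class prior and word likelihood) are made up for illustration:

```python
# Hypothetical numbers: 30% of mail is spam, and a given word
# appears in 60% of spam but only 5% of normal mail.
p_spam, p_normal = 0.3, 0.7
p_word_given_spam, p_word_given_normal = 0.6, 0.05

# Unnormalized posteriors: P(class|word) ∝ P(class) * P(word|class)
score_spam = p_spam * p_word_given_spam        # 0.18
score_normal = p_normal * p_word_given_normal  # 0.035

# The full posterior divides by the total probability P(word),
# i.e. the denominator of the Bayes formula above.
p_word = score_spam + score_normal
posterior_spam = score_spam / p_word
print(f"P(spam|word) = {posterior_spam:.3f}")  # prints "P(spam|word) = 0.837"
```

Note that the comparison between classes is already decided by the unnormalized scores (0.18 vs 0.035); the denominator only rescales them.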
Feature Independence Assumption
For a classification problem with an instance x = (x₁,…,x_n) and classes y = (y₁,…,y_k), the conditional independence assumption leads to:
<code>P(y_k|x) ∝ P(y_k) * ∏ P(x_i|y_k)</code>

Laplace Smoothing
To avoid zero probabilities, Laplace smoothing adds one to each count:
<code>P(y) = (|D_y| + 1) / (|D| + N)
P(x_i|y) = (|D_{y,x_i}| + 1) / (|D_y| + N_i)</code>

where N is the number of classes and N_i is the number of distinct values feature x_i can take.

Manual Text Classification Example

1. Tokenize the labeled emails and build word‑frequency vectors.
2. Compute the priors P(Spam) and P(Normal).
3. For a new mail, calculate P(Spam|mail) ∝ P(Spam) * ∏ P(w_i|Spam), and P(Normal|mail) similarly.
4. Choose the class with the larger posterior probability.
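The four steps above can be sketched directly in Python. This is a minimal illustration rather than a production classifier: the tiny training set and the whitespace tokenizer are assumptions, while the Laplace smoothing follows the formulas given earlier:

```python
from collections import Counter

# Tiny hand-labeled training set (made up for illustration).
train = [
    ("win cash prize now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting schedule for monday", "normal"),
    ("project report attached", "normal"),
]

vocab = {w for text, _ in train for w in text.split()}
classes = {label for _, label in train}
word_counts = {c: Counter() for c in classes}
doc_counts = Counter()
for text, label in train:
    doc_counts[label] += 1
    word_counts[label].update(text.split())

def posterior_score(text, c):
    # P(c|text) ∝ P(c) * ∏ P(w|c), with Laplace (add-one) smoothing
    # on both the prior and the per-word likelihoods.
    prior = (doc_counts[c] + 1) / (len(train) + len(classes))
    total = sum(word_counts[c].values())
    score = prior
    for w in text.split():
        score *= (word_counts[c][w] + 1) / (total + len(vocab))
    return score

def classify(text):
    return max(classes, key=lambda c: posterior_score(text, c))

print(classify("claim your free prize"))   # spam
print(classify("monday project meeting"))  # normal
```

Because "your" never appears in the training set, its smoothed likelihood is small but nonzero for both classes; without smoothing, a single unseen word would zero out every posterior.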
Implementation with scikit‑learn
<code># sklearn implementation of text classification
import os
import random

import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB


def get_dataset():
    # Read the negative and positive review files and label them.
    data = []
    for root, dirs, files in os.walk(r'./mix20_rand700_tokens_cleaned/tokens/neg'):
        for file in files:
            realpath = os.path.join(root, file)
            with open(realpath, errors='ignore') as f:
                data.append((f.read(), 'bad'))
    for root, dirs, files in os.walk(r'./mix20_rand700_tokens_cleaned/tokens/pos'):
        for file in files:
            realpath = os.path.join(root, file)
            with open(realpath, errors='ignore') as f:
                data.append((f.read(), 'good'))
    random.shuffle(data)
    return data


def train_and_test_data(data_):
    # 70/30 train/test split.
    filesize = int(0.7 * len(data_))
    train_data_ = [d[0] for d in data_[:filesize]]
    train_target_ = [d[1] for d in data_[:filesize]]
    test_data_ = [d[0] for d in data_[filesize:]]
    test_target_ = [d[1] for d in data_[filesize:]]
    return train_data_, train_target_, test_data_, test_target_


def mnb(train_da, train_tar, test_da, test_tar):
    # TF-IDF features + multinomial Naive Bayes; returns test accuracy.
    nbc = Pipeline([('vect', TfidfVectorizer()),
                    ('clf', MultinomialNB(alpha=1.0))])
    nbc.fit(train_da, train_tar)
    predict = nbc.predict(test_da)
    return sum(p == t for p, t in zip(predict, test_tar)) / len(test_tar)


def bnb(train_da, train_tar, test_da, test_tar):
    # TF-IDF features + Bernoulli Naive Bayes; returns test accuracy.
    nbc_1 = Pipeline([('vect', TfidfVectorizer()),
                      ('clf', BernoulliNB(alpha=1.0))])
    nbc_1.fit(train_da, train_tar)
    predict = nbc_1.predict(test_da)
    return sum(p == t for p, t in zip(predict, test_tar)) / len(test_tar)


# Repeat the experiment 10 times on reshuffled splits and plot both accuracies.
x = range(10)
y1, y2 = [], []
for i in x:
    data = get_dataset()
    train_data, train_target, test_data, test_target = train_and_test_data(data)
    y1.append(mnb(train_data, train_target, test_data, test_target))
    y2.append(bnb(train_data, train_target, test_data, test_target))

plt.plot(x, y1, lw=2, label='MultinomialNB')
plt.plot(x, y2, lw=2, label='BernoulliNB')
plt.legend(loc='upper right')
plt.ylim(0, 1)
plt.grid(True)
plt.show()</code>

Result Comparison
Conclusion
Scikit‑learn (sklearn) is a widely used Python library that unifies many machine‑learning algorithms; mastering one model, such as Naive Bayes, enables rapid application to various classification tasks.
360 Zhihui Cloud Developer
360 Zhihui Cloud is an enterprise open service platform that aims to "aggregate data value and empower an intelligent future," leveraging 360's extensive product and technology resources to deliver platform services to customers.