Master Naive Bayes: From Theory to Python Text Classification
This article introduces the Naive Bayes classifier, derives the probability formulas behind it (conditional probability, the law of total probability, and Bayes’ theorem), explains the feature independence assumption and Laplace smoothing, and demonstrates both a manual and a scikit‑learn implementation of email and text classification in Python.
Introduction
Recently the author started learning machine‑learning concepts and chose Naive Bayes as the first topic.
What is Naive Bayes?
Naive Bayes is a classification method based on Bayes’ theorem and the assumption that features are conditionally independent. Given a training set, it learns the joint probability distribution of inputs and outputs; for a new input x, it selects the class y with the highest posterior probability.
Bayes’ Biography
Thomas Bayes (1701‑1761) was an English mathematician and Presbyterian minister whose work on inverse probability laid the foundation for what is now known as Bayes’ theorem.
Algorithm Principles
Conditional probability formula
<code>P(A|B) = P(A∧B) / P(B)
P(B|A) = P(A∧B) / P(A)
P(A|B) = P(B|A) * P(A) / P(B)</code>

Total probability formula

<code>P(B) = Σ P(A_i ∧ B) = Σ P(B|A_i) * P(A_i)</code>

Bayes formula

<code>P(A_k|B) = [P(A_k) * P(B|A_k)] / Σ P(B|A_i) * P(A_i)</code>

Since the denominator is the same for every class, it can be omitted, yielding the simplified decision rule:

<code>P(A_k|B) ∝ P(A_k) * P(B|A_k)</code>

Here P(A_k) is the prior, P(B|A_k) the likelihood, and P(A_k|B) the posterior.
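As a quick numerical check, the decision rule can be applied directly. The spam-filtering numbers below (class prior and word likelihood) are made up for illustration:

```python
# Hypothetical numbers: 30% of mail is spam, and a given word
# appears in 60% of spam but only 5% of normal mail.
p_spam, p_normal = 0.3, 0.7
p_word_given_spam, p_word_given_normal = 0.6, 0.05

# Unnormalized posteriors: P(class|word) ∝ P(class) * P(word|class)
score_spam = p_spam * p_word_given_spam        # 0.18
score_normal = p_normal * p_word_given_normal  # 0.035

# The full posterior divides by the total probability P(word),
# i.e. the denominator of the Bayes formula above.
p_word = score_spam + score_normal
posterior_spam = score_spam / p_word
print(f"P(spam|word) = {posterior_spam:.3f}")  # prints "P(spam|word) = 0.837"
```

Note that the comparison between classes is already decided by the unnormalized scores (0.18 vs 0.035); the denominator only rescales them.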
Feature Independence Assumption
For a classification problem with an instance x = (x₁,…,x_n) and classes y = (y₁,…,y_k), the conditional independence assumption leads to:
<code>P(y_k|x) ∝ P(y_k) * ∏ P(x_i|y_k)</code>

Laplace Smoothing
To avoid zero probabilities, Laplace smoothing adds one to each count:
<code>P(y) = (|D_y| + 1) / (|D| + N)
P(x_i|y) = (|D_{y,x_i}| + 1) / (|D_y| + N_i)</code>

where N is the number of classes and N_i is the number of distinct values feature x_i can take.

Manual Text Classification Example

1. Tokenize the labeled emails and build word‑frequency vectors.
2. Compute the priors P(Spam) and P(Normal).
3. For a new mail, calculate P(Spam|mail) ∝ P(Spam) * ∏ P(w_i|Spam), and P(Normal|mail) similarly.
4. Choose the class with the larger posterior probability.
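The four steps above can be sketched directly in Python. This is a minimal illustration rather than a production classifier: the tiny training set and the whitespace tokenizer are assumptions, while the Laplace smoothing follows the formulas given earlier:

```python
from collections import Counter

# Tiny hand-labeled training set (made up for illustration).
train = [
    ("win cash prize now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting schedule for monday", "normal"),
    ("project report attached", "normal"),
]

vocab = {w for text, _ in train for w in text.split()}
classes = {label for _, label in train}
word_counts = {c: Counter() for c in classes}
doc_counts = Counter()
for text, label in train:
    doc_counts[label] += 1
    word_counts[label].update(text.split())

def posterior_score(text, c):
    # P(c|text) ∝ P(c) * ∏ P(w|c), with Laplace (add-one) smoothing
    # on both the prior and the per-word likelihoods.
    prior = (doc_counts[c] + 1) / (len(train) + len(classes))
    total = sum(word_counts[c].values())
    score = prior
    for w in text.split():
        score *= (word_counts[c][w] + 1) / (total + len(vocab))
    return score

def classify(text):
    return max(classes, key=lambda c: posterior_score(text, c))

print(classify("claim your free prize"))   # spam
print(classify("monday project meeting"))  # normal
```

Because "your" never appears in the training set, its smoothed likelihood is small but nonzero for both classes; without smoothing, a single unseen word would zero out every posterior.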
Implementation with scikit‑learn
<code># sklearn implementation of text classification
import os
import random

import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB


def get_dataset():
    # Read the negative and positive review files and label them.
    data = []
    for root, dirs, files in os.walk(r'./mix20_rand700_tokens_cleaned/tokens/neg'):
        for file in files:
            realpath = os.path.join(root, file)
            with open(realpath, errors='ignore') as f:
                data.append((f.read(), 'bad'))
    for root, dirs, files in os.walk(r'./mix20_rand700_tokens_cleaned/tokens/pos'):
        for file in files:
            realpath = os.path.join(root, file)
            with open(realpath, errors='ignore') as f:
                data.append((f.read(), 'good'))
    random.shuffle(data)
    return data


def train_and_test_data(data_):
    # 70/30 train/test split.
    filesize = int(0.7 * len(data_))
    train_data_ = [d[0] for d in data_[:filesize]]
    train_target_ = [d[1] for d in data_[:filesize]]
    test_data_ = [d[0] for d in data_[filesize:]]
    test_target_ = [d[1] for d in data_[filesize:]]
    return train_data_, train_target_, test_data_, test_target_


def mnb(train_da, train_tar, test_da, test_tar):
    # TF-IDF features + multinomial Naive Bayes; returns test accuracy.
    nbc = Pipeline([('vect', TfidfVectorizer()),
                    ('clf', MultinomialNB(alpha=1.0))])
    nbc.fit(train_da, train_tar)
    predict = nbc.predict(test_da)
    return sum(p == t for p, t in zip(predict, test_tar)) / len(test_tar)


def bnb(train_da, train_tar, test_da, test_tar):
    # TF-IDF features + Bernoulli Naive Bayes; returns test accuracy.
    nbc_1 = Pipeline([('vect', TfidfVectorizer()),
                      ('clf', BernoulliNB(alpha=1.0))])
    nbc_1.fit(train_da, train_tar)
    predict = nbc_1.predict(test_da)
    return sum(p == t for p, t in zip(predict, test_tar)) / len(test_tar)


# Repeat the experiment 10 times on reshuffled splits and plot both accuracies.
x = range(10)
y1, y2 = [], []
for i in x:
    data = get_dataset()
    train_data, train_target, test_data, test_target = train_and_test_data(data)
    y1.append(mnb(train_data, train_target, test_data, test_target))
    y2.append(bnb(train_data, train_target, test_data, test_target))

plt.plot(x, y1, lw=2, label='MultinomialNB')
plt.plot(x, y2, lw=2, label='BernoulliNB')
plt.legend(loc='upper right')
plt.ylim(0, 1)
plt.grid(True)
plt.show()</code>

Result Comparison
Conclusion
Scikit‑learn (sklearn) is a widely used Python library that unifies many machine‑learning algorithms; mastering one model, such as Naive Bayes, enables rapid application to various classification tasks.
360 Zhihui Cloud Developer
360 Zhihui Cloud is an enterprise open service platform that aims to "aggregate data value and empower an intelligent future," leveraging 360's extensive product and technology resources to deliver platform services to customers.