Introduction to Naive Bayes Classifier with scikit-learn
This article introduces the Naive Bayes classification algorithm, explains its theoretical basis, demonstrates how to use scikit-learn's GaussianNB class with Python code, evaluates model performance, and discusses advantages, limitations, and practical examples of the method.
1. Overview
This article explains the Naive Bayes classification algorithm, covering its definition, the scikit-learn library support, a practical example, the underlying theory, and a summary, aiming to help readers understand both usage and principles.
2. References
Bayesian classifier – Wikipedia
3. Introduction to Naive Bayes Classifier
Bayesian classification is based on Bayes' theorem; the Naive Bayes variant is the simplest and most common approach. It is intuitive, computationally cheap, and widely applied across many domains.
4. Naive Bayes Library Overview
The GaussianNB class implements a Gaussian Naive Bayes classifier. Its main parameter is priors, representing the prior class probabilities. If not provided, priors are estimated as mk/m, where m is the total number of training samples and mk is the number belonging to class k.
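As a quick sketch of how the priors parameter behaves (the toy dataset below is illustrative, not from the article): if priors is omitted, GaussianNB estimates it from class frequencies mk/m, and an explicit priors list overrides that estimate.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Tiny illustrative dataset: two features, two balanced classes
X = np.array([[1.0, 2.0], [1.2, 1.8], [3.0, 4.0], [3.2, 4.1]])
y = np.array([0, 0, 1, 1])

# Default: priors estimated from class frequencies (mk/m)
clf_default = GaussianNB()
clf_default.fit(X, y)
print(clf_default.class_prior_)  # [0.5 0.5]

# Explicit priors override the estimated frequencies
clf_prior = GaussianNB(priors=[0.8, 0.2])
clf_prior.fit(X, y)
print(clf_prior.class_prior_)  # [0.8 0.2]
```

The fitted priors are exposed through the class_prior_ attribute, which makes it easy to confirm which values the model actually used.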
5. Example Application
1. Import Packages
#coding=utf-8
import numpy as np
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB
2. Load Sample Data
# Use the built‑in iris dataset
iris = datasets.load_iris()
X = iris.data[:, [0, 2]]
Y = iris.target
3. Split Training and Test Sets
# Randomly permute indices
shuffled_indices = np.random.permutation(len(X))
test_set_size = int(len(X) * 0.25)
test_indices = shuffled_indices[:test_set_size]
train_indices = shuffled_indices[test_set_size:]
train_sample = X[train_indices]
train_result = Y[train_indices]
test_sample = X[test_indices]
test_result = Y[test_indices]
4. Train the Model
# Initialize the classifier
clf = GaussianNB()
# Fit the training data
clf.fit(train_sample, train_result)
5. Predict on Test Data
# Generate predictions
predict_data = clf.predict(test_sample)
6. Show Results
print(predict_data)
print(test_result)
Sample output:
[1 2 0 2 1 0 0 1 1 1 0 1 0 1 1 1 1 1 0 1 0 2 0 1 2 2 0 1 2 1 1 0 2 1 0 0 1]
[1 2 0 2 1 0 0 1 2 1 0 2 0 1 1 1 1 1 0 1 0 2 0 1 2 2 0 1 2 1 1 0 2 1 0 0 1]
7. Compute Accuracy
# Count mismatches between predictions and true labels
diff = 0.0
for num in range(0, len(predict_data)):
    if predict_data[num] != test_result[num]:
        diff = diff + 1
rate = diff / len(predict_data)
print(1 - rate)
Accuracy obtained: 0.945945945946
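The manual split and accuracy loop above can also be expressed with scikit-learn's own helpers. A minimal sketch using train_test_split and accuracy_score, which the original example does not use (the exact accuracy will differ from 0.9459 because the split is different):

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()
X = iris.data[:, [0, 2]]  # same two features as the example above
y = iris.target

# A fixed random_state makes the split reproducible across runs
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = GaussianNB()
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))
```

Using a fixed random_state avoids the run-to-run variation of np.random.permutation, which is handy when comparing classifiers.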
6. Theory of Naive Bayes
Classification is a common everyday task. Naive Bayes computes the posterior probability P(class | features) by assuming feature independence and applying Bayes' theorem.
7. Example Problem Analysis
Given a small dataset of 12 records (6 "marry", 6 "not marry"), we calculate prior and conditional probabilities for four features (not handsome, bad personality, short, not ambitious) and compute the posterior probabilities for both outcomes. The calculations show that P(not marry | features) > P(marry | features), so the classifier would predict "not marry".
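The comparison above can be sketched in a few lines. The priors (6 of 12 records per class) come from the article, but the conditional probabilities below are illustrative placeholders, not the article's actual counts:

```python
# Priors from the article: 6 "marry" and 6 "not marry" out of 12 records
p_marry, p_not = 0.5, 0.5

# P(feature | class) for: not handsome, bad personality, short, not ambitious
# NOTE: these values are hypothetical, chosen only to illustrate the method
cond_marry = [1/2, 1/6, 1/6, 1/6]
cond_not = [1/3, 1/2, 1/2, 1/2]

# Naive Bayes score: prior times the product of per-feature likelihoods
score_marry = p_marry
for p in cond_marry:
    score_marry *= p
score_not = p_not
for p in cond_not:
    score_not *= p

# The larger (unnormalized) posterior wins
print("not marry" if score_not > score_marry else "marry")  # not marry
```

Since both scores share the same evidence denominator P(features), comparing the unnormalized products is enough to pick the winning class.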
8. Summary
Advantages of Naive Bayes:
Rooted in solid mathematical theory with stable classification efficiency.
Performs well on small datasets, supports multi‑class tasks, and is suitable for incremental training on large data.
Robust to missing data and simple to implement, often used for text classification.
Disadvantages of Naive Bayes:
Assumes feature independence, which may not hold for many real‑world problems, reducing accuracy when features are correlated.
Requires prior probabilities, which can be difficult to estimate accurately.
Prediction accuracy can be sensitive to the representation of input data (continuous vs. discrete).
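The advantage about incremental training on large data refers to the partial_fit API, which updates the per-class running means and variances chunk by chunk instead of requiring the full dataset in memory. A minimal sketch on the iris data (the chunk size of 50 is arbitrary):

```python
import numpy as np
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()
X, y = iris.data, iris.target

clf = GaussianNB()
# The full set of classes must be declared on the first partial_fit call,
# since any single chunk may not contain every class
classes = np.unique(y)
for start in range(0, len(X), 50):
    clf.partial_fit(X[start:start + 50], y[start:start + 50], classes=classes)

print(clf.score(X, y))
```

After all chunks are consumed, the model is equivalent to one trained with a single fit call on the full dataset.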
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.