Introduction to Naive Bayes Classifier with scikit-learn
This article introduces the Naive Bayes classification algorithm, explains its theoretical basis, demonstrates how to use scikit-learn's GaussianNB class with Python code, evaluates model performance, and discusses advantages, limitations, and practical examples of the method.
1. Overview
This article explains the Naive Bayes classification algorithm, covering its definition, the scikit-learn library support, a practical example, the underlying theory, and a summary, aiming to help readers understand both usage and principles.
2. References
Bayesian classifier – Wikipedia
3. Introduction to Naive Bayes Classifier
Bayesian classification is based on Bayes' theorem; the Naive Bayes variant is the simplest and most common approach. It is intuitive, computationally cheap, and widely applied across many domains.
4. Naive Bayes Library Overview
The GaussianNB class implements a Gaussian Naive Bayes classifier. Its main parameter is priors, representing the prior class probabilities. If not provided, priors are estimated as mk/m, where m is the total number of training samples and mk is the number belonging to class k.
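As a quick sketch of how the priors parameter behaves (the toy dataset below is illustrative, not from the article): if priors is omitted, GaussianNB estimates it from class frequencies mk/m, and an explicit priors list overrides that estimate.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Tiny illustrative dataset: two features, two balanced classes
X = np.array([[1.0, 2.0], [1.2, 1.8], [3.0, 4.0], [3.2, 4.1]])
y = np.array([0, 0, 1, 1])

# Default: priors estimated from class frequencies (mk/m)
clf_default = GaussianNB()
clf_default.fit(X, y)
print(clf_default.class_prior_)  # [0.5 0.5]

# Explicit priors override the estimated frequencies
clf_prior = GaussianNB(priors=[0.8, 0.2])
clf_prior.fit(X, y)
print(clf_prior.class_prior_)  # [0.8 0.2]
```

The fitted priors are exposed through the class_prior_ attribute, which makes it easy to confirm which values the model actually used.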
5. Example Application
1. Import Packages
#coding=utf-8
import numpy as np
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB
2. Load Sample Data
# Use the built‑in iris dataset
iris = datasets.load_iris()
X = iris.data[:, [0, 2]]
Y = iris.target
3. Split Training and Test Sets
# Randomly permute indices
shuffled_indices = np.random.permutation(len(X))
test_set_size = int(len(X) * 0.25)
test_indices = shuffled_indices[:test_set_size]
train_indices = shuffled_indices[test_set_size:]
train_sample = X[train_indices]
train_result = Y[train_indices]
test_sample = X[test_indices]
test_result = Y[test_indices]
4. Train the Model
# Initialize the classifier
clf = GaussianNB()
# Fit the training data
clf.fit(train_sample, train_result)
5. Predict on Test Data
# Generate predictions
predict_data = clf.predict(test_sample)
6. Show Results
print(predict_data)
print(test_result)
Sample output:
[1 2 0 2 1 0 0 1 1 1 0 1 0 1 1 1 1 1 0 1 0 2 0 1 2 2 0 1 2 1 1 0 2 1 0 0 1]
[1 2 0 2 1 0 0 1 2 1 0 2 0 1 1 1 1 1 0 1 0 2 0 1 2 2 0 1 2 1 1 0 2 1 0 0 1]
7. Compute Accuracy
# Count mismatches between predictions and true labels
diff = 0.0
for num in range(0, len(predict_data)):
    if predict_data[num] != test_result[num]:
        diff = diff + 1
rate = diff / len(predict_data)
print(1 - rate)
Accuracy obtained: 0.945945945946
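The manual split and accuracy loop above can also be expressed with scikit-learn's own helpers. A minimal sketch using train_test_split and accuracy_score, which the original example does not use (the exact accuracy will differ from 0.9459 because the split is different):

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()
X = iris.data[:, [0, 2]]  # same two features as the example above
y = iris.target

# A fixed random_state makes the split reproducible across runs
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = GaussianNB()
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))
```

Using a fixed random_state avoids the run-to-run variation of np.random.permutation, which is handy when comparing classifiers.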
6. Theory of Naive Bayes
Classification is a common everyday task. Naive Bayes computes the posterior probability P(class | features) by assuming feature independence and applying Bayes' theorem.
7. Example Problem Analysis
Given a small dataset of 12 records (6 "marry", 6 "not marry"), we calculate prior and conditional probabilities for four features (not handsome, bad personality, short, not ambitious) and compute the posterior probabilities for both outcomes. The calculations show that P(not marry | features) > P(marry | features), so the classifier would predict "not marry".
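The comparison above can be sketched in a few lines. The priors (6 of 12 records per class) come from the article, but the conditional probabilities below are illustrative placeholders, not the article's actual counts:

```python
# Priors from the article: 6 "marry" and 6 "not marry" out of 12 records
p_marry, p_not = 0.5, 0.5

# P(feature | class) for: not handsome, bad personality, short, not ambitious
# NOTE: these values are hypothetical, chosen only to illustrate the method
cond_marry = [1/2, 1/6, 1/6, 1/6]
cond_not = [1/3, 1/2, 1/2, 1/2]

# Naive Bayes score: prior times the product of per-feature likelihoods
score_marry = p_marry
for p in cond_marry:
    score_marry *= p
score_not = p_not
for p in cond_not:
    score_not *= p

# The larger (unnormalized) posterior wins
print("not marry" if score_not > score_marry else "marry")  # not marry
```

Since both scores share the same evidence denominator P(features), comparing the unnormalized products is enough to pick the winning class.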
8. Summary
Advantages of Naive Bayes:
Rooted in solid mathematical theory with stable classification efficiency.
Performs well on small datasets, supports multi‑class tasks, and is suitable for incremental training on large data.
Robust to missing data and simple to implement, often used for text classification.
Disadvantages of Naive Bayes:
Assumes feature independence, which may not hold for many real‑world problems, reducing accuracy when features are correlated.
Requires prior probabilities, which can be difficult to estimate accurately.
Prediction accuracy can be sensitive to the representation of input data (continuous vs. discrete).
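The advantage about incremental training on large data refers to the partial_fit API, which updates the per-class running means and variances chunk by chunk instead of requiring the full dataset in memory. A minimal sketch on the iris data (the chunk size of 50 is arbitrary):

```python
import numpy as np
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()
X, y = iris.data, iris.target

clf = GaussianNB()
# The full set of classes must be declared on the first partial_fit call,
# since any single chunk may not contain every class
classes = np.unique(y)
for start in range(0, len(X), 50):
    clf.partial_fit(X[start:start + 50], y[start:start + 50], classes=classes)

print(clf.score(X, y))
```

After all chunks are consumed, the model is equivalent to one trained with a single fit call on the full dataset.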
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.