Master Linear Discriminant Analysis (LDA) with Python: Theory & Code
This article explains Linear Discriminant Analysis (LDA) as a pattern‑recognition technique that projects data onto a low‑dimensional space to maximize class separation, details its mathematical formulation with between‑class and within‑class scatter matrices, and provides a complete Python implementation using scikit‑learn on the Iris dataset, including visualization of the results.
Linear Discriminant Analysis (LDA)
Linear Discriminant Analysis (LDA) is a common pattern‑recognition and statistical analysis technique typically used for classification problems. LDA aims to maximize between‑class distance and minimize within‑class distance by projecting data onto a linear subspace of lower dimension, enabling effective classification.
LDA first computes the mean vector and covariance matrix for each class, then solves a matrix eigenvalue problem to obtain a linear transformation matrix that maps the data to a new low‑dimensional space. The dimensionality of this new space is at most the number of classes minus one, allowing high‑dimensional data to be represented in a more manageable form.
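The steps above can be sketched from scratch with NumPy. This is a minimal illustration, not a production implementation; the function and variable names here are our own:

```python
import numpy as np

def lda_fit(X, y, n_components):
    """Compute an LDA projection matrix via the generalized eigenproblem."""
    classes = np.unique(y)
    n_features = X.shape[1]
    overall_mean = X.mean(axis=0)

    S_W = np.zeros((n_features, n_features))  # within-class scatter
    S_B = np.zeros((n_features, n_features))  # between-class scatter
    for c in classes:
        X_c = X[y == c]
        mean_c = X_c.mean(axis=0)
        S_W += (X_c - mean_c).T @ (X_c - mean_c)
        diff = (mean_c - overall_mean).reshape(-1, 1)
        S_B += len(X_c) * (diff @ diff.T)

    # Solve S_W^{-1} S_B w = lambda w; keep eigenvectors with largest eigenvalues
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(S_W) @ S_B)
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs.real[:, order[:n_components]]  # columns are projection directions
```

Projecting is then simply `X @ W`. At most `n_classes - 1` eigenvalues are non‑zero, which is why the target dimensionality is bounded by the number of classes minus one.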
In LDA, the projected low‑dimensional data can be used to build a classification model. When new data arrives, it is projected into the low‑dimensional space and classified using the model trained on the transformed data.
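With scikit‑learn, this train‑then‑classify workflow can be sketched as follows (the Iris train/test split and its parameters are illustrative choices, not prescribed by the article):

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# Split Iris into training and held-out data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

# fit() learns both the projection and the classification rule
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)

# New data is projected internally and classified in the low-dimensional space
pred = lda.predict(X_test)
print((pred == y_test).mean())  # accuracy on held-out samples
```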
Advantages of LDA include simplicity, good interpretability, and the ability to handle high‑dimensional data; its main drawback is reduced performance on non‑linear classification problems.
Mathematical Model
LDA’s mathematical expression is derived from maximizing between‑class scatter and minimizing within‑class scatter. Two matrices are defined: the between‑class scatter matrix (SB) and the within‑class scatter matrix (SW).
SB measures the dispersion between different class means (the covariance of class mean vectors). SW measures the dispersion within each class (the sum of covariances of samples around their class mean).
The goal is to find a linear transformation matrix W that maps the original high‑dimensional data to a lower‑dimensional space where SB is maximized and SW is minimized. This problem is solved by maximizing the generalized Rayleigh quotient.
The mathematical formulation is:

$$W^{*} = \arg\max_{W} \frac{\left| W^{T} S_{B} W \right|}{\left| W^{T} S_{W} W \right|}$$

where W is the linear transformation matrix to be solved, and SB and SW are the between‑class and within‑class scatter matrices respectively.
For a binary classification example with class sample counts N1 and N2 and data dimension d, the within‑class and between‑class scatter matrices can be expressed as:

$$S_{W} = \sum_{x \in X_{1}} (x - \mu_{1})(x - \mu_{1})^{T} + \sum_{x \in X_{2}} (x - \mu_{2})(x - \mu_{2})^{T}$$

$$S_{B} = (\mu_{1} - \mu_{2})(\mu_{1} - \mu_{2})^{T}$$

where X1 and X2 are the sample sets of the two classes and μ1 and μ2 are their mean vectors.
By solving the generalized Rayleigh quotient, the optimal W is obtained, projecting data into a low‑dimensional space where simple classifiers such as nearest‑centroid can be applied.
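As a sketch of that last step, a nearest‑centroid rule in the projected space can look like this (a minimal illustration; the variable names are our own):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = load_iris()
X, y = iris.data, iris.target

# Project to the (n_classes - 1)-dimensional LDA space
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)

# Nearest-centroid: assign each point to the closest class mean in LDA space
centroids = np.array([X_lda[y == c].mean(axis=0) for c in np.unique(y)])
dists = np.linalg.norm(X_lda[:, None, :] - centroids[None, :, :], axis=2)
pred = dists.argmin(axis=1)
print((pred == y).mean())  # training accuracy of the simple rule
```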
Python Implementation
We use the Iris dataset, split it into feature matrix X and label vector y, and employ scikit‑learn’s LinearDiscriminantAnalysis to create an LDA model that projects the data into a 2‑dimensional space.
<code>from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
# Create LDA model and fit data
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)
</code>Finally, we visualize the result with matplotlib. Running the code shows the Iris dataset projected onto a 2‑dimensional space with clear class separation.
<code># Visualize result
import matplotlib.pyplot as plt
plt.scatter(X_lda[:, 0], X_lda[:, 1], c=y)
plt.xlabel('LD1')
plt.ylabel('LD2')
plt.show()
</code>Model Perspective
Insights, knowledge, and enjoyment from a mathematical modeling researcher and educator. Hosted by Haihua Wang, a modeling instructor and author of "Clever Use of Chat for Mathematical Modeling", "Modeling: The Mathematics of Thinking", "Mathematical Modeling Practice: A Hands‑On Guide to Competitions", and co‑author of "Mathematical Modeling: Teaching Design and Cases".