
Introduction to PCA with scikit-learn: A Dimensionality Reduction Tutorial

This article explains why dimensionality reduction is needed, introduces scikit-learn's PCA class and its parameters, walks through step-by-step Python examples for generating data, visualising samples, computing variance ratios, and applying different n_components settings, and closes with the mathematical intuition and algorithmic workflow of Principal Component Analysis.

Qunar Tech Salon

Why use dimensionality reduction?

All machine‑learning methods rely on sample features; when the number of features becomes very large, model training and inference become computationally expensive. Reducing the feature space simplifies calculations while preserving most of the information. Principal Component Analysis (PCA) is one of the most important techniques for achieving this.

scikit-learn PCA class overview

In scikit‑learn, PCA‑related classes reside in the sklearn.decomposition package. The most commonly used class is sklearn.decomposition.PCA, which implements the classic PCA algorithm.

Key parameters of sklearn.decomposition.PCA

The class requires little tuning. The main parameter is n_components, which can be:

An integer ≥ 1 specifying the exact number of components to keep.

A float in (0, 1) indicating the desired proportion of explained variance to retain.

The string "mle", which lets the algorithm automatically select the number of components using a maximum‑likelihood estimator.

Two useful attributes are explained_variance_ (the variance of each component) and explained_variance_ratio_ (the proportion of total variance explained by each component).
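As a quick illustration of the three accepted forms of n_components, the constructors below (a minimal sketch; the variable names are illustrative) show each variant side by side:

```python
from sklearn.decomposition import PCA

pca_int = PCA(n_components=2)       # keep exactly two components
pca_var = PCA(n_components=0.95)    # keep enough components to explain 95% of the variance
pca_mle = PCA(n_components='mle')   # let the MLE criterion pick the number automatically
```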

PCA example

1. Import libraries

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the 3-D projection (needed on older matplotlib)
from sklearn.datasets import make_blobs

2. Generate sample data

X, y = make_blobs(n_samples=10000, n_features=3,
    centers=[[3,3,3],[0,0,0],[1,1,1],[2,2,2]],
    cluster_std=[0.2,0.1,0.2,0.2], random_state=9)

3. Visualise the 3‑D samples

fig = plt.figure()
ax = fig.add_subplot(projection='3d')  # 3-D axes; the Axes3D(fig, ...) constructor is deprecated
ax.view_init(elev=30, azim=20)
ax.scatter(X[:,0], X[:,1], X[:,2], marker='o')  # plt.scatter cannot take a third coordinate
plt.show()

4. Compute variance ratio without reduction

from sklearn.decomposition import PCA
pca = PCA(n_components=3)
pca.fit(X)
print(pca.explained_variance_ratio_)

The output is [0.98288449 0.00874633 0.00836918], indicating that the first component alone captures about 98% of the variance. Setting n_components=2 retains roughly 99.2% of the variance.

5. Reduce to two dimensions and visualise

pca = PCA(n_components=2)
pca.fit(X)
X_new = pca.transform(X)
plt.scatter(X_new[:,0], X_new[:,1], marker='o')
plt.show()

6. Reduce by variance threshold (0.95)

pca = PCA(n_components=0.95)
pca.fit(X)
print(pca.explained_variance_ratio_)
print(pca.n_components_)

The algorithm keeps only the first component because it already exceeds the 95% threshold.

7. Let MLE choose the number of components

pca = PCA(n_components='mle')
pca.fit(X)
print(pca.explained_variance_ratio_)
print(pca.n_components_)

MLE also selects a single component for this dataset, confirming that the first principal direction dominates the variance.

The intuition behind PCA

PCA seeks a set of orthogonal axes (principal components) that capture the maximum variance of the data. Mathematically, this is equivalent to solving an eigen‑value decomposition of the covariance matrix and selecting the eigenvectors associated with the largest eigenvalues.
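In symbols (writing Σ for the sample covariance matrix), the first principal direction solves a constrained maximisation whose stationarity condition is exactly the eigenvalue equation:

```latex
\max_{w}\; w^{\top} \Sigma\, w
\quad \text{s.t.} \quad w^{\top} w = 1
\;\;\Longrightarrow\;\; \Sigma\, w = \lambda\, w
```

The attained maximum is the largest eigenvalue λ, so the variance captured by each component is its eigenvalue, which is why explained_variance_ratio_ is simply each eigenvalue divided by their sum.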

PCA algorithm workflow

Center all samples by subtracting the mean.

Compute the covariance matrix of the centered data.

Perform eigen‑value decomposition (or singular value decomposition) of the covariance matrix.

Select the top d' eigenvectors (those with the largest eigenvalues), normalise them, and stack them into the projection matrix W.

Project each original sample x_i onto the new subspace: z_i = W^T x_i.

The resulting lower‑dimensional dataset retains as much of the original variance as possible while discarding redundant or noisy dimensions.
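The workflow above can be sketched directly in NumPy (a minimal illustration; the dataset and variable names here are only for demonstration). The result matches scikit-learn's PCA up to per-component sign flips, since eigenvectors are only defined up to sign:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

X, _ = make_blobs(n_samples=200, n_features=3,
                  centers=[[3,3,3],[0,0,0]], random_state=9)

# 1. Center the samples by subtracting the mean
Xc = X - X.mean(axis=0)
# 2. Covariance matrix of the centered data (features x features)
C = np.cov(Xc, rowvar=False)
# 3. Eigen-decomposition (eigh: C is symmetric)
vals, vecs = np.linalg.eigh(C)
# 4. Sort eigenvectors by descending eigenvalue and keep the top d' = 2
order = np.argsort(vals)[::-1]
W = vecs[:, order[:2]]            # projection matrix W
# 5. Project each sample: z_i = W^T x_i
Z = Xc @ W

# Cross-check against scikit-learn (equal up to sign flips per component)
Z_sk = PCA(n_components=2).fit_transform(X)
print(np.allclose(np.abs(Z), np.abs(Z_sk)))
```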

Summary of PCA variants

PCA is a widely used unsupervised dimensionality‑reduction method because it only requires eigen‑value decomposition, making it easy to implement. Variants such as Kernel PCA, Incremental PCA, and Sparse PCA address non‑linearity, memory constraints, and sparsity, respectively. Advantages include simplicity and variance‑based information preservation; drawbacks involve loss of interpretability of the original features and possible discarding of low‑variance but informative signals.
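As a small sketch of one of the variants mentioned above, Kernel PCA with an RBF kernel can unfold data that linear PCA cannot; the gamma value below is an illustrative choice, not a tuned one:

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

# Two concentric circles: no linear direction separates them,
# so linear PCA cannot help here
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# RBF-kernel PCA maps the data into a feature space where the
# leading components do separate the two rings
kpca = KernelPCA(n_components=2, kernel='rbf', gamma=10)
X_kpca = kpca.fit_transform(X)
print(X_kpca.shape)
```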

Tags: Python, PCA, scikit-learn, dimensionality reduction
Written by

Qunar Tech Salon

Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.
