A Comprehensive Guide to Dimensionality Reduction Algorithms with Python Implementations
This article introduces eleven classic dimensionality reduction techniques—including PCA, LDA, MDS, LLE, and t‑SNE—explains their principles, advantages, and limitations, and provides complete Python code examples and resources for each method, making it a valuable guide for beginners in machine learning and data mining.
Implementations of the various dimensionality reduction algorithms are scattered across the web, many without source code. This article presents a GitHub project that implements 11 classic dimensionality reduction algorithms in Python, including PCA, LDA, MDS, LLE, and t‑SNE, together with documentation and visualized results, making it suitable for machine-learning beginners and newcomers to data mining.
Why Perform Dimensionality Reduction?
Dimensionality reduction replaces a high‑dimensional vector \(X \in \mathbb{R}^D\) with a lower‑dimensional vector \(Z \in \mathbb{R}^d\) (where \(d < D\)) while preserving the useful information. Real‑world datasets often have hundreds or thousands of dimensions, leading to the "curse of dimensionality"; reducing dimensions makes the data easier to work with, reduces redundancy among variables, lowers computational cost, and removes noise. It is widely applied in text processing, face recognition, image classification, and natural language processing.
Principles of Dimensionality Reduction
High‑dimensional data are usually sparse, and reduction removes redundant, invalid, or duplicated information. For example, in a 1024×1024 image whose non‑zero pixels occupy only a 50×50 region in the center, the zero‑valued regions carry no information and can be discarded. Classical reduction methods divide into linear and nonlinear approaches; nonlinear methods further split into kernel‑based and eigenvalue‑based (manifold learning) techniques:
Linear methods: PCA, ICA, LDA, LFA, LPP (linear representation of LE)
Non‑linear kernel‑based: KPCA, KICA, KDA
Non‑linear eigen‑value‑based (manifold learning): ISOMAP, LLE, LE, LPP, LTSA, MVU
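The image‑cropping example above can be sketched in a few lines of NumPy. The sizes come from the text; the crop indices and pixel values are illustrative assumptions:

```python
import numpy as np

# A 1024x1024 image whose only non-zero pixels sit in a 50x50 central
# block (indices chosen for illustration); everything else is redundant.
img = np.zeros((1024, 1024))
img[487:537, 487:537] = 1.0

# Bounding box of the informative (non-zero) region
rows = np.any(img != 0, axis=1)
cols = np.any(img != 0, axis=0)
r0, r1 = np.where(rows)[0][[0, -1]]
c0, c1 = np.where(cols)[0][[0, -1]]

cropped = img[r0:r1 + 1, c0:c1 + 1]
print(cropped.shape)  # (50, 50): over 99.7% of the pixels carried no information
```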
Principal Component Analysis (PCA) Algorithm
PCA is a fundamental unsupervised linear reduction method that projects data onto the directions of maximum variance (equivalently, of minimum reconstruction error). Proposed by Karl Pearson in 1901, it can be derived from either the maximum‑variance or the minimum‑error criterion.
The algorithm steps are:
Input: data matrix \(X_{m\times n}\).
Center the data: \(X_{new}=X - \text{mean}(X)\).
Compute the covariance matrix of \(X_{new}\).
Obtain eigenvalues and eigenvectors of the covariance matrix.
Sort eigenvalues descending, select the top \(k\) eigenvectors to form matrix \(W_{n\times k}\).
Project the data: \(X_{new}W\) yields the reduced dataset.
The minimum‑error perspective seeks the linear projection that minimizes the average squared distance between the data points and their projections; it leads to the same solution as the maximum‑variance derivation.
Detailed steps can be found in "From Scratch Implementing PCA" (https://blog.csdn.net/u013719780/article/details/78352262).
PCA Code Implementation (Python)
from __future__ import print_function
from sklearn import datasets
import matplotlib.pyplot as plt
import numpy as np
# (run `%matplotlib inline` first if using a Jupyter notebook)

# Shuffle X and y in unison
def shuffle_data(X, y, seed=None):
    if seed is not None:
        np.random.seed(seed)
    idx = np.arange(X.shape[0])
    np.random.shuffle(idx)
    return X[idx], y[idx]

# Normalize dataset X along the given axis
def normalize(X, axis=-1, p=2):
    lp_norm = np.atleast_1d(np.linalg.norm(X, p, axis))
    lp_norm[lp_norm == 0] = 1
    return X / np.expand_dims(lp_norm, axis)

# Standardize dataset X to zero mean and unit variance per column
def standardize(X):
    X_std = np.zeros(X.shape)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    for col in range(np.shape(X)[1]):
        if std[col]:
            X_std[:, col] = (X[:, col] - mean[col]) / std[col]
    return X_std

# Split dataset into train and test sets
def train_test_split(X, y, test_size=0.2, shuffle=True, seed=None):
    if shuffle:
        X, y = shuffle_data(X, y, seed)
    n_train_samples = int(X.shape[0] * (1 - test_size))
    x_train, x_test = X[:n_train_samples], X[n_train_samples:]
    y_train, y_test = y[:n_train_samples], y[n_train_samples:]
    return x_train, x_test, y_train, y_test

# Compute the covariance matrix of X (or the cross-covariance of X and Y)
def calculate_covariance_matrix(X, Y=np.empty((0, 0))):
    if not Y.any():
        Y = X
    n_samples = np.shape(X)[0]
    covariance_matrix = (1 / (n_samples - 1)) * (X - X.mean(axis=0)).T.dot(Y - Y.mean(axis=0))
    return np.array(covariance_matrix, dtype=float)

# Compute the variance of each column of X
def calculate_variance(X):
    n_samples = np.shape(X)[0]
    variance = (1 / n_samples) * np.diag((X - X.mean(axis=0)).T.dot(X - X.mean(axis=0)))
    return variance

# Compute the standard deviation of each column of X
def calculate_std_dev(X):
    return np.sqrt(calculate_variance(X))

# Compute the correlation matrix of X (or of X and Y)
def calculate_correlation_matrix(X, Y=np.empty([0])):
    if not Y.any():
        Y = X
    covariance_matrix = calculate_covariance_matrix(X, Y)
    std_dev_X = np.expand_dims(calculate_std_dev(X), 1)
    std_dev_y = np.expand_dims(calculate_std_dev(Y), 1)
    correlation_matrix = np.divide(covariance_matrix, std_dev_X.dot(std_dev_y.T))
    return np.array(correlation_matrix, dtype=float)

class PCA():
    """Principal Component Analysis (unsupervised)"""
    def __init__(self, k=2):
        self.eigen_values = None
        self.eigen_vectors = None
        self.k = k  # number of principal components to keep

    def transform(self, X):
        covariance = calculate_covariance_matrix(X)
        # eigh is the right choice for a symmetric matrix: real eigenvalues
        self.eigen_values, self.eigen_vectors = np.linalg.eigh(covariance)
        idx = self.eigen_values.argsort()[::-1]
        eigenvectors = self.eigen_vectors[:, idx][:, :self.k]
        # Project the centered data onto the top-k eigenvectors
        return (X - X.mean(axis=0)).dot(eigenvectors)

def main():
    data = datasets.load_iris()
    X = data.data
    y = data.target
    X_trans = PCA().transform(X)
    x1 = X_trans[:, 0]
    x2 = X_trans[:, 1]
    cmap = plt.get_cmap('viridis')
    colors = [cmap(i) for i in np.linspace(0, 1, len(np.unique(y)))]
    class_distr = []
    for i, l in enumerate(np.unique(y)):
        _x1 = x1[y == l]
        _x2 = x2[y == l]
        class_distr.append(plt.scatter(_x1, _x2, color=colors[i]))
    plt.legend(class_distr, data.target_names, loc=1)
    plt.xlabel('Principal Component 1')
    plt.ylabel('Principal Component 2')
    plt.show()

if __name__ == "__main__":
    main()

The resulting plot shows the two‑dimensional representation of the Iris dataset. While PCA is widely used, it has limitations: it cannot capture higher‑order (nonlinear) correlations, and it assumes the principal features lie along orthogonal directions, which may not hold when the important structure is non‑orthogonal.
Other Dimensionality Reduction Algorithms and Resources
KPCA (Kernel PCA) : Extends PCA by applying a kernel function before computing the covariance matrix, enabling nonlinear mappings. Resources and code: https://github.com/heucoder/dimensionality_reduction_alo_codes/blob/master/codes/PCA/KPCA.py
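As a sketch of the idea (the standard textbook RBF‑kernel formulation, not necessarily the repository's exact code; `gamma` is an illustrative choice):

```python
import numpy as np
from sklearn import datasets

def rbf_kpca(X, gamma=0.1, n_components=2):
    # RBF kernel matrix from pairwise squared Euclidean distances
    sq = np.sum(X**2, axis=1)
    sq_dists = sq[:, None] + sq[None, :] - 2 * X @ X.T
    K = np.exp(-gamma * sq_dists)
    # Center the kernel matrix (centering in feature space)
    n = K.shape[0]
    one_n = np.ones((n, n)) / n
    K = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    # Eigendecomposition; keep the leading components
    eigvals, eigvecs = np.linalg.eigh(K)
    idx = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, idx] * np.sqrt(eigvals[idx])

X = datasets.load_iris().data
Z = rbf_kpca(X)
print(Z.shape)  # (150, 2)
```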
LDA (Linear Discriminant Analysis) : Maximizes between‑class variance while minimizing within‑class variance, useful for classification tasks. Resources and code: https://github.com/heucoder/dimensionality_reduction_alo_codes/tree/master/codes/LDA
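A minimal sketch of the Fisher criterion described above (within‑ and between‑class scatter, then a generalized eigenproblem); this is the standard formulation, not the repository's code:

```python
import numpy as np
from sklearn import datasets

def lda_fit(X, y, n_components=2):
    n_features = X.shape[1]
    mean_total = X.mean(axis=0)
    Sw = np.zeros((n_features, n_features))  # within-class scatter
    Sb = np.zeros((n_features, n_features))  # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mean_c = Xc.mean(axis=0)
        Sw += (Xc - mean_c).T @ (Xc - mean_c)
        diff = (mean_c - mean_total)[:, None]
        Sb += len(Xc) * diff @ diff.T
    # Directions maximizing between-class over within-class variance
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs.real[:, order[:n_components]]

data = datasets.load_iris()
W = lda_fit(data.data, data.target)
Z = data.data @ W
print(Z.shape)  # (150, 2)
```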
MDS (Multidimensional Scaling) : Preserves pairwise distances when projecting to lower dimensions. Both iterative and non‑iterative implementations are provided. Code: https://github.com/heucoder/dimensionality_reduction_alo_codes/tree/master/codes/MDS
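The non‑iterative (classical) variant can be sketched as follows: double‑center the squared‑distance matrix and eigendecompose it. The toy input is illustrative; for exact Euclidean distances the embedding reproduces them:

```python
import numpy as np

def classical_mds(D, n_components=2):
    """Classical (non-iterative) MDS from a squared-distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ D @ J  # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    idx = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, idx] * np.sqrt(np.maximum(eigvals[idx], 0))

# Four corners of a unit square: squared pairwise distances
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float)
D = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
Z = classical_mds(D)
print(Z.shape)  # (4, 2); pairwise distances are preserved
```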
ISOMAP : Improves MDS by constructing a neighborhood graph to preserve geodesic distances on manifolds. Code: https://github.com/heucoder/dimensionality_reduction_alo_codes/tree/master/codes/ISOMAP
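For a quick feel of the method, scikit‑learn's implementation can be applied to a swiss‑roll manifold (the neighborhood size is an illustrative choice):

```python
from sklearn import datasets
from sklearn.manifold import Isomap

# A classic curved manifold that linear methods cannot unroll
X, _ = datasets.make_swiss_roll(n_samples=500, random_state=0)

# k-nearest-neighbor graph -> geodesic distances -> MDS embedding
Z = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
print(Z.shape)  # (500, 2)
```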
LLE (Locally Linear Embedding) : Reconstructs each point from its neighbors and preserves these local linear relationships in the low‑dimensional embedding. Code: https://github.com/heucoder/dimensionality_reduction_alo_codes/tree/master/codes/LLE
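A usage sketch with scikit‑learn's implementation (neighborhood size is an illustrative assumption):

```python
from sklearn import datasets
from sklearn.manifold import LocallyLinearEmbedding

X, _ = datasets.make_swiss_roll(n_samples=500, random_state=0)

# Each point is reconstructed from its neighbors; the same weights
# are then preserved in the low-dimensional embedding.
Z = LocallyLinearEmbedding(n_neighbors=12, n_components=2).fit_transform(X)
print(Z.shape)  # (500, 2)
```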
t‑SNE : Non‑linear technique especially suited for visualizing high‑dimensional data in 2‑D or 3‑D. Code and examples are linked in the article.
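A minimal usage sketch with scikit‑learn (the perplexity value is an illustrative choice; t‑SNE is stochastic, so a fixed seed makes the run repeatable):

```python
from sklearn import datasets
from sklearn.manifold import TSNE

X = datasets.load_iris().data
Z = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(Z.shape)  # (150, 2)
```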
LE (Laplacian Eigenmaps) : Similar to LLE, it builds a graph based on local neighborhoods and solves a generalized eigenproblem to obtain the embedding. Code: https://github.com/heucoder/dimensionality_reduction_alo_codes/tree/master/codes/LE
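Laplacian Eigenmaps is available in scikit‑learn as `SpectralEmbedding`; a minimal usage sketch (neighborhood size is an illustrative assumption):

```python
from sklearn import datasets
from sklearn.manifold import SpectralEmbedding

X = datasets.load_iris().data

# Builds a nearest-neighbor graph and embeds via the graph Laplacian
Z = SpectralEmbedding(n_components=2, n_neighbors=10).fit_transform(X)
print(Z.shape)  # (150, 2)
```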
LPP (Locality Preserving Projections) : Linear counterpart of LE that seeks a projection matrix preserving local structure. Code: https://github.com/heucoder/dimensionality_reduction_alo_codes/tree/master/codes/LPP
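A minimal sketch following the standard LPP formulation (heat‑kernel weights on a kNN graph, then the generalized eigenproblem \(X^{T}LXa = \lambda X^{T}DXa\)); this is not the repository's code, and the bandwidth `t` and neighborhood size are illustrative assumptions:

```python
import numpy as np
from scipy.linalg import eigh
from sklearn import datasets
from sklearn.neighbors import kneighbors_graph

def lpp(X, n_components=2, n_neighbors=10, t=1.0):
    # Heat-kernel affinities on a symmetrized kNN graph
    W = kneighbors_graph(X, n_neighbors, mode='distance').toarray()
    W = np.exp(-W**2 / t) * (W > 0)
    W = np.maximum(W, W.T)
    D = np.diag(W.sum(axis=1))
    L = D - W  # graph Laplacian
    # Smallest generalized eigenvectors give the projection matrix
    eigvals, eigvecs = eigh(X.T @ L @ X, X.T @ D @ X)
    return eigvecs[:, :n_components]

X = datasets.load_iris().data
A = lpp(X)       # linear projection matrix, unlike LE's direct embedding
Z = X @ A
print(Z.shape)   # (150, 2)
```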
The project author, Heucoder, is a master's student at Harbin Institute of Technology, active in the internet community, and maintains the GitHub repository https://github.com/heucoder/dimensionality_reduction_alo_codes where all the above code can be accessed.