Top Clustering Algorithms in Python with scikit-learn: A Comprehensive Tutorial
This tutorial explains clustering as an unsupervised learning task, outlines why no single algorithm fits all data, and provides step‑by‑step Python code using scikit‑learn to install the library, generate synthetic datasets, and apply ten popular clustering algorithms with visualizations.
Clustering is an unsupervised learning problem that discovers natural groups in feature space; many algorithms exist and there is no universally best method. This tutorial shows how to install scikit‑learn and use its top clustering algorithms in Python.
After completing the tutorial you will understand that clustering finds natural groups in data, that many competing algorithms exist with no single best choice for every dataset, and how to implement, configure, and apply these algorithms with the scikit‑learn library.
The tutorial is divided into three parts: (1) an introduction to clustering concepts, (2) an overview of clustering algorithms, and (3) practical code examples for each algorithm.
Part 1 – Clustering Basics: Clustering groups similar instances in feature space, often forming dense regions that can be used for market segmentation, anomaly detection, or feature engineering. Because clustering is unsupervised, evaluating results is subjective and may require domain expertise.
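In the absence of ground-truth labels, internal metrics such as the silhouette coefficient are often used as a rough proxy for cluster quality. The sketch below illustrates this with scikit-learn's silhouette_score; the blob dataset and the choice of k=3 are illustrative assumptions, not part of the tutorial.

```python
# internal evaluation sketch: silhouette coefficient, range -1..1, higher is better
# (illustrative k-means run on synthetic blobs; not from the original tutorial)
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=3, random_state=1)
labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)
score = silhouette_score(X, labels)
print(round(score, 3))  # values near 1 indicate compact, well-separated clusters
```

Such metrics only measure geometric separation; whether the groups are meaningful still requires domain judgment, as the tutorial notes.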
Part 2 – Clustering Algorithms: Before clustering, data should be scaled. Ten widely used algorithms are covered: AffinityPropagation, AgglomerativeClustering, Birch, DBSCAN, KMeans, MiniBatchKMeans, MeanShift, OPTICS, SpectralClustering, and GaussianMixture.
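The scaling step can be sketched with StandardScaler, one common choice for distance-based algorithms; the toy data below is an illustrative stand-in, not the tutorial's dataset.

```python
# standardize features to zero mean and unit variance before clustering
# (a common preprocessing pattern; the data here is an illustrative stand-in)
from numpy import array
from sklearn.preprocessing import StandardScaler

data = array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
scaled = StandardScaler().fit_transform(data)
print(scaled.mean(axis=0))  # approximately [0, 0]
print(scaled.std(axis=0))   # approximately [1, 1]
```

Without scaling, the second feature above would dominate every Euclidean distance computation by two orders of magnitude.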
Part 3 – Algorithm Examples:
Library installation:
sudo pip install scikit-learn
Check the installed version:
# check the scikit-learn version
import sklearn
print(sklearn.__version__)
Generate a synthetic two‑dimensional classification dataset (used by all examples):
# synthetic classification dataset; unique and where are reused by all examples below
from numpy import unique
from numpy import where
from sklearn.datasets import make_classification
from matplotlib import pyplot
# 1,000 samples, two informative features, one cluster per class
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=4)
# plot the samples of each class in a different colour
for class_value in range(2):
    row_ix = where(y == class_value)
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
pyplot.show()
AffinityPropagation:
# affinity propagation clustering (number of clusters is inferred from the data)
from sklearn.cluster import AffinityPropagation
model = AffinityPropagation(damping=0.9)
model.fit(X)
yhat = model.predict(X)
clusters = unique(yhat)
# plot the points of each cluster
for cluster in clusters:
    row_ix = where(yhat == cluster)
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
pyplot.show()
AgglomerativeClustering:
# agglomerative (bottom-up hierarchical) clustering
from sklearn.cluster import AgglomerativeClustering
model = AgglomerativeClustering(n_clusters=2)
# no separate predict step: fit_predict assigns every point to a cluster
yhat = model.fit_predict(X)
clusters = unique(yhat)
for cluster in clusters:
    row_ix = where(yhat == cluster)
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
pyplot.show()
Birch:
# BIRCH clustering (threshold controls the subcluster radius)
from sklearn.cluster import Birch
model = Birch(threshold=0.01, n_clusters=2)
model.fit(X)
yhat = model.predict(X)
clusters = unique(yhat)
for cluster in clusters:
    row_ix = where(yhat == cluster)
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
pyplot.show()
DBSCAN:
# DBSCAN clustering (points in sparse regions are labelled -1, i.e. noise)
from sklearn.cluster import DBSCAN
model = DBSCAN(eps=0.30, min_samples=9)
yhat = model.fit_predict(X)
clusters = unique(yhat)
for cluster in clusters:
    row_ix = where(yhat == cluster)
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
pyplot.show()
KMeans:
# k-means clustering
from sklearn.cluster import KMeans
model = KMeans(n_clusters=2)
model.fit(X)
yhat = model.predict(X)
clusters = unique(yhat)
for cluster in clusters:
    row_ix = where(yhat == cluster)
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
pyplot.show()
MiniBatchKMeans:
# mini-batch k-means clustering (a faster k-means variant using mini-batches)
from sklearn.cluster import MiniBatchKMeans
model = MiniBatchKMeans(n_clusters=2)
model.fit(X)
yhat = model.predict(X)
clusters = unique(yhat)
for cluster in clusters:
    row_ix = where(yhat == cluster)
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
pyplot.show()
MeanShift:
# mean shift clustering (number of clusters is inferred from the data)
from sklearn.cluster import MeanShift
model = MeanShift()
yhat = model.fit_predict(X)
clusters = unique(yhat)
for cluster in clusters:
    row_ix = where(yhat == cluster)
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
pyplot.show()
OPTICS:
# OPTICS clustering (a DBSCAN variant that orders points by reachability)
from sklearn.cluster import OPTICS
model = OPTICS(eps=0.8, min_samples=10)
yhat = model.fit_predict(X)
clusters = unique(yhat)
for cluster in clusters:
    row_ix = where(yhat == cluster)
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
pyplot.show()
SpectralClustering:
# spectral clustering
from sklearn.cluster import SpectralClustering
model = SpectralClustering(n_clusters=2)
yhat = model.fit_predict(X)
clusters = unique(yhat)
for cluster in clusters:
    row_ix = where(yhat == cluster)
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
pyplot.show()
GaussianMixture:
# Gaussian mixture model clustering
from sklearn.mixture import GaussianMixture
model = GaussianMixture(n_components=2)
model.fit(X)
yhat = model.predict(X)
clusters = unique(yhat)
for cluster in clusters:
    row_ix = where(yhat == cluster)
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
pyplot.show()
Each example fits the model on the synthetic dataset, predicts cluster assignments, and visualizes the results with a scatter plot. The tutorial concludes by summarizing that clustering discovers natural groups, that many algorithms exist without a single best choice, and that scikit‑learn provides straightforward implementations for all of them.
Python Programming Learning Circle