
Efficient Python Implementation of K-Means Clustering with Performance Comparison

This article introduces a concise Python implementation of the k‑means clustering algorithm, compares its speed with a typical implementation, provides full source code, a data‑generation helper, and demonstrates the results on a synthetic dataset of 10,000 points.

Python Programming Learning Circle

The author explains that machine learning can be divided into four areas—classification, clustering, regression, and dimensionality reduction—and that k‑means, while simple, is an excellent introductory exercise.

A custom k‑means implementation written in fewer than 20 lines processes 10,000 samples in about 20 ms for ten iterations, which is over a hundred times faster than many popular versions.

import numpy as np

def kmeans_xufive(ds, k):
    """k-means clustering algorithm

    k  - number of clusters
    ds - ndarray(m, n), dataset of m samples with n attribute values each
    """
    m, n = ds.shape                              # m: sample count, n: attributes per sample
    result = np.empty(m, dtype=int)              # cluster assignment for each of the m samples
    cores = ds[np.random.choice(np.arange(m), k, replace=False)]  # pick k random samples as the initial centroids
    while True:                                  # iterate until assignments stop changing
        d = np.square(np.repeat(ds, k, axis=0).reshape(m, k, n) - cores)
        distance = np.sqrt(np.sum(d, axis=2))    # ndarray(m, k), distance from each sample to each centroid
        index_min = np.argmin(distance, axis=1)  # index of the nearest centroid per sample
        if (index_min == result).all():          # assignments unchanged: converged
            return result, cores                 # return assignments and centroids
        result[:] = index_min                    # reassign samples
        for i in range(k):                       # update each centroid
            items = ds[result == i]
            cores[i] = np.mean(items, axis=0)    # move centroid to the mean of its cluster
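The distance step above expands the data to shape (m, k, n) with np.repeat; the same (m, k) distance matrix can also be produced with plain NumPy broadcasting and no explicit repetition. A minimal sketch of the equivalence (the array sizes here are illustrative, not the article's):

```python
import numpy as np

# Recreate the distance computation from kmeans_xufive on a tiny example:
# m samples, k centroids, n attributes per sample.
m, k, n = 5, 3, 2
rng = np.random.default_rng(0)
ds = rng.random((m, n))
cores = rng.random((k, n))

# The article's form: repeat each sample k times, reshape to (m, k, n).
d = np.square(np.repeat(ds, k, axis=0).reshape(m, k, n) - cores)
distance = np.sqrt(np.sum(d, axis=2))    # (m, k) distances

# Equivalent broadcasting form: (m, 1, n) - (1, k, n) -> (m, k, n).
distance2 = np.sqrt(np.sum((ds[:, None, :] - cores[None, :, :]) ** 2, axis=2))
print(np.allclose(distance, distance2))  # True
```

Both forms materialize an (m, k, n) intermediate, so the speedup over a Python loop comes from doing the arithmetic in one C-level pass rather than from avoiding memory.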

The article also presents a more conventional k‑means version (kmeans_open) that includes helper functions for loading data, computing Euclidean distance, initializing random centroids, and the full clustering loop.

import numpy as np

def loadDataSet(fileName):
    data = np.loadtxt(fileName, delimiter='\t')
    return data

def distEclud(x, y):
    return np.sqrt(np.sum((x - y) ** 2))    # Euclidean distance between two points

def randCent(dataSet, k):
    m, n = dataSet.shape
    centroids = np.zeros((k, n))
    for i in range(k):
        index = int(np.random.uniform(0, m))    # note: the same index may be drawn twice
        centroids[i, :] = dataSet[index, :]
    return centroids

def kmeans_open(dataSet, k):
    m = np.shape(dataSet)[0]
    clusterAssment = np.mat(np.zeros((m, 2)))   # column 0: cluster index, column 1: squared distance
    clusterChange = True
    centroids = randCent(dataSet, k)
    while clusterChange:
        clusterChange = False
        for i in range(m):                      # assign each sample to its nearest centroid
            minDist = 1e5
            minIndex = -1
            for j in range(k):
                distance = distEclud(centroids[j, :], dataSet[i, :])
                if distance < minDist:
                    minDist = distance
                    minIndex = j
            if clusterAssment[i, 0] != minIndex:
                clusterChange = True
                clusterAssment[i, :] = minIndex, minDist ** 2
        for j in range(k):                      # recompute each centroid
            pointsInCluster = dataSet[np.nonzero(clusterAssment[:, 0].A == j)[0]]
            centroids[j, :] = np.mean(pointsInCluster, axis=0)
    return clusterAssment.A[:, 0], centroids
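One caveat in randCent above: np.random.uniform can draw the same row index more than once, so two initial centroids may coincide and a cluster can end up empty, in which case np.mean over zero points yields NaN. A hedged sketch of one common fix, sampling indices without replacement (the name rand_cent_unique is introduced here for illustration and is not from the article):

```python
import numpy as np

# Variant of randCent that guarantees k distinct initial centroids by
# sampling row indices without replacement. Name and approach are an
# illustration, not part of the original article's code.
def rand_cent_unique(dataSet, k):
    m = dataSet.shape[0]
    idx = np.random.choice(m, k, replace=False)  # k distinct sample indices
    return dataSet[idx].copy()                   # copy so centroid updates don't touch the data
```

This mirrors how kmeans_xufive seeds its centroids, which is one reason the vectorized version never hits the empty-cluster case at initialization.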

A utility function create_data_set generates synthetic data for testing, allowing the user to specify multiple centroids and the number of points around each.

def create_data_set(*cores):
    """Generate a synthetic test dataset for k-means clustering."""
    ds = []
    for x0, y0, z0 in cores:
        x = np.random.normal(x0, 0.1 + np.random.random() / 3, z0)
        y = np.random.normal(y0, 0.1 + np.random.random() / 3, z0)
        ds.append(np.stack((x, y), axis=1))
    return np.vstack(ds)
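As a quick sanity check, each (x0, y0, z0) argument contributes z0 two-dimensional points normally distributed around (x0, y0), so the stacked result has one row per point. The helper is repeated below so the snippet runs standalone, and the point counts are illustrative:

```python
import numpy as np

def create_data_set(*cores):
    """Generate a synthetic test dataset for k-means (as defined above)."""
    ds = []
    for x0, y0, z0 in cores:
        x = np.random.normal(x0, 0.1 + np.random.random() / 3, z0)
        y = np.random.normal(y0, 0.1 + np.random.random() / 3, z0)
        ds.append(np.stack((x, y), axis=1))
    return np.vstack(ds)

# Two clusters: 100 points near (0, 0) and 150 points near (2, 2).
ds = create_data_set((0, 0, 100), (2, 2, 150))
print(ds.shape)  # (250, 2)
```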

The testing script creates a dataset with four clusters (2,500 points each), runs both implementations, measures execution time, and visualizes the clustering results with Matplotlib.

import time
import matplotlib.pyplot as plt

k = 4
ds = create_data_set((0, 0, 2500), (0, 2, 2500), (2, 0, 2500), (2, 2, 2500))

# Test the custom implementation
t0 = time.time()
result, cores = kmeans_xufive(ds, k)
t = time.time() - t0
plt.scatter(ds[:, 0], ds[:, 1], s=1, c=result.astype(int))
plt.scatter(cores[:, 0], cores[:, 1], marker='x', c=np.arange(k))
plt.show()
print('custom kmeans_xufive: %.3f seconds' % t)

# Test the conventional implementation
t0 = time.time()
result, cores = kmeans_open(ds, k)
t = time.time() - t0
plt.scatter(ds[:, 0], ds[:, 1], s=1, c=result.astype(int))
plt.scatter(cores[:, 0], cores[:, 1], marker='x', c=np.arange(k))
plt.show()
print('conventional kmeans_open: %.3f seconds' % t)

Running the tests shows the custom version completes in roughly 0.016 seconds, while the conventional version takes about 4 seconds for the same 10,000‑point dataset, demonstrating a substantial speed advantage.
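The gap is consistent with where the time goes: the conventional version computes every sample-to-centroid distance in a Python loop, paying interpreter overhead m times k times per iteration, while the vectorized version does one C-level pass. A small self-contained sketch of just that contrast (the sizes, and therefore the timings, are illustrative rather than the article's measurements):

```python
import time
import numpy as np

# Contrast the per-sample Python loop with the vectorized distance
# computation over the same data.
rng = np.random.default_rng(1)
ds = rng.random((2000, 2))      # 2,000 sample points
cores = rng.random((4, 2))      # 4 centroids

t0 = time.perf_counter()
dist_loop = np.empty((len(ds), len(cores)))
for i, p in enumerate(ds):      # interpreter overhead paid m * k times
    for j, c in enumerate(cores):
        dist_loop[i, j] = np.sqrt(np.sum((p - c) ** 2))
t_loop = time.perf_counter() - t0

t0 = time.perf_counter()
dist_vec = np.sqrt(np.sum((ds[:, None, :] - cores[None, :, :]) ** 2, axis=2))
t_vec = time.perf_counter() - t0

assert np.allclose(dist_loop, dist_vec)  # same distances either way
print('loop: %.4fs  vectorized: %.4fs' % (t_loop, t_vec))
```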

Tags: Performance, machine learning, Python, Clustering, NumPy, k-means
Written by

Python Programming Learning Circle

A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
