
Understanding the k-Nearest Neighbor (kNN) Classification Algorithm and Its Python Implementation

This article introduces the concept and intuition behind the k-Nearest Neighbor (kNN) classification algorithm, explains its basic nearest-neighbor form and its general k-neighbor form, discusses feature construction and Euclidean distance calculations, and provides a complete Python implementation with example code.


Classification is a routine human activity, and the article uses it to motivate the k-Nearest Neighbor (kNN) algorithm as a simple yet powerful method for categorizing objects based on similarity.

The core idea of kNN is that a sample belongs to the class whose representative examples are closest to it, typically measured by a distance metric; the smaller the distance, the higher the similarity.

Two concrete forms are described: the basic nearest‑neighbor (NN) approach, which selects a single nearest example, and the general kNN approach, which considers the average distance to the k nearest representatives of each class (e.g., a 4‑NN classifier).
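The k-averaging rule described above can be sketched in a few lines of NumPy. This is an illustrative helper, not code from the article; the function name and the `k=4` default (matching the 4-NN example) are assumptions:

```python
import numpy as np

def knn_class_by_mean_distance(x, samples_by_class, k=4):
    """Assign x to the class whose k nearest representative samples
    have the smallest average distance to x (the article's kNN variant)."""
    best_label, best_mean = None, None
    for label, samples in samples_by_class.items():
        # Euclidean distances from x to every sample of this class
        d = np.linalg.norm(np.asarray(samples, dtype=float) - x, axis=1)
        # Average over the k closest samples (or all of them if fewer than k)
        mean_k = np.sort(d)[:k].mean()
        if best_mean is None or mean_k < best_mean:
            best_label, best_mean = label, mean_k
    return best_label
```

With `k=1` this reduces to the basic NN classifier; larger k makes the decision less sensitive to a single atypical training sample.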

To apply kNN in practice, the article outlines two essential steps of feature engineering: constructing meaningful feature vectors (e.g., height, weight, tail length) and choosing an appropriate distance function. Euclidean distance is presented with its formula and illustrated with sample calculations.
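As a quick illustration of the Euclidean distance step (the feature values here echo the height/weight/tail-length features used later in the article's toy data):

```python
import numpy as np

# Two feature vectors: (height in cm, weight in kg, tail length in cm)
a = np.array([160.0, 60.0, 0.0])
b = np.array([170.0, 70.0, 0.0])

# Euclidean distance: square root of the sum of squared feature differences
d = np.sqrt(np.sum((a - b) ** 2))   # equivalently: np.linalg.norm(a - b)
print(d)  # sqrt(10**2 + 10**2) = sqrt(200) ≈ 14.14
```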

Potential improvements are suggested, such as applying dimensionality‑reduction techniques like PCA, experimenting with alternative distance measures (e.g., cosine similarity), and expanding the training set when resources allow.
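The article only names cosine similarity as an alternative; a minimal sketch of how it could slot in as a distance function (the helper below is an assumption, not part of the original implementation):

```python
import numpy as np

def cosine_distance(x1, x2):
    """Cosine distance = 1 - cosine similarity.
    0 means the vectors point in the same direction; smaller is more similar."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    return 1.0 - np.dot(x1, x2) / (np.linalg.norm(x1) * np.linalg.norm(x2))
```

Unlike Euclidean distance, cosine distance ignores vector magnitude, which can help when features differ wildly in scale, though normalizing the features first is often the simpler fix.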

Finally, a complete Python implementation of kNN is provided, including the class definition, model fitting, prediction, distance computation, and a simple data-generation routine. The version below fixes several bugs in the original (an uninitialized result list in `predict`, an `== None` comparison, a squared rather than true Euclidean distance, and whitespace-fragile parsing) and translates the comments into English; the Chinese names in the toy data are romanized as single tokens so they parse cleanly:

```python
import numpy as np


# A minimal kNN classifier
class KNN():
    def __init__(self):
        # Training-sample feature vectors grouped by class:
        # key = class label, value = list of feature vectors
        self.model = {}
        # Number of training samples in each class
        self.training_sample_num = {}

    # Fit the model: X is a list of feature vectors, Y the matching labels
    def fit(self, X, Y):
        for i in range(len(Y)):
            # Group the training data by class
            if Y[i] in self.model:
                self.model[Y[i]].append(X[i])
            else:
                self.model[Y[i]] = [X[i]]
            # Count the samples in each class
            self.training_sample_num[Y[i]] = self.training_sample_num.get(Y[i], 0) + 1

    # Predict the class of a sample. Mimicking sklearn's style, this accepts
    # either a single sample or a batch of samples.
    def predict(self, X):
        if isinstance(X[0], (list, tuple, np.ndarray)):
            # A batch: predict each sample in turn
            return [self.predict_one(x) for x in X]
        return self.predict_one(X)

    # Predict the class of a single sample
    def predict_one(self, x):
        label = None   # best class label found so far
        min_d = None   # smallest mean distance to a class found so far
        for class_label in self.model:
            # Sum the distances from x to this class's representative samples
            sum_d = 0.0
            for sample in self.model[class_label]:
                sum_d += self.distance(x, sample)
            # Mean distance from x to this class
            mean_d = sum_d / self.training_sample_num[class_label]
            # Keep the class with the smallest mean distance
            if min_d is None or mean_d < min_d:
                label = class_label
                min_d = mean_d
        return label

    # Distance between two samples
    def distance(self, x1, x2, type="eu"):
        d = None
        if type == "eu":
            x1 = np.asarray(x1, dtype=float)
            x2 = np.asarray(x2, dtype=float)
            d = np.sqrt(np.sum((x1 - x2) ** 2))  # Euclidean distance
        return d

    # Evaluate the model (left unimplemented in the original)
    def evaluate(self, X, Y):
        pass

    # Generate toy training data
    def generate_training_data(self):
        data_str = """class id height(cm) weight(kg) tail(cm)
A GaoXiumin 160 60 0
A ZhaoBenshan 170 70 0
A FanWei 170 70 0
A SongDandan 160 60 0
B SunWukong 120 40 100
B panda 100 100 10
B elephant 300 3000 50
B mouse 10 0.1 10"""
        print(data_str)
        X, Y = [], []
        for line in data_str.split('\n')[1:]:  # skip the header row
            data = line.split()
            Y.append(data[0])                        # class label
            X.append([float(v) for v in data[2:]])   # numeric features
        return X, Y


if __name__ == '__main__':
    M = KNN()
    train_X, train_Y = M.generate_training_data()
    print(train_X)
    M.fit(train_X, train_Y)
    xiao_li_zi = np.array([175, 75, 0])  # a new sample to classify
    res = M.predict(xiao_li_zi)
    print(res)  # prints 'A': closest on average to the class-A samples
```

The conclusion emphasizes that kNN is often the first classification algorithm learned in data mining or machine learning courses, valued for its intuitive nature, ease of implementation, and pedagogical usefulness despite being less efficient than more advanced methods.

Tags: machine learning, feature engineering, classification, kNN, Euclidean distance
Written by

Python Programming Learning Circle

A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
