
Understanding the k-Nearest Neighbor (kNN) Classification Algorithm and Its Python Implementation

This article introduces the concept and intuition behind the k-Nearest Neighbor (kNN) classification algorithm, explains its basic nearest-neighbor form and its general k-neighbor form, discusses feature construction and Euclidean distance calculations, and provides a complete Python implementation with example code.


Classification is a routine human activity, and the article uses it to motivate the k-Nearest Neighbor (kNN) algorithm as a simple yet powerful method for categorizing objects based on similarity.

The core idea of kNN is that a sample belongs to the class whose representative examples are closest to it, typically measured by a distance metric; the smaller the distance, the higher the similarity.

Two concrete forms are described: the basic nearest‑neighbor (NN) approach, which selects a single nearest example, and the general kNN approach, which considers the average distance to the k nearest representatives of each class (e.g., a 4‑NN classifier).
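The k-averaging rule described above can be sketched in a few lines of NumPy. This is an illustrative helper, not code from the article; the function name and the `k=4` default (matching the 4-NN example) are assumptions:

```python
import numpy as np

def knn_class_by_mean_distance(x, samples_by_class, k=4):
    """Assign x to the class whose k nearest representative samples
    have the smallest average distance to x (the article's kNN variant)."""
    best_label, best_mean = None, None
    for label, samples in samples_by_class.items():
        # Euclidean distances from x to every sample of this class
        d = np.linalg.norm(np.asarray(samples, dtype=float) - x, axis=1)
        # Average over the k closest samples (or all of them if fewer than k)
        mean_k = np.sort(d)[:k].mean()
        if best_mean is None or mean_k < best_mean:
            best_label, best_mean = label, mean_k
    return best_label
```

With `k=1` this reduces to the basic NN classifier; larger k makes the decision less sensitive to a single atypical training sample.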

To apply kNN in practice, the article outlines two essential steps of feature engineering: constructing meaningful feature vectors (e.g., height, weight, tail length) and choosing an appropriate distance function. Euclidean distance is presented with its formula and illustrated with sample calculations.
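As a quick illustration of the Euclidean distance step (the feature values here echo the height/weight/tail-length features used later in the article's toy data):

```python
import numpy as np

# Two feature vectors: (height in cm, weight in kg, tail length in cm)
a = np.array([160.0, 60.0, 0.0])
b = np.array([170.0, 70.0, 0.0])

# Euclidean distance: square root of the sum of squared feature differences
d = np.sqrt(np.sum((a - b) ** 2))   # equivalently: np.linalg.norm(a - b)
print(d)  # sqrt(10**2 + 10**2) = sqrt(200) ≈ 14.14
```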

Potential improvements are suggested, such as applying dimensionality‑reduction techniques like PCA, experimenting with alternative distance measures (e.g., cosine similarity), and expanding the training set when resources allow.
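The article only names cosine similarity as an alternative; a minimal sketch of how it could slot in as a distance function (the helper below is an assumption, not part of the original implementation):

```python
import numpy as np

def cosine_distance(x1, x2):
    """Cosine distance = 1 - cosine similarity.
    0 means the vectors point in the same direction; smaller is more similar."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    return 1.0 - np.dot(x1, x2) / (np.linalg.norm(x1) * np.linalg.norm(x2))
```

Unlike Euclidean distance, cosine distance ignores vector magnitude, which can help when features differ wildly in scale, though normalizing the features first is often the simpler fix.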

Finally, a complete Python implementation of kNN is provided, including the class definition, model fitting, prediction, distance computation, and a simple data-generation routine. The version below fixes several bugs in the original (an uninitialized result list in `predict`, an `== None` comparison, a squared rather than true Euclidean distance, and whitespace-fragile parsing) and translates the comments into English; the Chinese names in the toy data are romanized as single tokens so they parse cleanly:

```python
import numpy as np


# A minimal kNN classifier
class KNN():
    def __init__(self):
        # Training-sample feature vectors grouped by class:
        # key = class label, value = list of feature vectors
        self.model = {}
        # Number of training samples in each class
        self.training_sample_num = {}

    # Fit the model: X is a list of feature vectors, Y the matching labels
    def fit(self, X, Y):
        for i in range(len(Y)):
            # Group the training data by class
            if Y[i] in self.model:
                self.model[Y[i]].append(X[i])
            else:
                self.model[Y[i]] = [X[i]]
            # Count the samples in each class
            self.training_sample_num[Y[i]] = self.training_sample_num.get(Y[i], 0) + 1

    # Predict the class of a sample. Mimicking sklearn's style, this accepts
    # either a single sample or a batch of samples.
    def predict(self, X):
        if isinstance(X[0], (list, tuple, np.ndarray)):
            # A batch: predict each sample in turn
            return [self.predict_one(x) for x in X]
        return self.predict_one(X)

    # Predict the class of a single sample
    def predict_one(self, x):
        label = None   # best class label found so far
        min_d = None   # smallest mean distance to a class found so far
        for class_label in self.model:
            # Sum the distances from x to this class's representative samples
            sum_d = 0.0
            for sample in self.model[class_label]:
                sum_d += self.distance(x, sample)
            # Mean distance from x to this class
            mean_d = sum_d / self.training_sample_num[class_label]
            # Keep the class with the smallest mean distance
            if min_d is None or mean_d < min_d:
                label = class_label
                min_d = mean_d
        return label

    # Distance between two samples
    def distance(self, x1, x2, type="eu"):
        d = None
        if type == "eu":
            x1 = np.asarray(x1, dtype=float)
            x2 = np.asarray(x2, dtype=float)
            d = np.sqrt(np.sum((x1 - x2) ** 2))  # Euclidean distance
        return d

    # Evaluate the model (left unimplemented in the original)
    def evaluate(self, X, Y):
        pass

    # Generate toy training data
    def generate_training_data(self):
        data_str = """class id height(cm) weight(kg) tail(cm)
A GaoXiumin 160 60 0
A ZhaoBenshan 170 70 0
A FanWei 170 70 0
A SongDandan 160 60 0
B SunWukong 120 40 100
B panda 100 100 10
B elephant 300 3000 50
B mouse 10 0.1 10"""
        print(data_str)
        X, Y = [], []
        for line in data_str.split('\n')[1:]:  # skip the header row
            data = line.split()
            Y.append(data[0])                        # class label
            X.append([float(v) for v in data[2:]])   # numeric features
        return X, Y


if __name__ == '__main__':
    M = KNN()
    train_X, train_Y = M.generate_training_data()
    print(train_X)
    M.fit(train_X, train_Y)
    xiao_li_zi = np.array([175, 75, 0])  # a new sample to classify
    res = M.predict(xiao_li_zi)
    print(res)  # prints 'A': closest on average to the class-A samples
```

The conclusion emphasizes that kNN is often the first classification algorithm learned in data mining or machine learning courses, valued for its intuitive nature, ease of implementation, and pedagogical usefulness despite being less efficient than more advanced methods.

Tags: machine learning, feature engineering, classification, kNN, Euclidean distance
Written by

Python Programming Learning Circle

A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
