
Comprehensive Overview of Machine Learning Model Evaluation Metrics

This article provides a comprehensive summary of machine learning model evaluation metrics, covering accuracy, precision, recall, F1, RMSE, ROC/AUC, KS test, and scoring cards, with explanations, formulas, code examples, and practical considerations for model performance assessment.

DataFunTalk

This article provides a complete summary of machine learning model evaluation metrics. In typical ML pipelines, datasets are split into training and test sets; evaluation metrics determine how we measure model quality for classification, ranking, regression, and sequence‑prediction tasks.

1. Accuracy

Accuracy is the simplest evaluation metric, calculated as the proportion of correctly predicted samples. It suffers from two major drawbacks: when class distribution is imbalanced, the metric is dominated by the majority class, and it gives a coarse view that may ignore the performance on a specific class of interest.

Related metric: Error Rate – the proportion of mis‑classified samples (Error Rate = 1 − Accuracy).

from sklearn.metrics import accuracy_score

y_pred = [0, 0, 1, 1]
y_true = [1, 0, 1, 0]
accuracy_score(y_true, y_pred)  # 0.5
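To make the class-imbalance drawback concrete, here is a small illustration on hypothetical data of our own:

```python
from sklearn.metrics import accuracy_score

# Hypothetical 95/5 class split: a model that always predicts the majority
# class scores high accuracy while completely missing the positive class.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100
accuracy_score(y_true, y_pred)  # 0.95, yet not a single positive is found
```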

2. Precision, Recall and F1

Precision (also called positive predictive value) measures the proportion of predicted positive samples that are truly positive.

Recall (also called sensitivity or true positive rate) measures the proportion of actual positive samples that are correctly predicted.

In ranking problems, Precision@N and Recall@N are computed on the top‑N results.
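A minimal sketch of Precision@N (the helper name and toy data below are our own, not from a standard library):

```python
from typing import List

def precision_at_n(y_pred_prob: List[float], y_true: List[int], n: int) -> float:
    # Rank items by predicted score (descending), then measure precision
    # over the top-n ranked items only.
    ranked = sorted(zip(y_pred_prob, y_true), key=lambda pair: pair[0], reverse=True)
    return sum(label for _, label in ranked[:n]) / n

precision_at_n([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0], n=2)  # 0.5
```

Recall@N is computed analogously, dividing the number of positives in the top-N by the total number of positives.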

Precision and recall are often inversely related; improving one usually lowers the other. They are tightly linked to the confusion matrix:

                   Predicted Positive      Predicted Negative
Actual Positive    TP (True Positive)      FN (False Negative)
Actual Negative    FP (False Positive)     TN (True Negative)

From the confusion matrix we obtain:

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

F1, the harmonic mean of precision and recall, combines the two into a single number:

F1 = 2 × Precision × Recall / (Precision + Recall)
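Assuming scikit-learn is available, these quantities can be computed directly (the toy labels below are our own):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy example (our own): TP = 2, FP = 1, FN = 1
y_true = [1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
precision_score(y_true, y_pred)  # 2/3
recall_score(y_true, y_pred)     # 2/3
f1_score(y_true, y_pred)         # 2/3 (harmonic mean of equal P and R)
```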

The Precision‑Recall (PR) curve plots Recall on the x‑axis and Precision on the y‑axis. To draw it, sort the model’s predicted probabilities in descending order, treat each probability as a classification threshold, compute the precision and recall at that threshold, and plot the resulting points.

from typing import List, Tuple
import matplotlib.pyplot as plt

def get_confusion_matrix(y_pred: List[int], y_true: List[int]) -> Tuple[int, int, int, int]:
    assert len(y_pred) == len(y_true)
    tp = fp = fn = tn = 0
    for i in range(len(y_pred)):
        if y_pred[i] == y_true[i] == 1:
            tp += 1
        elif y_pred[i] == y_true[i] == 0:
            tn += 1
        elif y_pred[i] == 1 and y_true[i] == 0:
            fp += 1
        elif y_pred[i] == 0 and y_true[i] == 1:
            fn += 1
    return tp, fp, tn, fn

def calc_p(tp: int, fp: int) -> float:
    # Guard against division by zero when nothing is predicted positive
    return tp / (tp + fp) if tp + fp else 0.0

def calc_r(tp: int, fn: int) -> float:
    # Guard against division by zero when there are no actual positives
    return tp / (tp + fn) if tp + fn else 0.0

def get_pr_pairs(y_pred_prob: List[float], y_true: List[int]) -> Tuple[List[float], List[float]]:
    # Treat each predicted probability as a threshold; the anchor points
    # (r=0, p=1) and (r=1, p=0) close the curve at both ends.
    ps, rs = [1], [0]
    for prob1 in y_pred_prob:
        y_pred_i = [1 if prob2 >= prob1 else 0 for prob2 in y_pred_prob]
        tp, fp, tn, fn = get_confusion_matrix(y_pred_i, y_true)
        ps.append(calc_p(tp, fp))
        rs.append(calc_r(tp, fn))
    ps.append(0)
    rs.append(1)
    return ps, rs

y_pred_prob = [0.9, 0.8, 0.7, 0.6, 0.55, 0.54, 0.53, 0.52, 0.51, 0.505, 0.4, 0.39, 0.38, 0.37, 0.36, 0.35, 0.34, 0.33, 0.3, 0.1]
y_true = [1,1,0,1,1,1,0,0,1,0,1,0,1,0,0,0,1,0,1,0]
ps, rs = get_pr_pairs(y_pred_prob, y_true)
fig, ax = plt.subplots(figsize=(12,5))
ax.plot(rs, ps)

When multiple models are available, each model’s PR curve can be plotted on the same axes. If one curve completely encloses another, the enclosing model is superior. If curves intersect, the area under the curve (AUC) is often used for comparison.
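For PR curves specifically, scikit-learn's average precision is a common single-number summary; a sketch comparing two hypothetical models (scores are our own toy data):

```python
from sklearn.metrics import average_precision_score

# Toy labels and two hypothetical models' scores (our own illustration)
y_true = [1, 1, 0, 1, 0, 0]
y_score_a = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
y_score_b = [0.9, 0.4, 0.7, 0.6, 0.8, 0.2]
ap_a = average_precision_score(y_true, y_score_a)  # 11/12: positives ranked near the top
ap_b = average_precision_score(y_true, y_score_b)  # 0.7: a false positive ranked second hurts
```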

3. RMSE

Root Mean Square Error (RMSE) is primarily used for regression models. It is the square root of the average squared difference between predicted and actual values. RMSE is sensitive to outliers; strategies to mitigate this include removing noisy outliers, rebuilding the model, or using alternative metrics such as Mean Absolute Percentage Error (MAPE).
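A minimal sketch of both metrics (helper names are our own; MAPE assumes no zero targets):

```python
import numpy as np

def rmse(y_true, y_pred):
    # Root Mean Square Error: sqrt of the mean squared residual
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mape(y_true, y_pred):
    # Mean Absolute Percentage Error; undefined when any true value is zero
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)))

rmse([3.0, 5.0, 2.0], [2.0, 5.0, 4.0])  # ≈ 1.29
mape([100.0, 200.0], [110.0, 180.0])    # 0.1
```

Because each error is divided by the true value, MAPE down-weights large absolute errors on large targets, which is one reason it is less outlier-sensitive than RMSE.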

4. ROC and AUC

The Receiver Operating Characteristic (ROC) curve plots the False Positive Rate (FPR) on the x‑axis against the True Positive Rate (TPR) on the y‑axis. The curve is generated similarly to the PR curve by varying the probability threshold.

def calc_fpr(fp: int, tn: int) -> float:
    return fp / (fp + tn)

def calc_tpr(tp: int, fn: int) -> float:
    return tp / (tp + fn)

def get_ftpr_pairs(y_pred_prob: List[float], y_true: List[int]) -> Tuple[List[float], List[float]]:
    # Same threshold sweep as the PR curve; the anchors (0, 0) and (1, 1)
    # pin the ROC curve at its endpoints.
    fprs, tprs = [0], [0]
    for prob1 in y_pred_prob:
        y_pred_i = [1 if prob2 >= prob1 else 0 for prob2 in y_pred_prob]
        tp, fp, tn, fn = get_confusion_matrix(y_pred_i, y_true)
        fprs.append(calc_fpr(fp, tn))
        tprs.append(calc_tpr(tp, fn))
    fprs.append(1)
    tprs.append(1)
    return fprs, tprs

fprs, tprs = get_ftpr_pairs(y_pred_prob, y_true)
fig, ax = plt.subplots(figsize=(12,5))
ax.plot(fprs, tprs)

The Area Under the ROC Curve (AUC) summarizes the curve in a single number: 0.5 corresponds to random guessing and 1 to a perfect classifier (values below 0.5 mean the model ranks negatives above positives, i.e. worse than random). A higher AUC indicates better ranking of positive samples ahead of negatives. Unlike PR curves, ROC curves are relatively stable when the class distribution changes.
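Assuming scikit-learn is available, AUC can be computed directly from predicted scores (toy labels and scores below are our own):

```python
from sklearn.metrics import roc_auc_score

# Toy labels and scores (our own illustration)
y_true = [1, 1, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
roc_auc_score(y_true, y_score)  # 8/9: 8 of 9 positive-negative pairs ranked correctly
```

This highlights AUC's ranking interpretation: it equals the probability that a randomly chosen positive is scored above a randomly chosen negative.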

5. KS (Kolmogorov‑Smirnov) Test

KS measures the maximum distance between the empirical cumulative distribution functions (ECDFs) of two samples. It is widely used in credit‑risk modeling to assess a model’s discriminative power, computed between the score distributions of good and bad customers. As a rough guideline, a KS below about 0.2 indicates poor discrimination, values around 0.3–0.5 are typical of useful models, and values above roughly 0.75 are so high that they usually warrant checking for label leakage rather than being celebrated.

from scipy import stats

rvs1 = stats.norm.rvs(size=200, loc=0., scale=1)
rvs2 = stats.norm.rvs(size=300, loc=0.5, scale=1.5)
stats.ks_2samp(rvs1, rvs2)
# Example run: KS statistic ≈ 0.265, p‑value ≈ 7e‑08 → reject H0 (the two
# distributions differ); exact numbers vary because the samples are random.

For a classifier, KS is equivalent to the maximum vertical distance between the ROC curve and the diagonal, i.e. max(TPR − FPR). A common rule of thumb relates the two summary numbers by AUC ≈ 0.5 + KS / 2, though the exact relationship depends on the shape of the curve.
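Assuming scikit-learn is available, a classifier's KS can be read off the ROC curve as the largest gap between TPR and FPR (toy data below is our own):

```python
import numpy as np
from sklearn.metrics import roc_curve

# Toy labels and scores (our own illustration)
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
fpr, tpr, _ = roc_curve(y_true, y_score)
ks = float(np.max(tpr - fpr))  # 0.5 for this toy data
```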

6. Scorecard Models

Scorecards are linear models often used in financial risk scoring. They provide high feature coverage, stability, and interpretability. Non‑linear features can be handled via Weight of Evidence (WOE) encoding or binning, while interaction effects are captured through segment‑wise modeling.

6.1 Non‑linear Processing

WOE transforms a categorical or binned numeric variable into a continuous value based on the log odds of bad versus good outcomes. Binning groups continuous variables into intervals that exhibit stronger linear relationships with the target.
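A minimal WOE sketch on made-up bin counts (the helper name, data, and the particular sign convention — bad over good — are our own; some references use the reciprocal, which only flips the sign):

```python
import math
from typing import Dict, Tuple

def woe(bin_counts: Dict[str, Tuple[int, int]]) -> Dict[str, float]:
    # bin_counts maps bin name -> (n_good, n_bad); one common convention is
    # WOE_i = ln( (bad_i / total_bad) / (good_i / total_good) )
    total_good = sum(g for g, _ in bin_counts.values())
    total_bad = sum(b for _, b in bin_counts.values())
    return {name: math.log((b / total_bad) / (g / total_good))
            for name, (g, b) in bin_counts.items()}

woe({"low": (80, 20), "mid": (60, 40), "high": (20, 60)})
# The "high" bin, where bads are over-represented, gets a positive WOE.
```

Real implementations also need smoothing for bins with zero goods or zero bads, which this sketch omits.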

6.2 Interaction Features

Customer segmentation can be used to build separate models for each subgroup, effectively creating interaction features.

Thank you for reading.

For more discussions, join the DataFunTalk Machine Learning community.

Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
