
Calculating Common Classification Evaluation Metrics Using Confusion Matrix with sklearn, TensorFlow, and Manual Methods

This tutorial explains how to compute accuracy, precision, recall, F1‑score, and ROC‑AUC from a confusion matrix using sklearn, TensorFlow, and hand‑crafted Python code, illustrating each metric with example data and visualizations.


Classification Evaluation Metrics

Continuing from the previous article on confusion-matrix visualization, this article demonstrates how to compute common evaluation metrics (accuracy, precision, recall, F1-score, and ROC-AUC) using three approaches: sklearn, TensorFlow, and manual calculations based on a hand-crafted confusion matrix.

Imports

import numpy as np
import pandas as pd
import sklearn.metrics
import tensorflow as tf
from matplotlib import pyplot as plt
from sklearn.datasets import make_classification
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, auc, accuracy_score, RocCurveDisplay

Accuracy

Accuracy measures the proportion of correctly classified samples among all samples; it can be misleading on imbalanced data.

sklearn.metrics.accuracy_score

# set prediction results
pred = [0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5]
# set true labels
true = [0, 1, 2, 3, 1, 5, 0, 1, 2, 3, 1, 5, 0, 1, 2, 3, 4, 5]
accuracy = sklearn.metrics.accuracy_score(y_true=true, y_pred=pred)
print(accuracy)

Output: 0.8888888888888888

tf.keras.metrics.Accuracy

accuracy = tf.keras.metrics.Accuracy()
accuracy.update_state(y_true=true, y_pred=pred)
print(accuracy.result().numpy())

Output: 0.8888889
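
As a cross-check, the same accuracy can be computed by hand as the fraction of matching predictions; a minimal NumPy sketch using the lists above:

```python
import numpy as np

pred = [0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5]
true = [0, 1, 2, 3, 1, 5, 0, 1, 2, 3, 1, 5, 0, 1, 2, 3, 4, 5]

# accuracy = number of matching positions / total samples (16 of 18 here)
accuracy = np.mean(np.array(pred) == np.array(true))
print(accuracy)  # 0.8888888888888888
```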

Precision

Precision evaluates, for a specific class, how many predicted positives are truly positive.

sklearn.metrics.precision_score

precision = sklearn.metrics.precision_score(y_true=true, y_pred=pred, average='macro')
print(precision)

Output: 0.8888888888888888

tf.keras.metrics.Precision

precision = tf.keras.metrics.Precision()
precision.update_state(y_true=tf.one_hot(true, 6), y_pred=tf.one_hot(pred, 6))
print(precision.result().numpy())

Output: 0.8888889
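
Note that tf.keras.metrics.Precision on one-hot inputs computes a single micro-averaged precision over all entries rather than sklearn's macro average; on this data the two happen to coincide. For reference, a manual macro-precision sketch (mean of per-class precisions):

```python
import numpy as np

pred = np.array([0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5])
true = np.array([0, 1, 2, 3, 1, 5, 0, 1, 2, 3, 1, 5, 0, 1, 2, 3, 4, 5])

per_class = []
for c in range(6):
    predicted_c = pred == c                 # samples predicted as class c
    tp = np.sum(predicted_c & (true == c))  # of those, actually class c
    per_class.append(tp / np.sum(predicted_c))
print(np.mean(per_class))  # macro precision, approx. 0.8889
```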

Recall

Recall (sensitivity) measures, for a specific class, how many actual positives are correctly identified.

sklearn.metrics.recall_score

recall = sklearn.metrics.recall_score(y_true=true, y_pred=pred, average='macro')
print(recall)

Output: 0.9333333333333333 (the per-class recalls are [1, 0.6, 1, 1, 1, 1]; class 1 has two missed samples)

tf.keras.metrics.Recall

recall = tf.keras.metrics.Recall()
recall.update_state(y_true=tf.one_hot(true, 6), y_pred=tf.one_hot(pred, 6))
print(recall.result().numpy())

Output: 0.8888889
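
Worth noting: tf.keras.metrics.Recall on one-hot inputs yields a micro-averaged recall over all entries (16/18 here), which is not the same quantity as sklearn's average='macro'. A manual macro-recall sketch (mean of per-class recalls):

```python
import numpy as np

pred = np.array([0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5])
true = np.array([0, 1, 2, 3, 1, 5, 0, 1, 2, 3, 1, 5, 0, 1, 2, 3, 4, 5])

per_class = []
for c in range(6):
    actual_c = true == c                 # samples whose true label is c
    tp = np.sum(actual_c & (pred == c))  # of those, correctly predicted
    per_class.append(tp / np.sum(actual_c))
print(np.mean(per_class))  # macro recall, approx. 0.9333
```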

F1‑Score

F1‑score is the harmonic mean of precision and recall, providing a single measure of a test's accuracy.

sklearn.metrics.f1_score

f1 = sklearn.metrics.f1_score(y_true=true, y_pred=pred, average='macro')
print(f1)

Output: 0.875

tf.keras.metrics.F1Score

f1 = tf.keras.metrics.F1Score(average='macro')
f1.update_state(y_true=tf.one_hot(true, 6), y_pred=tf.one_hot(pred, 6))
print(f1.result().numpy())

Output: 0.875
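
The macro F1 can likewise be reproduced by hand as the mean of the per-class harmonic means of precision and recall:

```python
import numpy as np

pred = np.array([0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5])
true = np.array([0, 1, 2, 3, 1, 5, 0, 1, 2, 3, 1, 5, 0, 1, 2, 3, 4, 5])

f1s = []
for c in range(6):
    tp = np.sum((pred == c) & (true == c))
    p = tp / np.sum(pred == c)       # per-class precision
    r = tp / np.sum(true == c)       # per-class recall
    f1s.append(2 * p * r / (p + r))  # harmonic mean of p and r
print(np.mean(f1s))  # 0.875
```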

Manual Calculations from a Hand‑Crafted Confusion Matrix

The same metrics can be derived by building a confusion matrix with sklearn.metrics.confusion_matrix, extracting TP, TN, FP, and FN for each class, and computing each metric by hand.

Construct Confusion Matrix

cm = sklearn.metrics.confusion_matrix(y_true=true, y_pred=pred)
print(cm)
# total number of samples
total = np.sum(cm)
classes_list = []
for i in range(len(cm)):
    TP = cm[i, i]                  # predicted as class i, actually class i
    FP = np.sum(cm[:, i]) - TP     # predicted as i, actually another class
    FN = np.sum(cm[i, :]) - TP     # actually i, predicted as another class
    TN = total - TP - FP - FN      # everything else
    classes_list.append({i: {'tp': TP, 'tn': TN, 'fp': FP, 'fn': FN}})
print(classes_list)

Using the extracted values, each metric follows directly from its definition: accuracy = (TP + TN) / total, precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 is the harmonic mean of precision and recall; the macro values are the means of the per-class results.
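
A self-contained sketch of those manual computations, building the matrix with plain NumPy (following sklearn's convention of rows = true labels, columns = predictions):

```python
import numpy as np

pred = [0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5]
true = [0, 1, 2, 3, 1, 5, 0, 1, 2, 3, 1, 5, 0, 1, 2, 3, 4, 5]

# build the confusion matrix: rows = true labels, columns = predictions
cm = np.zeros((6, 6), dtype=int)
for t, p in zip(true, pred):
    cm[t, p] += 1
total = cm.sum()

precisions, recalls, f1s = [], [], []
for i in range(6):
    TP = cm[i, i]
    FP = cm[:, i].sum() - TP   # predicted as i, actually another class
    FN = cm[i, :].sum() - TP   # actually i, predicted as another class
    p = TP / (TP + FP)
    r = TP / (TP + FN)
    precisions.append(p)
    recalls.append(r)
    f1s.append(2 * p * r / (p + r))

# macro averages match the sklearn results above
print(np.mean(precisions), np.mean(recalls), np.mean(f1s))
```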

ROC‑AUC Curve

ROC‑AUC evaluates the trade‑off between true‑positive rate and false‑positive rate across thresholds. The article generates a synthetic multi‑class dataset with make_classification , trains a One‑Vs‑Rest logistic regression model, and plots ROC curves for each class.

Data Generation and Model Training

n_classes = 6
x, y = make_classification(n_samples=1000, n_features=32, n_informative=16, n_classes=n_classes, class_sep=2)
y = label_binarize(y, classes=range(n_classes))
train_x, valid_x, train_y, valid_y = train_test_split(x, y, test_size=0.3)
model = OneVsRestClassifier(LogisticRegression())
output = model.fit(train_x, train_y).decision_function(valid_x)
pred = model.predict(valid_x)
print('Accuracy:', accuracy_score(valid_y, pred))

The ROC curve is plotted by computing fpr , tpr , and auc for each class and using RocCurveDisplay to visualize them.

Plotting ROC Curves

fpr = {}
tpr = {}
roc_auc = {}
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(valid_y[:, i], output[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])

fig, ax = plt.subplots()
ax.plot([0, 1], [0, 1], '--', linewidth=2)  # chance-level diagonal
for i in range(n_classes):
    display = RocCurveDisplay(fpr=fpr[i], tpr=tpr[i], roc_auc=roc_auc[i], estimator_name=f'Class {i}')
    display.plot(ax=ax)

plt.title('ROC-AUC')
plt.show()

The resulting plot shows each class's ROC curve and its AUC value, illustrating model performance across thresholds.
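
To see where the fpr/tpr points and the AUC come from, here is a manual sketch on a small hand-made binary example (the labels and scores are illustrative, not taken from the model above): thresholds are swept from high to low, each threshold yields one ROC point, and the AUC is the trapezoidal area under the resulting piecewise-linear curve.

```python
import numpy as np

# toy binary example: true labels and decision scores (illustrative values)
y = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.3, 0.7])

# sweep thresholds from high to low; each yields one (fpr, tpr) point
thresholds = np.sort(np.unique(scores))[::-1]
fpr, tpr = [0.0], [0.0]
for t in thresholds:
    pred_pos = scores >= t
    tpr.append(np.sum(pred_pos & (y == 1)) / np.sum(y == 1))
    fpr.append(np.sum(pred_pos & (y == 0)) / np.sum(y == 0))

# AUC = area under the ROC curve via the trapezoidal rule
auc_manual = 0.0
for i in range(1, len(fpr)):
    auc_manual += (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
print(auc_manual)  # approx. 0.8889 for this toy example
```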

Written by

Rare Earth Juejin Tech Community

Juejin, a tech community that helps developers grow.
