Explaining Image Recognition: Logistic Regression and Convolutional Neural Networks
This article introduces the principles of image recognition, compares traditional logistic regression with convolutional neural networks, demonstrates their implementation using Python code, visualizes model weights, and explains key concepts such as padding, convolution, pooling, receptive fields, and multi‑layer feature extraction.
1. Introduction
There are many image‑recognition methods, including traditional logistic regression, AdaBoost, convolutional neural networks (CNN), and Transformers. Modern algorithms have surpassed human accuracy, but understanding how they achieve recognition remains a challenge. This article studies the interpretability of two algorithms—logistic regression and CNN—to explain image‑recognition principles.
2. Logistic Regression
Logistic regression is a simple linear model that is efficient for basic tasks and highly interpretable, making it a good starting point for discussing image‑recognition principles.
2.1 Logistic Regression Principle
Logistic regression adds a sigmoid function on top of linear regression. The model is expressed as:

y = σ(W·X + b), where σ(z) = 1 / (1 + e^(−z))

Here X and W are vectors, and the output y lies in [0, 1]; values ≥ 0.5 are classified as class 1, otherwise class 0. Training aims to find the optimal W and b.
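As a minimal numerical sketch (the weights, bias, and input below are made-up illustrative values, not learned parameters), the forward pass is just a dot product, a bias, and a sigmoid:

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Toy weights, bias, and input vector (illustrative values only)
W = np.array([0.5, -0.2, 0.1])
b = 0.3
X = np.array([1.0, 2.0, 3.0])

y = sigmoid(W @ X + b)            # predicted probability of class 1
label = 1 if y >= 0.5 else 0      # threshold at 0.5
print(round(y, 4), label)
```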
For image classification, X is the image flattened into a one‑dimensional vector. This flattening discards spatial information, a limitation addressed by CNNs.
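To make the flattening step concrete, here is a small sketch with a toy 8×8 array standing in for an image:

```python
import numpy as np

img = np.arange(64).reshape(8, 8)   # toy 8x8 "image"
x = img.flatten()                   # 1-D vector of 64 pixel values
print(x.shape)                      # (64,)
```

Any two pixels that were vertical neighbours in `img` end up 8 positions apart in `x`, which is exactly the spatial information that flattening discards.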
2.2 Logistic Regression Implementation
Scikit‑learn provides an implementation. The following code trains a logistic‑regression model on the 8×8 digit dataset (digits 0 and 1):
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_digits

X, y = load_digits(n_class=2, return_X_y=True)
lr = LogisticRegression()
lr.fit(X, y)

After training, the weight matrix and bias can be inspected:
print(lr.coef_.shape)
print(lr.intercept_.shape)
# Output
# (1, 64)
# (1,)

The number of weights matches the number of image pixels.
2.3 Logistic Regression Image‑Classification Principle
The weight vector W aligns with each pixel, indicating its contribution to the classification. Pixels belonging to class 1 receive positive weights, while those of class 0 receive negative weights.
An illustrative example uses two synthetic classes with distinct white‑region locations. After training, the weight distribution reflects these spatial patterns, which can be reshaped to the original image shape for visualization.
Reshaping and displaying the coefficients (matplotlib must be imported first):

import matplotlib.pyplot as plt

img = lr.coef_.reshape((8, 8))
plt.imshow(img)
plt.show()

The resulting heatmap shows higher weights in regions corresponding to digit 1 and lower weights where digit 0 appears.
3. Convolutional Neural Networks
3.1 CNN Overview
Compared with logistic regression, CNNs are more complex but can also be explained. A CNN extracts features through convolutional layers and then classifies them, similar to logistic regression.
A single convolution operation consists of three steps: Padding, Convolution, and Pooling.
3.1.1 Padding
Padding adds a border of zeros around the image so that spatial dimensions are preserved after convolution. For example, with a one-pixel border, a 5×5 image becomes 7×7.
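A sketch of zero padding with NumPy's `np.pad`, adding a one-pixel border of zeros around a 5×5 toy image:

```python
import numpy as np

img = np.arange(25, dtype=float).reshape(5, 5)   # a 5x5 toy image
# Add a one-pixel border of zeros on every side
padded = np.pad(img, pad_width=1, mode='constant', constant_values=0)
print(img.shape, '->', padded.shape)             # (5, 5) -> (7, 7)
```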
3.1.2 Convolution
The convolution kernel (a learned matrix) slides over the image, computing dot products at each position to produce a feature map. Larger dot‑product values indicate higher similarity between the kernel and the local region.
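The sliding dot product can be sketched with plain loops (no padding, stride 1; real frameworks use optimized implementations, but the arithmetic is the same):

```python
import numpy as np

def conv2d(img, kernel):
    # Slide the kernel over every valid position and take the dot product
    kh, kw = kernel.shape
    oh = img.shape[0] - kh + 1
    ow = img.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

# Toy image with two bright 2x2 blocks on the diagonal
img = np.array([[1, 1, 0, 0],
                [1, 1, 0, 0],
                [0, 0, 1, 1],
                [0, 0, 1, 1]], dtype=float)
kernel = np.ones((2, 2))     # an all-ones 2x2 "blob detector"
print(conv2d(img, kernel))
```

The largest values (4) appear exactly where the kernel overlaps a solid bright block, i.e. where the local region is most similar to the kernel.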
3.1.3 Pooling
Pooling (e.g., MaxPooling) reduces spatial resolution by selecting the maximum value within a region, providing translation invariance and reducing computational cost.
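A sketch of 2×2 MaxPooling with stride 2, implemented by reshaping (this simple version assumes the input dimensions are divisible by the pool size):

```python
import numpy as np

def max_pool(fm, size=2):
    h, w = fm.shape
    # Group pixels into size x size blocks, then take each block's maximum
    return fm.reshape(h // size, size, w // size, size).max(axis=(1, 3))

fm = np.array([[1, 3, 2, 0],
               [4, 2, 1, 5],
               [0, 1, 7, 2],
               [3, 6, 2, 4]])
print(max_pool(fm))   # each 2x2 block collapses to its maximum
```

Shifting a feature by one pixel inside a pooling window leaves the pooled output unchanged, which is where the translation invariance comes from.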
3.2 Receptive Field
Small kernels (e.g., 3×3) capture low‑level features such as edges. Stacking convolution and pooling layers enlarges the receptive field, enabling the network to recognize larger, more complex patterns.
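One way to see the receptive field grow is to track it layer by layer; the sketch below uses the standard kernel-size/stride recurrence (an illustration, not code from the article):

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs, first layer first."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump   # each layer widens the field by (k-1) input steps
        jump *= s              # striding spreads subsequent steps further apart
    return rf

# A single 3x3 conv sees 3 input pixels per axis...
print(receptive_field([(3, 1)]))                    # 3
# ...but two 3x3 convs plus a 2x2 stride-2 pool see 6
print(receptive_field([(3, 1), (3, 1), (2, 2)]))    # 6
```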
3.3 CNN Digit Recognition Example
An example with digits 1 and 2 demonstrates how multiple convolution kernels generate distinct feature maps. The first‑layer kernels produce five feature maps for each digit; the second layer combines these maps with a multi‑channel kernel to form higher‑level representations.
For digit 1, the second‑layer output might be:
[[2, 6, 3, 4, 4],
 [4, 0, 3, 3, 4]]

For digit 2, the output could be:
[[2, 4, 3, 3, 4],
 [4, 0, 3, 4, 4]]

The sum of elements in each vector indicates how strongly the input resembles the corresponding digit, and a final fully connected layer (akin to logistic regression) makes the classification.
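Using the example numbers above (hypothetical values from the article's illustration), summing each row gives a crude per-digit evidence score that a final fully connected layer could then weigh:

```python
import numpy as np

# Hypothetical second-layer outputs for the two inputs
digit1_maps = np.array([[2, 6, 3, 4, 4],
                        [4, 0, 3, 3, 4]])
digit2_maps = np.array([[2, 4, 3, 3, 4],
                        [4, 0, 3, 4, 4]])

print(digit1_maps.sum(axis=1))   # row sums for the digit-1 input
print(digit2_maps.sum(axis=1))   # row sums for the digit-2 input
```

In a real network these raw sums are not compared directly; the fully connected layer learns weights over them, exactly as logistic regression learns weights over pixels.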
By extending the depth of the network, CNNs can recognize far more complex objects such as cats and dogs.