How Neural Networks Learn: Gradient Descent and Loss Functions
This article explains how neural networks learn by using labeled training data, describing the role of weights, biases, activation functions, and how gradient descent iteratively adjusts parameters to minimize loss, illustrated with the MNIST digit‑recognition example.
In the previous lesson we examined the structure of neural networks; now we discuss how a network learns by looking at large amounts of labeled training data, using the core idea of gradient descent, which underlies both neural network learning and many other machine‑learning methods.
As a reminder, our goal is to recognize handwritten digits – the classic "Hello World" example for neural networks.
Each digit is rendered on a 28×28 pixel grid, where each pixel holds a grayscale value between 0.0 and 1.0. These 784 values become the activations of the neurons in the network’s input layer, one neuron per pixel.
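As a minimal sketch (using NumPy; the array names are illustrative), turning a 28×28 image into the 784 input activations is just a flatten:

```python
import numpy as np

# Stand-in for a 28x28 grayscale digit image, values in [0.0, 1.0]
image = np.random.rand(28, 28)

# Flatten into the 784 input-layer activations, one neuron per pixel
input_activations = image.flatten()
print(input_activations.shape)  # (784,)
```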
Neuron activations in each layer are computed as the weighted sum of all activations from the previous layer plus a bias term, followed by a non‑linear function such as sigmoid or ReLU.
Values propagate from one layer to the next, entirely determined by weights and biases. In the output layer, the brightest of the ten neurons indicates the network’s chosen digit.
The input‑layer activations are based on pixel values; the remaining layers depend on the activations of the preceding layer.
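The propagation rule above can be sketched as follows. This is a hedged illustration, not the article's exact implementation: the layer sizes match the 784 → 16 → 16 → 10 network described here, the weights and biases are random, and sigmoid is chosen as the non-linearity.

```python
import numpy as np

def sigmoid(z):
    """Squash a weighted sum into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(activations, weights, biases):
    """Each layer's activations: sigmoid(W @ a + b) of the previous layer."""
    for W, b in zip(weights, biases):
        activations = sigmoid(W @ activations + b)
    return activations

# Illustrative network: 784 -> 16 -> 16 -> 10, randomly initialized
rng = np.random.default_rng(0)
sizes = [784, 16, 16, 10]
weights = [rng.standard_normal((n, m)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(n) for n in sizes[1:]]

output = forward(rng.random(784), weights, biases)
# The brightest of the ten output neurons is the network's chosen digit
predicted_digit = int(np.argmax(output))
```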
With two hidden layers of 16 neurons each, the network contains 13,002 adjustable weights and biases, which fully determine its behavior.
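The 13,002 figure follows directly from the layer sizes; the arithmetic can be checked in a few lines:

```python
# Weights: 784*16 + 16*16 + 16*10 = 12,960; biases: 16 + 16 + 10 = 42
sizes = [784, 16, 16, 10]
n_weights = sum(m * n for m, n in zip(sizes[:-1], sizes[1:]))
n_biases = sum(sizes[1:])
print(n_weights + n_biases)  # 13002
```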
We designed this hierarchical structure so that the second layer might detect edges, the third layer might recognize patterns such as loops and lines, and the final layer combines these patterns to identify the digit.
Our aim is to let the network learn by receiving many labeled examples of handwritten digits and adjusting its 13,002 weights and biases to improve performance on those examples.
The labeled images are called "training data".
We want the learned knowledge to generalize beyond the training set, so after training we test the network on unseen, labeled images and observe its classification accuracy.
The MNIST database provides tens of thousands of labeled handwritten digit images that are freely available.
Although describing a machine as "learning" can be controversial, the process is essentially a calculus exercise: finding the minimum of a specific function.
Cost Function
The network’s behavior is determined by all its weights and biases. Weights represent connection strengths between neurons, while biases indicate a neuron’s tendency to be active.
Initially, all weights and biases are set to random numbers, so the network performs poorly on the training examples.
For example, feeding an image of the digit 3 may produce a chaotic output.
To tell the computer that its performance is bad, we define a loss (cost) function, which essentially says, "No, the output is far from the desired values."
Mathematically, the loss for a single training example is the sum of squared differences between each output activation and its expected value.
If the network confidently classifies an image correctly, the loss is small; if it is clueless, the loss is large.
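A sketch of this per-example cost (the helper name and sample outputs are illustrative): the desired output is a one-hot vector, and the cost is the sum of squared differences against it.

```python
import numpy as np

def cost_single(output, label):
    """Sum of squared differences between the 10 output activations
    and the desired one-hot vector for the correct digit."""
    desired = np.zeros(10)
    desired[label] = 1.0
    return np.sum((output - desired) ** 2)

# A confident, correct classification of a 3 yields a small cost...
confident = np.zeros(10)
confident[3] = 0.99
# ...while a clueless, uniform output yields a large one
clueless = np.full(10, 0.5)

print(cost_single(confident, 3))  # small
print(cost_single(clueless, 3))   # large
```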
Loss Over Multiple Examples
We care about the average loss across all tens of thousands of training examples, which serves as a metric of how badly the network performs.
The network itself is a function with 784 inputs, 10 outputs, and 13,002 parameters.
The loss function takes those 13,002 parameters as input and outputs a single number describing how bad they are, based on the network’s behavior on all labeled training data.
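That averaging step can be sketched like so. Note the assumption being made explicit in code: `predict` stands for the whole network, so the average cost is implicitly a function of all 13,002 weights and biases hidden inside it. The dummy data at the bottom is purely illustrative.

```python
import numpy as np

def average_cost(predict, images, labels):
    """Average per-example cost over the whole training set.
    `predict` maps an image to 10 output activations; it is
    implicitly a function of every weight and bias in the network."""
    total = 0.0
    for image, label in zip(images, labels):
        desired = np.zeros(10)
        desired[label] = 1.0
        total += np.sum((predict(image) - desired) ** 2)
    return total / len(images)

# Dummy "network" that always outputs all zeros: each example
# then costs exactly 1.0 (the squared miss on the hot component)
images = [np.zeros(784)] * 5
labels = [0, 1, 2, 3, 4]
avg = average_cost(lambda img: np.zeros(10), images, labels)
print(avg)  # 1.0
```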
Minimizing the Loss Function
Simply telling the computer that it performs poorly is not enough; we must tell it how to adjust the 13,002 weights and biases to improve.
To build intuition, imagine a simple one‑dimensional loss function that maps a single number to another number.
Finding the minimizing input analytically means solving for the points where the slope is zero. That is sometimes possible for simple functions, but it quickly becomes intractable for complicated ones, let alone a function of 13,002 inputs.
A more flexible strategy is to start from a random point, compute the slope (gradient) at that point, and move in the direction that reduces the loss. If the slope is negative, move right; if positive, move left.
Repeating this step—checking the new slope and moving accordingly—gradually brings us toward a local minimum, analogous to a ball rolling down a hill.
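This "roll downhill" loop can be sketched in one dimension. The toy cost function and step count are illustrative assumptions; only the derivative is needed, exactly as described above.

```python
def gradient_descent_1d(df, x0, eta=0.1, steps=100):
    """Repeatedly step against the slope: x <- x - eta * f'(x).
    A negative slope moves x right; a positive slope moves x left."""
    x = x0
    for _ in range(steps):
        x -= eta * df(x)
    return x

# Toy cost f(x) = (x - 2)^2, whose derivative is f'(x) = 2*(x - 2);
# starting from x = 10, the iterates roll down toward the minimum at 2
minimum = gradient_descent_1d(lambda x: 2 * (x - 2), x0=10.0)
print(round(minimum, 6))  # ~2.0
```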
In higher dimensions the notion of a single slope is replaced by a gradient vector, which points in the direction of steepest ascent. Moving opposite to this vector yields the steepest descent.
The step size is controlled by the learning rate η; a larger η makes bigger steps, which can speed up convergence but also risk overshooting the minimum.
For our 13,002‑dimensional loss function the idea is identical: the negative gradient is a 13,002‑element vector indicating how each weight and bias should be adjusted to most rapidly decrease loss.
Each component of the negative gradient tells us whether to increase or decrease the corresponding weight or bias, and the magnitude indicates how important that parameter is for reducing loss.
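In many dimensions the same update becomes a single vectorized step. This sketch uses a deliberately simple stand-in cost, C(p) = ‖p‖², whose gradient is 2p, just to show the parameter vector shrinking toward the minimum; the real 13,002-dimensional gradient would come from backpropagation.

```python
import numpy as np

def gradient_step(params, grad, eta=0.1):
    """Nudge every parameter opposite the gradient. Each component of
    -grad says whether to raise or lower that weight or bias, and its
    magnitude says how much that parameter matters for reducing cost."""
    return params - eta * grad

# Illustrative: minimize C(p) = ||p||^2 over a 13,002-dimensional vector
params = np.ones(13002)
for _ in range(500):
    params = gradient_step(params, 2 * params)  # gradient of ||p||^2 is 2p

# After repeated steps the parameters have collapsed toward the minimum
print(float(np.linalg.norm(params)))  # ~0.0
```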
Efficient computation of the gradient vector is performed by the backpropagation algorithm, which we will explore in the next lesson.
In summary, when we say a network "learns," we mean that it adjusts its weights and biases via gradient descent to minimize the loss function, thereby improving performance on the training data and, ideally, on unseen data as well.
Translated from: https://www.3blue1brown.com/lessons/gradient-descent
Cognitive Technology Team