
Master Gradient Descent: From Intuition to Advanced Variants

This comprehensive guide covers the mathematical foundations, intuitive explanation, algorithmic steps, tuning strategies, and variants of gradient descent, compares it with other optimization methods, and illustrates its use in machine-learning models such as linear regression.


When solving for the model parameters of a machine-learning algorithm (i.e., an unconstrained optimization problem), gradient descent is one of the most commonly used methods, alongside the least-squares method. This article provides a complete summary of gradient descent.

Gradient

The gradient of a multivariate function is the vector of its partial derivatives. For a function f(x), the gradient ∇f points in the direction of the steepest increase of the function; the opposite direction –∇f points to the steepest decrease, which is useful for finding minima.
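As a concrete check on this definition, the sketch below (my own illustration, not from the original article; it assumes NumPy is available) approximates a gradient with central finite differences and compares it to the analytic partial derivatives:

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Approximate the gradient of f at x with central differences, one coordinate at a time."""
    grad = np.zeros_like(x, dtype=float)
    for i in range(len(x)):
        step = np.zeros_like(x, dtype=float)
        step[i] = h
        grad[i] = (f(x + step) - f(x - step)) / (2 * h)
    return grad

# f(x, y) = x² + 3y² has analytic gradient (2x, 6y)
f = lambda v: v[0] ** 2 + 3 * v[1] ** 2
print(numerical_gradient(f, np.array([1.0, 2.0])))  # ≈ [2. 12.]
```

The negative of this vector, −∇f, is the direction gradient descent steps in.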

Gradient Descent and Gradient Ascent

In machine learning, minimizing a loss function is done by iteratively applying gradient descent, while maximizing an objective uses gradient ascent. The two procedures are interchangeable: minimizing f(θ) by gradient descent is equivalent to maximizing −f(θ) by gradient ascent.

Gradient Descent Algorithm Details

Intuitive Explanation

Imagine standing on a mountain and wanting to reach the foot. At each position you compute the gradient and take a step opposite to it—the steepest downhill direction. Repeating this leads to a local minimum; if the loss surface is convex, the solution is globally optimal.

Related Concepts

Learning rate : the step size taken along the negative gradient at each iteration.

Feature : an input variable of a sample (e.g., x₁, x₂ for a two‑feature example).

Hypothesis function : the model used to fit the inputs, denoted h(·). For a single‑feature linear model, h(x)=θ₀+θ₁x.

Loss function : a metric that measures how well the model fits the data; for linear regression it is usually the sum of squared errors.
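These concepts fit together in a few lines of Python (an illustrative sketch of my own; I use the common ½ factor on the squared-error loss, a convention that simplifies the derivative):

```python
def h(theta0, theta1, x):
    """Single-feature linear hypothesis h(x) = θ0 + θ1·x."""
    return theta0 + theta1 * x

def loss(theta0, theta1, xs, ys):
    """Half the sum of squared errors over the training set."""
    return 0.5 * sum((h(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys))

xs, ys = [1, 2, 3], [2, 4, 6]      # samples lying exactly on y = 2x
print(loss(0.0, 2.0, xs, ys))      # perfect fit → 0.0
print(loss(0.0, 1.0, xs, ys) > 0)  # any other line has positive loss → True
```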

Detailed Algorithm

Algebraic Description

Prerequisite: define the hypothesis and loss functions. For linear regression, h(x)=θᵀx and the loss is L(θ)=½∑(h(xᵢ)−yᵢ)², where the factor ½ cancels the 2 produced by differentiation.

Initialize the parameters θ, the termination threshold ε, and the learning rate α. A common naive initialization is θ=0 and α=1.

Iterative process:

1. Compute the gradient of the loss with respect to each parameter.
2. Multiply the gradient by the learning rate to obtain the update step. If the magnitude of every update is smaller than ε, stop; otherwise continue.
3. Update the parameters, θ←θ−α·∇L(θ), and return to step 1.

An explicit linear‑regression example shows how the gradient is derived from all samples, how the update rule simplifies to a constant factor when α and the number of samples are fixed, and how the algorithm converges.
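The algebraic steps above can be sketched as plain Python for the single-feature case (my own illustration, with the ½-scaled squared-error loss; the data, α, and ε are arbitrary illustrative choices):

```python
def gradient_descent(xs, ys, alpha=0.05, eps=1e-8, max_iter=10000):
    """Gradient descent for h(x) = θ0 + θ1·x with L = ½·∑(h(xᵢ) − yᵢ)²."""
    theta0 = theta1 = 0.0
    for _ in range(max_iter):
        # Partial derivatives of L, summed over all samples
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        step0 = alpha * sum(errors)
        step1 = alpha * sum(e * x for e, x in zip(errors, xs))
        if max(abs(step0), abs(step1)) < eps:  # every update below ε → stop
            break
        theta0 -= step0
        theta1 -= step1
    return theta0, theta1

# Samples lying exactly on y = 2x, so the optimum is θ0 = 0, θ1 = 2
print(gradient_descent([1, 2, 3], [2, 4, 6]))  # ≈ (0.0, 2.0)
```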

Matrix Description

The same algorithm can be expressed with vectors and matrices, which is more concise once matrix calculus is known.

Prerequisite: represent the hypothesis as h = Xθ, where X is the data matrix (m samples × n features) and θ is the parameter vector.

Initialize θ, ε, and α.

Iterative process:

1. Compute the gradient: ∇L = Xᵀ(Xθ−y).
2. Update the parameters: θ←θ−α·∇L.
3. Check termination: if ‖α·∇L‖<ε, stop; otherwise return to step 1.
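A minimal NumPy sketch of the matrix form (my own, not from the original article; the first column of X is all ones so that θ₀ acts as the intercept):

```python
import numpy as np

def gd_matrix(X, y, alpha=0.05, eps=1e-8, max_iter=10000):
    """Matrix-form gradient descent with ∇L = Xᵀ(Xθ − y)."""
    theta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        update = alpha * (X.T @ (X @ theta - y))
        theta -= update
        if np.linalg.norm(update) < eps:  # ‖α·∇L‖ < ε → terminate
            break
    return theta

# Toy data on y = 2x; the column of ones lets θ₀ play the intercept
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
print(gd_matrix(X, y))  # ≈ [0. 2.]
```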

Algorithm Tuning

Learning‑rate selection : try several values (large to small) and observe loss reduction; too large may overshoot, too small slows convergence.

Initial parameter values : different starts can lead to different local minima; run the algorithm multiple times with varied initializations.

Feature normalization : scale each feature to have zero mean and unit variance, which speeds up convergence.
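Standardization takes only a couple of lines (an assumed NumPy sketch of my own; in practice, reuse the training mean and standard deviation when transforming new data):

```python
import numpy as np

def standardize(X):
    """Scale each feature (column) to zero mean and unit variance."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma

# Two features on very different scales
X = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_norm, mu, sigma = standardize(X)
print(X_norm.mean(axis=0))  # ≈ [0. 0.]
print(X_norm.std(axis=0))   # ≈ [1. 1.]
```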

Family of Gradient Descent (BGD, SGD, MBGD)

Batch Gradient Descent (BGD)

Uses all m samples to compute the gradient at each iteration; this is the classic form described above.

Stochastic Gradient Descent (SGD)

Computes the gradient from a single randomly chosen sample per iteration. Each update is much cheaper, so the parameters change far more often per pass over the data, but the single-sample gradient has high variance: the iterates oscillate and may never settle exactly at the optimum.

Mini‑batch Gradient Descent (MBGD)

Uses a subset of the data (1 < batch size < m) per iteration, combining the stability of BGD with the speed of SGD. Typical batch sizes are around 10–100, depending on the dataset.
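One sketch (mine, assuming NumPy) covers all three variants: batch_size=1 recovers SGD, batch_size=m recovers BGD, and anything in between is MBGD. The data, learning rate, and epoch count are illustrative choices:

```python
import numpy as np

def mbgd(X, y, batch_size=2, alpha=0.1, epochs=500, seed=0):
    """Mini-batch gradient descent: batch_size=1 is SGD, batch_size=m is BGD."""
    rng = np.random.default_rng(seed)
    m = len(y)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        idx = rng.permutation(m)  # reshuffle the samples every epoch
        for start in range(0, m, batch_size):
            batch = idx[start:start + batch_size]
            # Average gradient over the mini-batch
            grad = X[batch].T @ (X[batch] @ theta - y[batch]) / len(batch)
            theta -= alpha * grad
    return theta

# Toy data lying exactly on y = 2x (first column of ones is the intercept)
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])
print(mbgd(X, y))  # ≈ [0. 2.]
```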

Comparison with Other Unconstrained Optimization Algorithms

Besides gradient descent, other methods include the least‑squares closed‑form solution, Newton’s method, and quasi‑Newton methods. Gradient descent requires a learning rate and is iterative, whereas least‑squares provides an analytical solution but becomes impractical for very large datasets. Newton‑type methods converge faster because they use second‑order information, but each iteration is computationally more expensive.
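To make the contrast concrete, here is the least-squares closed form on a toy dataset (a sketch of my own, assuming NumPy). It needs no learning rate and no iterations, but it requires solving an n×n linear system, which is what becomes impractical for very large feature counts:

```python
import numpy as np

# Toy data: the column of ones models the intercept; y = 2x exactly
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])

# Closed form θ = (XᵀX)⁻¹Xᵀy, computed via a linear solve for stability
theta_ls = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_ls)  # ≈ [0. 2.]
```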

Source: Liu Jianping Pinard Blog (https://www.cnblogs.com/pinard/p/5970503.html)

Written by Model Perspective

Insights, knowledge, and enjoyment from a mathematical modeling researcher and educator. Hosted by Haihua Wang, a modeling instructor and author of "Clever Use of Chat for Mathematical Modeling", "Modeling: The Mathematics of Thinking", "Mathematical Modeling Practice: A Hands‑On Guide to Competitions", and co‑author of "Mathematical Modeling: Teaching Design and Cases".
