Understanding Probability Distributions: From Gaussian Curves to Bayesian Modeling
This article explains the concept of probability distributions, describes the Gaussian (normal) distribution with its parameters, demonstrates how to visualize it using Python code, and discusses random variables, independence, and real‑world examples such as atmospheric CO₂ time‑series data.
Probability Distribution
A probability distribution is a mathematical object that describes the likelihood of different events occurring, typically over a set that represents all possible outcomes.
In statistics, this can be understood as: data are generated from a probability distribution with unknown parameters. Since the exact parameters are unknown, we use Bayes' theorem to infer them from the observed data. Probability distributions form the foundation of Bayesian models, and combining different distributions yields useful, expressive models.
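As a concrete sketch of this inference step (my own illustration, not an example from the book): with a Gaussian likelihood whose σ is assumed known, Bayes' theorem can be evaluated numerically on a grid of candidate μ values.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
data = rng.normal(loc=1.0, scale=1.0, size=50)  # observations with true mu = 1

mu_grid = np.linspace(-5, 5, 1001)              # candidate values for mu
prior = stats.norm(0, 10).pdf(mu_grid)          # weak Gaussian prior over mu
# log-likelihood of the data for each candidate mu (sigma assumed known = 1)
log_like = np.sum(stats.norm(mu_grid[:, None], 1.0).logpdf(data), axis=1)
posterior = prior * np.exp(log_like - log_like.max())
dx = mu_grid[1] - mu_grid[0]
posterior /= posterior.sum() * dx               # normalize to a density

mu_map = mu_grid[np.argmax(posterior)]          # posterior mode estimate of mu
```

With a weak prior and 50 observations, the posterior mode lands close to the sample mean, as expected.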
The most common probability distribution is the Gaussian (normal) distribution, whose probability density function is:

p(x | μ, σ) = 1 / (σ√(2π)) · exp(−(x − μ)² / (2σ²))
In the formula, μ and σ are the two parameters of the Gaussian distribution. The first parameter μ is the mean (also the median and mode) and can take any real value; the second parameter σ is the standard deviation, measuring dispersion, and must be positive. Because μ and σ can take infinitely many values, there are infinitely many Gaussian instances. Although the formula is concise, it may not be intuitive, so we can use Python code to illustrate it.
<code>import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

mu_params = [-1, 0, 1]
sd_params = [0.5, 1, 1.5]
x = np.linspace(-7, 7, 100)
f, ax = plt.subplots(len(mu_params), len(sd_params),
                     sharex=True, sharey=True, figsize=(20, 10))
for i in range(3):
    for j in range(3):
        mu = mu_params[i]
        sd = sd_params[j]
        y = stats.norm(mu, sd).pdf(x)  # density of N(mu, sd) over the grid
        ax[i, j].plot(x, y)
        # invisible point used only to show the parameter values in the legend
        ax[i, j].plot(0, 0,
                      label=r'$\mu$={:3.2f}'.format(mu) + '\n'
                            + r'$\sigma$={:3.2f}'.format(sd),
                      alpha=0.5)
        ax[i, j].legend(fontsize=12)
ax[2, 1].set_xlabel('$x$', fontsize=16)
ax[1, 0].set_ylabel('$y$', fontsize=16)
plt.tight_layout()
</code>

Variables generated from a probability distribution (e.g., ...) are called random variables. Although a random variable can take many values, how likely each value is to be observed is constrained by its distribution; a variable x whose randomness follows a Gaussian with parameters μ and σ can be written as:

x ∼ N(μ, σ)

where the symbol ∼ denotes "is distributed as".
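To see this constraint in action, we can draw samples from a specific Gaussian with the same SciPy API used above (a small sketch of my own) and check that they concentrate around μ:

```python
import numpy as np
from scipy import stats

mu, sd = 0.0, 1.0
# draw 10,000 samples from N(0, 1)
samples = stats.norm(mu, sd).rvs(size=10_000, random_state=42)

# the samples could in principle take any real value, but the distribution
# constrains them: roughly 68% fall within one standard deviation of the mean
within_one_sd = np.mean(np.abs(samples - mu) < sd)
```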
Random variables are of two types: continuous and discrete. Continuous random variables can take any value within an interval (represented by Python floating‑point numbers), while discrete random variables can only take specific values (represented by integers).
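The practical difference shows up in how SciPy evaluates them (my own illustration): a continuous variable has a probability *density* function, while a discrete one has a probability *mass* function defined only on specific values.

```python
from scipy import stats

# continuous: a Gaussian has a probability density function (pdf)
continuous = stats.norm(0, 1)
density_at_zero = continuous.pdf(0.0)  # a density, not a probability

# discrete: a Poisson variable has a probability mass function (pmf) on integers
discrete = stats.poisson(mu=2.0)
prob_of_two = discrete.pmf(2)          # an actual probability
prob_of_half = discrete.pmf(2.5)       # 0.0: non-integer values are impossible
```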
Many models assume that the observed samples are drawn independently from the same distribution, i.e., that they are independent and identically distributed (i.i.d.). Mathematically, two random variables x and y are independent if, for every pair of values, the joint probability p(x, y) equals the product of the marginals p(x)p(y).
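This factorization can be checked empirically by simulation (a sketch of my own, using two independent fair coin flips):

```python
import numpy as np

rng = np.random.default_rng(0)
# two independent fair coin flips, encoded as 0/1
x = rng.integers(0, 2, size=100_000)
y = rng.integers(0, 2, size=100_000)

p_x1 = np.mean(x == 1)                  # marginal P(x = 1)
p_y1 = np.mean(y == 1)                  # marginal P(y = 1)
p_joint = np.mean((x == 1) & (y == 1))  # joint P(x = 1, y = 1)

# for independent variables the joint factorizes into the marginals,
# so p_joint should be close to p_x1 * p_y1
```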
Time series are a typical example that does not satisfy the i.i.d. assumption. In time‑series data, the temporal dimension requires special attention. The example below shows CO₂ concentration data from CDIAC spanning 1959 to 1997.
Each point in the figure represents a monthly measurement of atmospheric CO₂; the data exhibit seasonal growth and a long‑term upward trend.
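The data file itself is not reproduced here, but the structure of such a series can be sketched with synthetic monthly values (trend plus seasonal cycle, my own illustration); consecutive observations are strongly correlated, which is exactly what violates the i.i.d. assumption.

```python
import numpy as np

rng = np.random.default_rng(1)
months = np.arange(12 * 39)                       # monthly points, 1959-1997
trend = 315 + 0.11 * months                       # slow upward drift (ppm-like)
seasonal = 3.0 * np.sin(2 * np.pi * months / 12)  # yearly cycle
noise = rng.normal(0, 0.5, size=months.size)
co2 = trend + seasonal + noise

# lag-1 autocorrelation: near 0 for i.i.d. data, near 1 for a trending series
lag1 = np.corrcoef(co2[:-1], co2[1:])[0, 1]
```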
Reference: Osvaldo Martin, "Bayesian Analysis with Python"
Model Perspective
Insights, knowledge, and enjoyment from a mathematical modeling researcher and educator. Hosted by Haihua Wang, a modeling instructor and author of "Clever Use of Chat for Mathematical Modeling", "Modeling: The Mathematics of Thinking", "Mathematical Modeling Practice: A Hands‑On Guide to Competitions", and co‑author of "Mathematical Modeling: Teaching Design and Cases".