How Simple Linear Regression Predicts Outcomes: Model, Assumptions, and Evaluation
This article explains the simple linear regression model and its six key assumptions, shows how to evaluate the fit using ANOVA and the coefficient of determination, and covers hypothesis testing and confidence intervals for regression coefficients, with practical examples.
Simple Linear Regression Model
The simple linear regression model is expressed as \(y = \beta_0 + \beta_1 x + \varepsilon\), where x is the independent variable, y is the dependent variable, and \(\varepsilon\) denotes the residual (error) term.
Given a set of sample points, the least‑squares method finds a line with intercept \(\hat{\beta}_0\) and slope \(\hat{\beta}_1\) (the hats indicate estimated values). The intercept is the predicted value of y when x is zero, and the slope is the estimated change in y for a one‑unit increase in x.
The purpose of regression is to predict the dependent variable: once the intercept and slope are estimated, a predicted value of y can be obtained for any given value of x.
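A minimal sketch of this fit-then-predict workflow, using made-up sample data (the least-squares formulas are the standard ones: slope = Sxy/Sxx, intercept = ȳ − slope·x̄):

```python
import numpy as np

# Hypothetical sample data (illustrative only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8, 12.1])

# Least-squares estimates of slope and intercept.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# Predict y for a new value of x using the fitted line.
x_new = 4.5
y_hat = b0 + b1 * x_new
print(round(b0, 3), round(b1, 3), round(y_hat, 3))
```

The same estimates could be obtained with `np.polyfit(x, y, 1)`; the explicit formulas are shown to match the notation in the text.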
Example 1: An analyst regresses a company’s sales‑growth rate on the GDP growth rate, obtaining an intercept of α and a slope of 2. If the national statistical bureau forecasts a GDP growth of g this year, the expected sales‑growth rate can be calculated.
Assumptions of Simple Linear Regression
The model relies on six assumptions:
The independent and dependent variables have a linear relationship.
The expected value of the residuals is zero, meaning residuals are equally likely to be positive or negative around the regression line.
The independent variable and the residuals are uncorrelated.
The residuals have constant variance (homoscedasticity).
The residuals are uncorrelated with each other (no autocorrelation).
The residuals follow a normal distribution.
Analysis of Variance (ANOVA)
After fitting a regression model, ANOVA is used to assess its quality. For simple regression the ANOVA table takes the following general form (n is the sample size):

Source        df       Sum of squares    Mean square
Regression    1        SSR               MSR = SSR / 1
Error         n − 2    SSE               MSE = SSE / (n − 2)
Total         n − 1    SST
From the ANOVA table we obtain the coefficient of determination (R²) and the standard error of estimate, which indicate model fit. The degrees of freedom for regression equal the number of independent variables (1 in simple regression); the error degrees of freedom equal the sample size minus the number of estimated parameters (n − 2 in simple regression); total degrees of freedom are the sum of the two (n − 1).
Total sum of squares (SST) represents overall variability; regression sum of squares (SSR) represents variability explained by the model; error sum of squares (SSE) represents unexplained variability.
Mean squares are obtained by dividing each sum of squares by its respective degrees of freedom.
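The decomposition above can be computed directly from a fit; with hypothetical data, the identity SST = SSR + SSE holds exactly:

```python
import numpy as np

# Hypothetical sample data (illustrative only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8, 12.1])
n = len(x)

# Least-squares fit.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_fit = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)      # total variability
ssr = np.sum((y_fit - y.mean()) ** 2)  # explained by the regression
sse = np.sum((y - y_fit) ** 2)         # unexplained (residual)

msr = ssr / 1        # regression df = number of predictors = 1
mse = sse / (n - 2)  # error df = n - 2 (two estimated parameters)
print(round(sst, 4), round(ssr, 4), round(sse, 4))
```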
Coefficient of Determination
\(R^2 = \mathrm{SSR}/\mathrm{SST}\). It measures the proportion of the dependent variable’s variation explained by the independent variable. A larger R² indicates a better fitting model. For simple regression, R² also equals the square of the sample correlation between x and y.
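The equivalence between R² and the squared sample correlation can be checked numerically (hypothetical data again):

```python
import numpy as np

# Hypothetical sample data (illustrative only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8, 12.1])

# Least-squares fit and the sums of squares.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_fit = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)
ssr = np.sum((y_fit - y.mean()) ** 2)
r_squared = ssr / sst

# In simple regression, R² equals the squared correlation of x and y.
r = np.corrcoef(x, y)[0, 1]
print(round(r_squared, 4), round(r ** 2, 4))
```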
Standard Error of Estimate
The standard error of estimate (SEE) equals the square root of the residual mean square: \(\mathrm{SEE} = \sqrt{\mathrm{MSE}}\). A smaller SEE suggests a more accurate regression model. Both R² and the SEE can thus be computed directly from the sums of squares in the ANOVA table.
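Continuing with the same hypothetical data, the SEE follows in one step from the residuals:

```python
import numpy as np

# Hypothetical sample data (illustrative only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8, 12.1])
n = len(x)

# Least-squares fit and residuals.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

mse = np.sum(resid ** 2) / (n - 2)  # residual mean square
see = np.sqrt(mse)                  # standard error of estimate
print(round(see, 4))
```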
Hypothesis Testing of Regression Coefficients
We test whether the intercept or slope equals a specific constant; most commonly, we test whether the slope equals zero. Failing to reject the null hypothesis that the slope is zero implies there is no statistically significant linear relationship between x and y.
The test statistic is \(t = (\hat{\beta}_1 - \beta_1)/SE(\hat{\beta}_1)\), which follows a t‑distribution with n − 2 degrees of freedom, where \(SE(\hat{\beta}_1)\) is the slope’s standard error.
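A sketch of the slope test on hypothetical data, with the slope's standard error computed as \(\sqrt{\mathrm{MSE}/S_{xx}}\) and the critical value taken from the t‑distribution:

```python
import numpy as np
from scipy import stats

# Hypothetical sample data (illustrative only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8, 12.1])
n = len(x)

sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

mse = np.sum(resid ** 2) / (n - 2)
se_b1 = np.sqrt(mse / sxx)          # standard error of the slope

# H0: slope = 0, two-sided alternative at significance level 0.05.
t_stat = (b1 - 0.0) / se_b1
t_crit = stats.t.ppf(0.975, df=n - 2)
reject = abs(t_stat) > t_crit
print(round(t_stat, 2), round(t_crit, 3), reject)
```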
Example 2: A regression yields intercept b0 and slope b1 with standard errors SE(b0) and SE(b1). At a significance level of α, we compute the t‑statistics b0/SE(b0) and b1/SE(b1) and compare them with the critical value to assess significance.
In the example, the intercept’s t‑statistic does not exceed the critical value, so we cannot reject the null hypothesis for the intercept. The slope’s t‑statistic exceeds the critical value, leading us to reject the null hypothesis and conclude the slope is significantly different from zero.
Confidence Intervals for Regression Coefficients
Confidence intervals are constructed from the same quantities as the hypothesis tests: estimate ± critical value × standard error. For both the intercept and the slope, the critical value comes from the t‑distribution with n − 2 degrees of freedom.
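A sketch of both intervals on the same hypothetical data; the intercept's standard error uses the standard formula \(\sqrt{\mathrm{MSE}\,(1/n + \bar{x}^2/S_{xx})}\):

```python
import numpy as np
from scipy import stats

# Hypothetical sample data (illustrative only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8, 12.1])
n = len(x)

sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)
mse = np.sum(resid ** 2) / (n - 2)

se_b1 = np.sqrt(mse / sxx)
se_b0 = np.sqrt(mse * (1.0 / n + x.mean() ** 2 / sxx))

# 95% confidence intervals: estimate ± t_crit × standard error.
t_crit = stats.t.ppf(0.975, df=n - 2)
ci_b1 = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)
ci_b0 = (b0 - t_crit * se_b0, b0 + t_crit * se_b0)
print(ci_b0, ci_b1)
```

With these made-up numbers the intercept's interval straddles zero while the slope's does not, mirroring the pattern described in the example below.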
Using the example estimates, the 95% confidence interval for the intercept is [L1, U1], which includes zero, indicating the intercept may not differ significantly from zero. The slope’s interval [L2, U2] excludes zero, confirming its significance.
References
Zhu Shunquan, Economic and Financial Data Analysis and Its Python Application
Model Perspective
Insights, knowledge, and enjoyment from a mathematical modeling researcher and educator. Hosted by Haihua Wang, a modeling instructor and author of "Clever Use of Chat for Mathematical Modeling", "Modeling: The Mathematics of Thinking", "Mathematical Modeling Practice: A Hands‑On Guide to Competitions", and co‑author of "Mathematical Modeling: Teaching Design and Cases".