Implementing Linear Regression from Scratch in Python
This tutorial walks through the complete process of building a linear regression model in Python: loading a housing-price dataset, normalising features, defining the hypothesis, cost, and gradient-descent functions, visualising the data and cost convergence, and testing predictions. Full source code is provided.
Linear regression remains a foundational technique for data scientists, offering interpretability and wide applicability across business, biological, and industrial datasets. This article demonstrates how to implement a multivariate linear regression model in Python, using a simple house‑price dataset containing house size, number of rooms, and price.
The load_data function reads a CSV file, assigns column names, converts the data to a NumPy array, plots the raw features, normalises them in place, and returns the feature matrix x and the target vector y.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

mu = []   # per-feature means recorded by normalize(), reused later in test()
std = []  # per-feature standard deviations

def load_data(filename):
    df = pd.read_csv(filename, sep=",", index_col=False)
    df.columns = ["housesize", "rooms", "price"]
    data = np.array(df, dtype=float)
    plot_data(data[:, :2], data[:, -1])
    normalize(data)  # scales the feature columns of data in place
    return data[:, :2], data[:, -1]

Feature normalisation uses the standard score Z = (x - μ) / σ to bring all features onto a comparable scale, preventing bias toward features with larger numeric ranges.
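As a quick toy check (the values below are invented for illustration), a z-scored column ends up with mean 0 and standard deviation 1:

```python
import numpy as np

# Hypothetical house sizes in square feet
sizes = np.array([1400.0, 1600.0, 2400.0, 3000.0])

# Standard score: Z = (x - mean) / std
z = (sizes - sizes.mean()) / sizes.std()

print(np.isclose(z.mean(), 0.0))  # True: mean is 0 after scaling
print(np.isclose(z.std(), 1.0))   # True: standard deviation is 1
```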
def normalize(data):
    for i in range(data.shape[1] - 1):  # skip the target column
        col_mean = np.mean(data[:, i])
        col_std = np.std(data[:, i])
        data[:, i] = (data[:, i] - col_mean) / col_std
        mu.append(col_mean)  # recorded before scaling, for use on unseen inputs
        std.append(col_std)

The hypothesis for multivariate linear regression is expressed as h(x, θ) = x · θ, where x includes an intercept column of ones.
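As a hedged illustration (the numbers below are made up), prepending a column of ones lets the first parameter θ₀ act as the intercept in the matrix product:

```python
import numpy as np

# Two normalised examples, each with an intercept column of ones prepended
x = np.array([[1.0, 0.5, -1.0],
              [1.0, -0.5, 1.0]])
theta = np.array([2.0, 3.0, 4.0])  # [intercept, weight_1, weight_2]

# h(x, theta) = x . theta
predictions = np.matmul(x, theta)
print(predictions)  # [2 + 1.5 - 4, 2 - 1.5 + 4] -> [-0.5, 4.5]
```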
def h(x, theta):
    return np.matmul(x, theta)

The cost function measures the mean squared error between predictions and actual values.
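To make the vectorised form concrete, a small made-up example shows that (h − y)ᵀ(h − y) / (2m) is simply half the mean squared error:

```python
import numpy as np

predictions = np.array([3.0, 5.0, 7.0])  # hypothetical model outputs
y = np.array([2.0, 5.0, 9.0])            # hypothetical true prices
m = y.shape[0]

vectorised = ((predictions - y).T @ (predictions - y)) / (2 * m)
elementwise = np.mean((predictions - y) ** 2) / 2

print(vectorised, elementwise)  # both give (1 + 0 + 4) / 6
```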
def cost_function(x, y, theta):
    return ((h(x, theta) - y).T @ (h(x, theta) - y)) / (2 * y.shape[0])

Gradient descent iteratively updates the parameter vector θ by moving opposite to the gradient of the cost, scaled by a learning rate, and records the cost at each epoch.
def gradient_descent(x, y, theta, learning_rate=0.1, num_epochs=10):
    m = x.shape[0]
    J_all = []
    for _ in range(num_epochs):
        h_x = h(x, theta)
        gradient = (1 / m) * (x.T @ (h_x - y))  # gradient of the cost w.r.t. theta
        theta = theta - learning_rate * gradient
        J_all.append(cost_function(x, y, theta))
    return theta, J_all

Utility functions plot_data and plot_cost visualise the raw data and the cost-versus-epoch curve, respectively.
def plot_data(x, y):
    plt.xlabel('house size')
    plt.ylabel('price')
    plt.plot(x[:, 0], y, 'bo')
    plt.show()

def plot_cost(J_all, num_epochs):
    plt.xlabel('Epochs')
    plt.ylabel('Cost')
    plt.plot(range(num_epochs), J_all, 'm', linewidth=5)
    plt.show()

After loading the data, the script reshapes y into a column vector, adds an intercept column of ones to x, initialises θ to zeros, sets a learning rate of 0.1, and runs 50 epochs of gradient descent. The final cost, the learned parameters, and a cost-versus-epoch plot are then produced.
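The driver script itself is not listed in the excerpt; the following self-contained sketch reproduces those steps on a small made-up dataset, inlining the gradient-descent loop and omitting the plotting calls, so every number here is an illustrative assumption rather than the article's actual data:

```python
import numpy as np

# Synthetic stand-in for the normalised features (size, rooms) and prices
x = np.array([[0.5, -1.0], [-0.5, 1.0], [1.5, 0.0], [-1.5, 0.0]])
y = np.array([400.0, 330.0, 369.0, 232.0])

y = y.reshape(y.shape[0], 1)                  # make y a column vector
x = np.hstack((np.ones((x.shape[0], 1)), x))  # prepend the intercept column
theta = np.zeros((x.shape[1], 1))             # initialise parameters to zero

learning_rate = 0.1
num_epochs = 50
m = x.shape[0]
J_all = []
for _ in range(num_epochs):
    gradient = (1 / m) * (x.T @ (x @ theta - y))
    theta = theta - learning_rate * gradient
    J_all.append((((x @ theta - y).T @ (x @ theta - y)) / (2 * m)).item())

print("Final cost:", J_all[-1])   # should be far below the first epoch's cost
print("Parameters:", theta.ravel())
```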
A test function demonstrates how to predict the price of a new house by normalising its features with the previously stored means and standard deviations and applying the learned hypothesis.
def test(theta, x):
    # Scale the new example with the means and standard deviations
    # stored during training, then apply the learned hypothesis.
    x[0] = (x[0] - mu[0]) / std[0]
    x[1] = (x[1] - mu[1]) / std[1]
    y = theta[0] + theta[1] * x[0] + theta[2] * x[1]
    print("Price of house:", y)

The article concludes that mastering this end-to-end implementation equips readers with a solid foundation for exploring more advanced machine-learning algorithms.