
Introduction to scikit-learn for Machine Learning: Ensemble Learning – Random Forest Algorithm

This article provides a comprehensive introduction to the Random Forest algorithm, covering its theoretical background, scikit-learn implementation details, a practical coding example with the Iris dataset, and a discussion of its advantages, limitations, and typical use cases in machine learning.


1. Overview

This article mainly explains the Random Forest algorithm, including (1) an introduction to the algorithm, (2) an overview of the scikit-learn library for Random Forest, (3) a practical coding example, (4) the underlying principles, and (5) a summary. The goal is to enable readers to use Random Forest and understand its workings.

2. References

1. Supervised Learning – Decision Trees; 2. scikit-learn Random Forest documentation; 3. Wikipedia entry for Random Forest; 4. Zhou Zhihua, "Machine Learning".

3. Ensemble Learning – Random Forest Basics

1. Introduction to Ensemble Learning

Ensemble learning trains several base learners and combines them with a specific strategy to form a strong learner, leveraging the strengths of each individual model.

Ensemble learning focuses on two problems: (1) how to obtain multiple base learners, and (2) how to combine them into a strong learner.

2. Individual Learners in Ensemble Learning

Base learners can be homogeneous (same type, e.g., all decision trees) or heterogeneous (different types, e.g., logistic regression and Naïve Bayes). Most practical ensembles use homogeneous learners. Homogeneous learners are further divided into two categories: those with strong dependencies (requiring sequential generation, e.g., Boosting) and those without dependencies (can be generated in parallel, e.g., Bagging and Random Forest).

Boosting workflow:

(1) Train a weak learner on the training set with initial weights. (2) Increase the weights of mis‑classified samples. (3) Train the next weak learner on the re‑weighted data. (4) Repeat steps (2) and (3) until the predefined number of weak learners T is reached. (5) Combine the T weak learners using a voting/averaging strategy to obtain the final strong model.
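The steps above can be sketched with scikit-learn's AdaBoostClassifier, which implements exactly this sample re-weighting scheme; the Iris data and T = 50 are illustrative choices, not part of the original text:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier

X, y = load_iris(return_X_y=True)

# T = 50 weak learners (decision stumps by default), trained sequentially;
# misclassified samples are re-weighted between rounds, as in steps (2)-(3)
booster = AdaBoostClassifier(n_estimators=50)
booster.fit(X, y)
print(booster.score(X, y))
```

Each round focuses the next weak learner on the samples its predecessors got wrong, which is what distinguishes Boosting from the independent training used in Bagging.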

Bagging workflow:

(1) Perform T bootstrap samplings to obtain T sample sets. (2) Train a weak learner independently on each sample set. (3) Combine the T weak learners to form the final strong model.
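A minimal sketch of the bagging workflow using scikit-learn's BaggingClassifier, with decision trees as the weak learners; the dataset and T = 10 are illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier

X, y = load_iris(return_X_y=True)

# T = 10 bootstrap samples, one weak learner (a decision tree by default)
# trained independently on each; predictions are combined by voting
bagger = BaggingClassifier(n_estimators=10, bootstrap=True, random_state=0)
bagger.fit(X, y)
print(bagger.score(X, y))
```

Because the T learners are trained independently, this step parallelizes naturally; Random Forest builds on exactly this recipe.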

3. Combination Strategies

Combination methods are roughly divided into three categories: averaging (used for regression), voting (used for classification), and learning‑based stacking. Voting includes simple majority, absolute majority (requires >50% of votes), and weighted voting. Stacking trains a meta‑learner on the predictions of the base learners.
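The voting and averaging strategies can be illustrated directly with NumPy; the prediction arrays below are made-up values, not output from a real model:

```python
import numpy as np

# Made-up predictions from three base classifiers on five samples
preds = np.array([
    [0, 1, 1, 2, 0],   # learner 1
    [0, 1, 2, 2, 0],   # learner 2
    [1, 1, 1, 2, 0],   # learner 3
])

# Simple majority voting (classification): most frequent class per sample
majority = np.array([np.bincount(col).argmax() for col in preds.T])
print(majority)  # [0 1 1 2 0]

# Averaging (regression): mean of the base learners' outputs for one sample
reg_preds = np.array([2.1, 1.9, 2.3])
print(reg_preds.mean())
```

Absolute-majority voting would additionally require the winning class to hold more than 50% of the votes, and weighted voting would multiply each learner's vote by a reliability weight.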

4. Random Forest Introduction

Random Forest is an improved Bagging algorithm: in addition to bootstrap sampling, each tree considers only a random subset of features at every split, which decorrelates the trees. It is highly parallelizable, making it well‑suited for large‑scale data.

5. Random Forest Library in scikit-learn

scikit-learn provides two classes: RandomForestClassifier for classification and RandomForestRegressor for regression. Parameters are divided into Bagging framework parameters and CART decision‑tree parameters.

1. Bagging Framework Parameters

(1) n_estimators : Number of trees. Too few leads to under‑fitting; too many increases computation with diminishing returns. Default is 100. (2) oob_score : Whether to use out‑of‑bag samples to estimate model performance. Default is False ; setting it to True is recommended. (3) criterion : Split quality measure for CART trees. For classification the default is Gini impurity; for regression it is mean squared error.

2. CART Decision‑Tree Parameters

(1) max_features : Maximum number of features considered at each split. The default for classification is "sqrt" (√N features); older scikit-learn versions called this "auto". (2) max_depth : Maximum depth of each tree. If unset, trees grow until all leaves are pure; for large datasets a depth between 10 and 100 is advisable. (3) min_samples_split : Minimum number of samples required to split an internal node. Default is 2; increase for very large datasets. (4) min_samples_leaf : Minimum number of samples required at a leaf node. Default is 1; can be increased for large datasets.
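Putting the parameters from both groups together, a typical instantiation might look like the sketch below; the values are illustrative defaults, not tuning recommendations:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

clf = RandomForestClassifier(
    n_estimators=100,      # number of trees in the forest
    oob_score=True,        # estimate generalization on out-of-bag samples
    criterion="gini",      # split quality measure for classification
    max_features="sqrt",   # consider sqrt(N) features at each split
    max_depth=None,        # grow trees until leaves are pure
    min_samples_split=2,
    min_samples_leaf=1,
    random_state=0,
)
clf.fit(X, y)
print(clf.oob_score_)      # out-of-bag accuracy estimate
```

With oob_score=True, clf.oob_score_ provides a generalization estimate without holding out a separate validation set, which is why enabling it is recommended above.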

6. Random Forest Practical Example

1. Import Libraries

#coding=utf-8
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier

2. Load Sample Data

# Use the built‑in Iris dataset
iris = datasets.load_iris()
# Load data into a DataFrame
df = pd.DataFrame(iris.data, columns=iris.feature_names)

3. Split Train/Test

# Randomly mark 75% of samples as training data
df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75
# Add true class labels
df['species'] = iris.target
# Separate train and test sets
train = df[df['is_train'] == True]
test = df[df['is_train'] == False]
# Feature columns (first four columns)
features = df.columns[:4]

4. Train Model

# Train a Random Forest with max depth 5 and 10 trees
clf = RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1)
# Convert class labels to integer codes; for Iris the codes follow the
# order of appearance and match the original 0/1/2 targets
Y, _ = pd.factorize(train['species'])
# Fit the model
clf.fit(train[features], Y)

5. Predict and Validate

# Predict on the test set
preds = clf.predict(test[features])

6. Obtain Results

print(preds)
print(test['species'].values)

7. Compute Accuracy

# Fraction of test samples predicted correctly
accuracy = (preds == test['species'].values).mean()
print(accuracy)

Accuracy obtained: ≈0.946 (the exact value varies between runs because the train/test split is random).

7. Random Forest Principle

1. Principle

Random Forest performs bootstrap sampling (sampling with replacement) from the training set. Each of the T bootstrap samples contains the same number of instances as the original set, and some instances may appear multiple times while others may be omitted. Approximately 36.8% of the original data are not sampled (out‑of‑bag data) and can be used to estimate the model’s generalization ability.
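The 36.8% out-of-bag figure (which comes from the limit (1 − 1/m)^m → 1/e) can be checked empirically with a quick simulation; the sample size m is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100_000  # training-set size (arbitrary)

# One bootstrap sample of size m, drawn with replacement
sample = rng.integers(0, m, size=m)

# Fraction of the original instances that never appear in the sample
oob_fraction = 1 - len(np.unique(sample)) / m
print(oob_fraction)  # close to 1/e ≈ 0.368
```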

For classification, Random Forest aggregates predictions by majority voting; for regression, it averages the predictions.

2. Algorithm Flow

Given a dataset D = {(x₁, y₁), …, (x_m, y_m)} and a weak‑learner algorithm, Random Forest builds T trees as follows:

(1) Perform a bootstrap sample of size m to obtain D_t. (2) Train a weak learner on D_t, considering only a random subset of features at each node split. (3) For classification, the final prediction is the class receiving the most votes among the T trees; for regression, the final output is the arithmetic mean of the T predictions.

8. Summary

Random Forest is a highly parallelizable algorithm that excels with large‑scale data. Advantages: (1) Training can be parallelized, offering speed for big data; (2) Provides feature importance; (3) Low variance and strong generalization due to random sampling; (4) Robust to missing features. Disadvantages: (1) Acts as a black‑box for many statisticians; (2) May contain many similar trees, obscuring true patterns; (3) May perform poorly on small or low‑dimensional datasets, though it handles high‑dimensional, missing‑feature, and imbalanced data well.

Tags: machine learning, Python, classification, random forest, scikit-learn, ensemble learning, bagging
Written by Qunar Tech Salon

Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.
