
Master Feature Selection: From Filters to PCA with Python

This article explains why selecting the right features is essential for machine learning, outlines the general workflow, compares filter, wrapper, and embedded methods, demonstrates statistical tests and Python code examples, and shows how PCA can synthesize features for dimensionality reduction.


Choosing which features to keep and which to discard is crucial in data processing and machine learning: irrelevant features add noise, redundant features add cost, and both can degrade prediction accuracy.

General Feature‑Selection Workflow

The typical steps are:

Generate a subset of features to evaluate.

Define an evaluation function that scores a subset.

Set a stopping criterion (often a threshold on the evaluation score).

Validate the selected subset on a validation set.

Enumerating all subsets is infeasible when the number of features is large, so heuristic or domain‑specific strategies are required.
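As a sketch only, the generate/evaluate/stop loop above can be written as a greedy forward search; `score` here is a hypothetical evaluation function supplied by the caller, not a library routine:

```python
def select_features(features, score, threshold):
    """Greedy forward search following the generate/evaluate/stop workflow.

    `score` maps a list of features to a quality value; `threshold` is the
    stopping criterion: stop once adding a feature no longer improves the
    score by at least `threshold`.
    """
    selected, best_score = [], float('-inf')
    remaining = list(features)
    while remaining:
        # 1. generate candidate subsets: current selection plus one feature
        candidates = [selected + [f] for f in remaining]
        # 2. evaluate every candidate and keep the best
        best = max(candidates, key=score)
        # 3. stop when the improvement falls below the threshold
        if score(best) - best_score < threshold:
            break
        selected, best_score = best, score(best)
        remaining = [f for f in remaining if f not in selected]
    # 4. the returned subset should still be validated on held-out data
    return selected
```

With a toy additive score such as `lambda s: sum({'a': 3, 'b': 2, 'c': 0.01}[f] for f in s)` and a threshold of 0.1, the search picks `['a', 'b']` and stops before the nearly useless `c`.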

Common Feature‑Selection Strategies

Feature‑selection techniques can be grouped into three major families, plus a related feature‑synthesis approach:

Filter methods : score each feature independently using statistical measures (e.g., correlation, chi‑square, mutual information) and keep the top‑k.

Wrapper methods : evaluate subsets by training a model and using its performance as the score.

Embedded methods : let the learning algorithm assign importance weights during training (e.g., L1‑regularised models, tree‑based importance).

Feature synthesis / dimensionality reduction such as Principal Component Analysis (PCA).

Filter Methods

Filter methods score each feature by how much information it provides about the target and rank features accordingly. Two key decisions are the choice of evaluation measure and the selection threshold.

Evaluation Measures

Correlation coefficients – Pearson assesses linear relationships; Kendall assesses monotonic (rank‑based) relationships.

Chi‑square test – test independence between categorical variables.

Mutual information and maximal information coefficient – information‑theoretic measures.

Distance correlation.

Variance threshold – discard features with low variance.
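The last measure in the list maps directly onto scikit-learn's VarianceThreshold; in this sketch the toy matrix and the 0.2 cut-off are made up for illustration:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# toy matrix: the middle column is nearly constant
X = np.array([[0, 1, 10],
              [1, 1, 20],
              [0, 1, 30],
              [1, 0, 40]])
sel = VarianceThreshold(threshold=0.2)  # drop features with variance <= 0.2
X_reduced = sel.fit_transform(X)
print(sel.variances_)   # per-feature variances: [0.25, 0.1875, 125.]
print(X_reduced.shape)  # (4, 2) -- the near-constant column is gone
```

Note that the variance is computed on the training data only, so very low thresholds mainly remove constant columns.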

Pearson Correlation Example

Using scipy.stats.pearsonr, we compute the correlation between two synthetic features, x1 and x2, and the target y:

<code>from scipy.stats import pearsonr

# two synthetic features and a target
x1 = [51, 80, 95, 19, 73, 84, 65, 30, 1, 35, 13, 61, 36, 65, 57, 40, 15, 73, 58, 62]
x2 = [7.0, 27.5, 23.0, 32.0, 15.5, 44.0, 10.5, 29.5, 36.0, 47.5,
      27.0, 28.5, 26.5, 41.5, 12.5, 0.5, 19.0, 48.5, 0.5, 24.0]
y  = [14, 64, 54, 72, 36, 92, 24, 62, 72, 95, 55, 64, 60, 84,
      33, 5, 40, 99, 2, 48]
print('x1 vs y:', pearsonr(x1, y))  # correlation coefficient and p-value
print('x2 vs y:', pearsonr(x2, y))
</code>

The output shows a near‑zero correlation for x1 and a very strong positive correlation for x2; plotting each feature against y makes the contrast easy to see.

Pearson correlation only captures linear relationships; a feature that depends on the target non‑linearly, for example quadratically, can yield a coefficient close to zero even though the feature is highly informative.
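To illustrate with a small sketch: a perfectly quadratic relationship has near-zero Pearson correlation, while an information-theoretic measure such as scikit-learn's mutual_info_regression still detects the dependence:

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.feature_selection import mutual_info_regression

# y is a deterministic, but non-linear, function of x
x = np.linspace(-1, 1, 200)
y = x ** 2

r, _ = pearsonr(x, y)                        # ~0: no linear trend
mi = mutual_info_regression(x.reshape(-1, 1), y,
                            random_state=0)  # clearly positive
print(f'Pearson r = {r:.3f}, mutual information = {mi[0]:.2f}')
```

A filter based on Pearson alone would discard x here, even though x determines y completely.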

Chi‑square Test

The chi‑square test compares observed frequencies with expected frequencies under independence. The typical steps are: formulate null and alternative hypotheses, compute expected counts, calculate the chi‑square statistic, determine degrees of freedom, define a rejection region, and decide based on the p‑value.

A classic example examines whether corrupt officials have shorter lifespans than honest officials using a 2 × 2 contingency table. The chi‑square statistic is 323.4 with a p‑value ≈ 2.6 × 10⁻⁷², leading to rejection of the independence hypothesis.
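The computation can be reproduced by hand with NumPy (a sketch; note that for a 2 × 2 table the conventional Yates continuity correction is applied, which is also scipy's default):

```python
import numpy as np
from scipy.stats import chi2

# observed counts: lifespan (short/long) by official type (corrupt/honest)
obs = np.array([[348, 152],
                [93, 497]])

row_totals = obs.sum(axis=1, keepdims=True)
col_totals = obs.sum(axis=0, keepdims=True)
n = obs.sum()
expected = row_totals * col_totals / n  # counts expected under independence

# chi-square statistic with Yates continuity correction
stat = ((np.abs(obs - expected) - 0.5) ** 2 / expected).sum()
dof = (obs.shape[0] - 1) * (obs.shape[1] - 1)  # = 1
p_value = chi2.sf(stat, dof)
print(round(stat, 1), p_value)  # 323.4, ~2.6e-72
```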

In Python the test can be performed with scipy.stats.chi2_contingency :

<code>import pandas as pd
from scipy.stats import chi2_contingency

# observed counts: lifespan (short/long) by official type
df = pd.DataFrame([[348, 152], [93, 497]],
                  index=['Corrupt', 'Honest'],
                  columns=['Short', 'Long'])
stat, p, dof, expected = chi2_contingency(df)
print(stat, p)
</code>

Wrapper Methods

Wrapper methods evaluate a subset by training a model (e.g., Random Forest, SVM, k‑NN) and measuring its error. Subset generation strategies include forward selection, backward elimination, and recursive feature elimination (RFE).
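Forward selection, for example, is available in scikit-learn as SequentialFeatureSelector; this sketch selects three features for a k-NN classifier on the breast-cancer data (the model and the target count are arbitrary choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# greedily add the feature that most improves cross-validated accuracy
sfs = SequentialFeatureSelector(KNeighborsClassifier(),
                                n_features_to_select=3,
                                direction='forward')
X_sel = sfs.fit_transform(X, y)
print(X_sel.shape)  # (569, 3)
```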

RFE with a Random Forest classifier on the breast‑cancer dataset can be implemented as:

<code>from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
# recursively drop the least important features until five remain
selected = RFE(estimator=RandomForestClassifier(),
               n_features_to_select=5).fit_transform(X, y)
print(selected.shape)  # (569, 5)
</code>

The resulting shape (569, 5) indicates that the five most important features were retained.
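To see which columns survived rather than just their count, one can inspect the selector's support_ mask, for example (random_state is fixed here only to make the run repeatable):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

cancer = load_breast_cancer()
rfe = RFE(estimator=RandomForestClassifier(random_state=0),
          n_features_to_select=5).fit(cancer.data, cancer.target)

# map the boolean mask back to human-readable feature names
chosen = [name for name, keep in zip(cancer.feature_names, rfe.support_) if keep]
print(chosen)
```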

Embedded Methods

L1‑Regularised Feature Selection

L1 regularisation (Lasso) yields sparse coefficients, effectively performing feature selection. Using logistic regression with an L1 penalty:

<code>from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
model = SelectFromModel(LogisticRegression(penalty='l1',
                                           solver='liblinear',
                                           C=0.25))
X_new = model.fit_transform(X, y)
print(X_new.shape)  # e.g., (569, 7)
</code>

The hyperparameter C controls the regularisation strength and hence the number of retained features: smaller values of C mean stronger regularisation and fewer non‑zero coefficients.
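A quick sweep makes the effect of C visible (the three values are arbitrary; exact counts depend on the data and the solver):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

for C in (0.01, 0.1, 1.0):
    selector = SelectFromModel(LogisticRegression(penalty='l1',
                                                  solver='liblinear',
                                                  C=C))
    n_kept = selector.fit_transform(X, y).shape[1]
    print(f'C={C}: {n_kept} features kept')  # weaker regularisation keeps more
```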

Feature Synthesis – Principal Component Analysis (PCA)

PCA replaces a set of correlated features with a smaller set of orthogonal components that capture most of the variance. The algorithm steps are: standardise the data, compute the covariance matrix, obtain eigenvalues and eigenvectors, select the top‑k components based on explained variance, and transform the data.
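Those steps can be carried out directly with NumPy; this is a sketch of the algorithm, applied here to the breast-cancer data:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

X = load_breast_cancer().data

# 1. standardise the data
Z = (X - X.mean(axis=0)) / X.std(axis=0)
# 2. covariance matrix of the standardised data
C = np.cov(Z, rowvar=False)
# 3. eigenvalues/eigenvectors (eigh: C is symmetric), sorted descending
vals, vecs = np.linalg.eigh(C)
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]
# 4. keep the top-k components by explained variance
k = 5
ratio = vals[:k] / vals.sum()
# 5. project the data onto the selected components
Z_k = Z @ vecs[:, :k]
print(Z_k.shape, ratio)
```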

Example using scikit‑learn:

<code>from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
import pandas as pd

cancer = load_breast_cancer()
X = cancer.data
pca = PCA(n_components=5)
pca.fit(X)
print('Explained variance:', pca.explained_variance_)
print('Explained variance ratio:', pca.explained_variance_ratio_)
</code>

Fitted on the raw data as above, the first component alone explains about 98 % of the variance, visible as a sharp drop in the scree plot. This is largely an artefact of scale: the data were not standardised, so the features with the largest numeric ranges dominate the covariance matrix.
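Standardising first changes the picture dramatically; a sketch with a StandardScaler pipeline shows the first component's share fall to well under half:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = load_breast_cancer().data

# scale each feature to zero mean and unit variance before PCA
pipe = make_pipeline(StandardScaler(), PCA(n_components=5)).fit(X)
ratios = pipe.named_steps['pca'].explained_variance_ratio_
print(ratios)  # the variance is now spread across many components
```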

In summary, feature selection can be performed via filter, wrapper, embedded, or synthesis approaches, each with its own trade‑offs between computational cost and predictive power.


Written by

Model Perspective

Insights, knowledge, and enjoyment from a mathematical modeling researcher and educator. Hosted by Haihua Wang, a modeling instructor and author of "Clever Use of Chat for Mathematical Modeling", "Modeling: The Mathematics of Thinking", "Mathematical Modeling Practice: A Hands‑On Guide to Competitions", and co‑author of "Mathematical Modeling: Teaching Design and Cases".
