
Understanding RuleFit: Combining Tree Rules with Linear Models for Interpretable AI

RuleFit, introduced by Friedman and Popescu in 2008, combines decision-tree-derived rules with a sparse linear model to boost predictive accuracy while remaining interpretable. This article explains the model's definition, rule extraction, the fitting algorithm, a code example, and its advantages and limitations.


RuleFit Model

Friedman and Popescu proposed the RuleFit model in 2008. It first fits a tree model (e.g., decision tree, GBDT) to all features, extracts the rules from the tree, and treats those rules as new features for a linear model. This creates new features that capture interactions, improving accuracy over a plain linear model while preserving interpretability.

Model Definition

RuleFit incorporates tree‑derived rules into an ensemble learning model. Important rules are extracted from a complex model and used as additional variables in training. Because rules are inherently interpretable, RuleFit belongs to the class of intrinsically interpretable models. After fitting the ensemble tree model once, RuleFit adds rule variables (and optionally original linear features) to a second‑stage linear regression, yielding higher accuracy and an easy‑to‑understand structure.

There are two variants: a rule-based model and a rule-and-linear model; the latter also includes the original features as linear terms and is better suited to data with a low signal-to-noise ratio. The objective functions are analogous to those of ensemble learning, with coefficients estimated by minimizing a regularized loss, as in penalized linear regression.

Rule Extraction

Rules are generated from the paths of a trained decision tree. Each path from the root to a non‑root node (interior or leaf) constitutes a rule. For example, a tree with nine nodes yields eight rules. The generic rule representation uses indicator functions that depend on the feature type (continuous or categorical).

Example: a rule from node 0 to node 5 involves three split nodes and can be expressed as an indicator that equals 1 when the sample satisfies all three conditions, otherwise 0.
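In code, such a rule is simply an indicator over a conjunction of split conditions. The sketch below illustrates this with hypothetical feature indices and thresholds (they are not from any particular tree):

```python
import numpy as np

def rule_indicator(x, conditions):
    """Return 1 if sample x satisfies every (feature_index, op, threshold)
    condition on the tree path, else 0."""
    for idx, op, threshold in conditions:
        if op == "<=" and not (x[idx] <= threshold):
            return 0
        if op == ">" and not (x[idx] > threshold):
            return 0
    return 1

# Hypothetical rule for a path through three split nodes:
# x[0] <= 5.0 AND x[2] > 1.4 AND x[3] <= 0.8
rule = [(0, "<=", 5.0), (2, ">", 1.4), (3, "<=", 0.8)]

sample = np.array([4.9, 3.0, 1.5, 0.2])
print(rule_indicator(sample, rule))  # 1: all three conditions hold
```

Each such indicator becomes one binary column in the second-stage linear model's design matrix.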

Implementation Algorithm

RuleFit fits two models in sequence: first an ensemble of trees, then a generalized linear model over the extracted rule features. The tree ensemble is trained on a dataset of N samples with a specified loss function; after training M trees, all rules are extracted and used as new variables for the linear model, which is typically fit with an L1 (Lasso) penalty so that coefficients of unhelpful rules shrink to exactly zero.
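The two-stage procedure can be sketched with scikit-learn. This is a simplified illustration, not the reference implementation: it extracts only root-to-leaf rules, whereas full RuleFit also uses paths ending at interior nodes.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

def leaf_rules(tree):
    """Collect the (feature, op, threshold) conditions along each
    root-to-leaf path of a fitted sklearn tree."""
    t = tree.tree_
    rules = []
    def recurse(node, conds):
        if t.children_left[node] == -1:      # leaf node
            if conds:
                rules.append(list(conds))
            return
        f, thr = t.feature[node], t.threshold[node]
        recurse(t.children_left[node], conds + [(f, "<=", thr)])
        recurse(t.children_right[node], conds + [(f, ">", thr)])
    recurse(0, [])
    return rules

def rule_matrix(X, rules):
    """Evaluate every rule on every sample -> binary feature matrix."""
    cols = []
    for conds in rules:
        col = np.ones(len(X))
        for f, op, thr in conds:
            col *= (X[:, f] <= thr) if op == "<=" else (X[:, f] > thr)
        cols.append(col)
    return np.column_stack(cols)

X, y = load_breast_cancer(return_X_y=True)

# Stage 1: a small tree ensemble generates candidate rules.
gbm = GradientBoostingClassifier(n_estimators=20, max_depth=3,
                                 random_state=0).fit(X, y)
rules = [r for est in gbm.estimators_.ravel() for r in leaf_rules(est)]

# Stage 2: L1-regularized linear model over raw features + rule features,
# so uninformative rules receive zero coefficients.
Z = np.hstack([X, rule_matrix(X, rules)])
lr = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(Z, y)
print("rules:", len(rules), "nonzero coefficients:",
      np.count_nonzero(lr.coef_))
```

The L1 penalty is what turns a large pool of candidate rules into a compact, readable model.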

Model Interpretability

After L1 regularization, many coefficients become exactly zero, which performs variable selection. Importance can be measured globally and locally for both rule and linear features. Global importance combines the coefficient magnitude with the variable's standard deviation; for a binary rule with support s (the fraction of samples satisfying it), the variance is that of a Bernoulli variable, s(1 − s).
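A minimal sketch of these global-importance formulas, assuming the Friedman-Popescu definitions (|coefficient| × sqrt(s(1 − s)) for a rule with support s, and |coefficient| × std for a linear term):

```python
import numpy as np

def rule_importance(coef, support):
    """Global importance of a 0/1 rule feature: |coef| times the rule's
    standard deviation sqrt(s * (1 - s)), where s is its support."""
    return abs(coef) * np.sqrt(support * (1 - support))

def linear_importance(coef, values):
    """Global importance of an original linear feature:
    |coef| times the feature's empirical standard deviation."""
    return abs(coef) * np.std(values)

# A rule covering 25% of samples with coefficient 0.8:
print(rule_importance(0.8, 0.25))  # 0.8 * sqrt(0.1875) ≈ 0.346
```

Note that a rule's importance peaks at support 0.5 and vanishes for rules that match almost no samples or almost all of them.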

Local importance of original features is derived from the contributions of rules that contain the feature, with the total local importance summed across samples to obtain a global measure.

Code Example

<code>from sklearn.datasets import load_iris,load_breast_cancer
import pandas as pd
from sklearn.model_selection import train_test_split
from rulefit import RuleFit

data = load_breast_cancer()
df  = pd.DataFrame(data['data'],columns=data['feature_names'])
df['target'] = data['target']
X,y = df.iloc[:,:-1],df['target']
X = X.values

rf = RuleFit(tree_size=4,sample_fract=0.8,max_rules=20,memory_par=0.01,rfmode='classification')
rf.fit(X,y,feature_names = df.columns)
rules = rf.get_rules()
</code>

Advantages and Limitations

RuleFit can be seen as a generalization of the GBDT+LR approach used by Facebook in 2014. By converting tree paths into high‑dimensional sparse features, a simple logistic regression can achieve better performance, capturing low‑ and high‑order interactions while remaining interpretable.
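The GBDT+LR idea can be sketched with scikit-learn: `apply()` returns the leaf each tree routes a sample to, and one-hot encoding those leaf ids yields the high-dimensional sparse features fed to logistic regression. This is an illustrative sketch on a toy dataset, not the production pipeline described in the Facebook paper:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Stage 1: each tree maps a sample to exactly one leaf (one rule).
gbm = GradientBoostingClassifier(n_estimators=30, max_depth=3,
                                 random_state=0).fit(X_tr, y_tr)
leaves_tr = gbm.apply(X_tr).reshape(len(X_tr), -1)
leaves_te = gbm.apply(X_te).reshape(len(X_te), -1)

# Stage 2: one-hot leaf ids become high-dimensional sparse features.
enc = OneHotEncoder(handle_unknown="ignore").fit(leaves_tr)
lr = LogisticRegression(max_iter=1000).fit(enc.transform(leaves_tr), y_tr)
acc = lr.score(enc.transform(leaves_te), y_te)
print("GBDT+LR test accuracy:", acc)
```

The difference from RuleFit is that here every sample activates exactly one leaf per tree, whereas RuleFit's rules may come from any tree node and can overlap freely.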

Advantages include: (1) logistic regression handles high-dimensional sparse data well and benefits from the interaction terms the trees discover automatically; (2) tree ensembles excel at partitioning continuous features, and the resulting discretized splits are discriminative and make the linear model more robust.

Limitations include: (1) the final model may contain many rules, which reduces interpretability and introduces multicollinearity when similar rules overlap; (2) RuleFit can be unstable: different training runs on the same data may produce different rule sets, and its predictive accuracy may lag behind more complex models.

Tags: feature engineering · linear models · ensemble trees · interpretable machine learning · RuleFit
Written by

Model Perspective

Insights, knowledge, and enjoyment from a mathematical modeling researcher and educator. Hosted by Haihua Wang, a modeling instructor and author of "Clever Use of Chat for Mathematical Modeling", "Modeling: The Mathematics of Thinking", "Mathematical Modeling Practice: A Hands‑On Guide to Competitions", and co‑author of "Mathematical Modeling: Teaching Design and Cases".
