Master Decision Trees: Concepts, Algorithms, and Python Implementation
This article introduces decision trees in data analysis and machine learning, explains core concepts, compares popular algorithms such as ID3, C4.5, and CART, details their mathematical foundations, and provides a complete Python implementation with code examples for building and visualizing trees.
Introduction
Decision trees are a type of machine learning model that classify data by asking a series of feature‑based questions, forming a tree structure where each leaf represents a class.
What is a decision tree?
Simply put, a decision tree uses features to ask questions at each node, splitting the data into subsets until reaching leaf nodes that correspond to class labels.
Common algorithms
Popular decision‑tree algorithms include ID3, C4.5, and CART.
ID3: selects the feature with the highest information gain (entropy reduction) as the split at each node.
C4.5: an improvement over ID3 that selects splits by gain ratio, handles continuous attributes and missing values, and generally achieves higher accuracy.
CART: uses Gini impurity as the split criterion, builds binary trees, and is robust to outliers and missing values.
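For comparison with the entropy-based criteria used by ID3 and C4.5, the Gini impurity that CART minimizes can be computed in a few lines. This is an illustrative sketch (helper names are our own, not from the article):

```python
def gini(labels):
    """Gini impurity: probability that two labels drawn at random differ."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def weighted_gini(splits):
    """Weighted Gini impurity of a candidate split (a list of label subsets)."""
    total = sum(len(s) for s in splits)
    return sum(len(s) / total * gini(s) for s in splits)
```

A maximally mixed two-class node has impurity 0.5 and a pure node has impurity 0.0; CART chooses the split whose `weighted_gini` is lowest.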
Mathematical principles
ID3
ID3 uses entropy E(S) to measure the disorder of a set of labeled examples; information gain is the reduction in entropy after a split. At each node, ID3 selects the feature that maximizes the gain, which is equivalent to minimizing the weighted entropy of the resulting subsets.
Example: compute the information gain for each candidate feature and select the one with the highest value (e.g., Outlook).
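The calculation can be sketched in Python (illustrative helper names, not from the article's own code):

```python
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum(labels.count(c) / n * log2(labels.count(c) / n)
                for c in set(labels))

def info_gain(samples, labels, feature_index):
    """Entropy reduction from splitting `samples` on one feature column."""
    base = entropy(labels)
    remainder = 0.0
    for v in set(row[feature_index] for row in samples):
        subset = [y for row, y in zip(samples, labels)
                  if row[feature_index] == v]
        remainder += len(subset) / len(labels) * entropy(subset)
    return base - remainder
```

A 50/50 two-class set has entropy 1.0 bit; a feature that separates the classes perfectly has a gain equal to that base entropy.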
C4.5
C4.5 addresses ID3’s sensitivity to many‑value features by using the gain ratio (information gain divided by split information). It also handles continuous attributes and missing values.
For continuous attributes (e.g., humidity), values are sorted, duplicate values removed, and split points evaluated; the split with the highest gain is chosen.
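That procedure can be sketched as follows, using midpoints between consecutive distinct values as candidate thresholds (an illustrative sketch; function names are our own):

```python
from math import log2

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum(labels.count(c) / n * log2(labels.count(c) / n)
                for c in set(labels))

def best_threshold(values, labels):
    """Evaluate each midpoint between consecutive distinct values as a binary
    split point; return the (threshold, gain) pair with the highest gain."""
    base = entropy(labels)
    uniq = sorted(set(values))          # sort and drop duplicates
    best_gain, best_t = -1.0, None
    for lo, hi in zip(uniq, uniq[1:]):
        t = (lo + hi) / 2
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        cond = (len(left) * entropy(left)
                + len(right) * entropy(right)) / len(labels)
        if base - cond > best_gain:
            best_gain, best_t = base - cond, t
    return best_t, best_gain
```

For example, humidity readings [65, 70, 90, 95] with labels ['Y', 'Y', 'N', 'N'] are separated perfectly by the midpoint threshold 80.0.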
For missing values, a specific formula adjusts the information gain to account for unknown entries.
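The article does not spell the formula out; in Quinlan's C4.5 the usual correction computes the gain only over the cases where the attribute value is known, then scales it by the fraction F of such cases. A sketch under that assumption, using `None` as the missing-value marker:

```python
from math import log2

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum(labels.count(c) / n * log2(labels.count(c) / n)
                for c in set(labels))

def gain_with_missing(feature_values, labels, missing=None):
    """Information gain discounted by the fraction of known values."""
    known = [(v, y) for v, y in zip(feature_values, labels) if v != missing]
    f = len(known) / len(labels)              # fraction F of known cases
    known_labels = [y for _, y in known]
    base = entropy(known_labels)
    cond = 0.0
    for val in set(v for v, _ in known):
        subset = [y for v, y in known if v == val]
        cond += len(subset) / len(known) * entropy(subset)
    return f * (base - cond)                  # scale gain by F
```

With one of five values missing and a perfect split on the remaining four, the raw gain of 1.0 bit is discounted to 0.8.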
In comparisons on the same data, C4.5 typically achieves higher accuracy than ID3, at the cost of longer execution time.
Python implementation (C4.5)
Below is a concise Python implementation of the core of C4.5: gain-ratio feature selection over discrete attributes (continuous-value and missing-value handling are omitted for brevity).
from math import log

def createDataSet():
    """Toy weather dataset: four discrete features and a class label."""
    dataSet = [[0, 0, 0, 0, 'N'],
               [0, 0, 0, 1, 'N'],
               [1, 0, 0, 0, 'Y'],
               [2, 1, 0, 0, 'Y'],
               [2, 2, 1, 0, 'Y'],
               [2, 2, 1, 1, 'N'],
               [1, 2, 1, 1, 'Y']]
    labels = ['outlook', 'temperature', 'humidity', 'windy']
    return dataSet, labels
def calcShannonEnt(dataSet):
    """Shannon entropy (in bits) of the class labels in dataSet."""
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]
        labelCounts[currentLabel] = labelCounts.get(currentLabel, 0) + 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)
    return shannonEnt
def chooseBestFeatureToSplit(dataSet):
    """Return the index of the feature with the highest gain ratio."""
    numFeatures = len(dataSet[0]) - 1
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGainRatio = 0.0
    bestFeature = -1
    for i in range(numFeatures):
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)
        newEntropy = 0.0
        splitInfo = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
            splitInfo -= prob * log(prob, 2)
        infoGain = baseEntropy - newEntropy
        if splitInfo == 0:              # feature takes a single value; skip it
            continue
        infoGainRatio = infoGain / splitInfo
        if infoGainRatio > bestInfoGainRatio:
            bestInfoGainRatio = infoGainRatio
            bestFeature = i
    return bestFeature
def splitDataSet(dataSet, axis, value):
    """Rows where feature `axis` equals `value`, with that column removed."""
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis] + featVec[axis+1:]
            retDataSet.append(reducedFeatVec)
    return retDataSet
def majorityCnt(classList):
    """Most common class label; used when all features are exhausted."""
    counts = {}
    for vote in classList:
        counts[vote] = counts.get(vote, 0) + 1
    return max(counts, key=counts.get)

def createTree(dataSet, labels):
    """Recursively build the tree as nested dicts keyed by feature name."""
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]                  # all samples share one class
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)        # no features left to split on
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del labels[bestFeat]
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]                # copy so branches don't share it
        myTree[bestFeatLabel][value] = createTree(
            splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree

After constructing the tree, it can be visualized with a plotting utility.
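The nested-dict trees produced this way can also be used directly for prediction by walking from the root to a leaf. A sketch, using a hypothetical tree of the shape `createTree` returns (the tree below is hand-written for illustration, not a computed output):

```python
def classify(tree, feature_names, sample):
    """Walk a nested-dict decision tree down to a leaf class label.

    `tree` has the form {feature_name: {feature_value: subtree_or_label}}.
    """
    if not isinstance(tree, dict):
        return tree                          # reached a leaf: class label
    feature = next(iter(tree))               # root feature of this subtree
    value = sample[feature_names.index(feature)]
    return classify(tree[feature][value], feature_names, sample)

# Hypothetical tree consistent with the toy weather dataset above:
example_tree = {'outlook': {0: 'N', 1: 'Y', 2: {'windy': {0: 'Y', 1: 'N'}}}}
feature_names = ['outlook', 'temperature', 'humidity', 'windy']
```

For instance, a sample with outlook 2 and windy 1 descends through the `outlook` node into the `windy` subtree and lands on the leaf 'N'.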
Conclusion
Decision trees are a widely used machine-learning technique that maps object attributes to class labels, yielding interpretable models for classification and prediction. They can also be extended to multi-output problems by building a separate tree for each target.
360 Zhihui Cloud Developer