Decision Tree Simulator
Visualize how decision tree algorithms work.
| Outlook | Temp | Humidity | Wind | Play |
|---|---|---|---|---|
| Sunny | Hot | High | No | No |
| Sunny | Hot | High | Yes | No |
| Overcast | Hot | High | No | Yes |
| Rain | Mild | High | No | Yes |
| Rain | Cool | Normal | No | Yes |
| Rain | Cool | Normal | Yes | No |
Showing 6 of 14 rows
How to Use
Visualize decision tree algorithms.
You will see:
- ID3 (Entropy), CART (Gini), C4.5
- Step-by-step splitting process
- Gain/Impurity calculations
Click "Start" to visualize the decision tree construction
The Mathematics of Decision Trees
A Decision Tree is one of the most fundamental algorithms in machine learning, used for both classification (predicting discrete labels) and regression (predicting continuous numbers). Unlike neural networks, which act as opaque "black boxes," decision trees are inherently interpretable. Every prediction can be traced down a transparent, boolean logic path from the root node to a terminal leaf.
However, computers don't magically know which questions to ask. The construction of a decision tree requires mathematically evaluating every possible feature split to find the one that best separates the data into pure, unambiguous categories. The core difference between various decision tree algorithms (ID3, CART, C4.5) lies in specifically what mathematical formula they use to define "purity."
1. Measuring "Impurity": Entropy vs. Gini
Before a tree can split a dataset, it needs to quantify how mixed (or impure) the dataset currently is. A dataset containing 50 "Yes" and 50 "No" results is highly impure. A dataset containing 100 "Yes" and 0 "No" results is perfectly pure.
ID3 Entropy & Information Gain
H(S) = - Σ (p_i * log₂(p_i))
Entropy originates from information theory. It measures the amount of "surprise" or disorder in a set. A perfect 50/50 split yields an Entropy of exactly 1.0. A perfectly pure set yields an Entropy of 0.0.
The ID3 algorithm calculates the Entropy of the entire dataset, then calculates the weighted Entropy after hypothetically splitting the data on a feature. The difference between the original Entropy and the new Entropy is the Information Gain. The algorithm chooses the feature that yields the absolute highest Information Gain.
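As a rough sketch (the function names are illustrative, not from any particular library), entropy and Information Gain for the Wind column of the six rows shown above could be computed like this:

```python
from collections import Counter
from math import log2

# (Wind value, Play label) pairs taken from the six rows shown above.
rows = [
    ("No", "No"), ("Yes", "No"), ("No", "Yes"),
    ("No", "Yes"), ("No", "Yes"), ("Yes", "No"),
]

def entropy(labels):
    """H(S) = -sum(p_i * log2(p_i)) over the class proportions."""
    total = len(labels)
    return -sum((c / total) * log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows):
    """Parent entropy minus the size-weighted entropy of each branch."""
    parent = entropy([label for _, label in rows])
    branches = {}
    for value, label in rows:          # group labels by feature value
        branches.setdefault(value, []).append(label)
    weighted = sum(len(b) / len(rows) * entropy(b)
                   for b in branches.values())
    return parent - weighted
```

For these six rows the parent set is 3 Yes / 3 No (entropy exactly 1.0), and splitting on Wind yields a gain of roughly 0.46 bits.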
CART Gini Impurity
Gini(S) = 1 - Σ (p_i)²
Gini Impurity measures the probability that a randomly chosen element from the set would be incorrectly labeled if it was randomly labeled according to the distribution of labels in the subset. A perfect 50/50 split yields a maximum Gini of 0.5.
Gini Impurity is famously used by the standard CART (Classification and Regression Trees) algorithm implemented in libraries like Scikit-Learn. It is computationally cheaper than Entropy because it avoids resource-intensive logarithms, and in practice the two criteria produce very similar trees on most datasets.
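A minimal pure-Python sketch of the Gini computation, matching the formula above:

```python
from collections import Counter

def gini(labels):
    """Gini(S) = 1 - sum(p_i^2) over the class proportions."""
    counts = Counter(labels)
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in counts.values())
```

A 50/50 set gives the maximum of 0.5, and a pure set gives 0.0, mirroring the extremes described above.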
C4.5 Gain Ratio (The ID3 Correction)
The original ID3 algorithm had a fatal flaw: Information Gain heavily favors attributes with a massive number of distinct values (like a "User ID" or "Email" column). If you split a dataset by User ID, every single resulting leaf node contains exactly 1 user. The purity is perfect (Entropy = 0), so Information Gain is maximized. But a tree using User ID is totally useless for generalizing to new users.
The C4.5 algorithm fixed this by penalizing wide branches. It calculates Split Information (how broadly the data is distributed across the split) and divides Information Gain by this factor to yield a stabilized Gain Ratio.
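To see the penalty in action, here is a hedged sketch (the helper names are made up) comparing the Gain Ratio of a "User ID"-style split against an ordinary binary split over the same labels:

```python
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total)
                for c in Counter(labels).values())

def gain_ratio(rows):
    """rows: (feature value, label) pairs. Gain / Split Information."""
    labels = [label for _, label in rows]
    branches = {}
    for value, label in rows:
        branches.setdefault(value, []).append(label)
    n = len(rows)
    gain = entropy(labels) - sum(len(b) / n * entropy(b)
                                 for b in branches.values())
    # Split Information: entropy of the branch *sizes*, not the labels.
    split_info = -sum((len(b) / n) * log2(len(b) / n)
                      for b in branches.values())
    return gain / split_info if split_info > 0 else 0.0

# A "User ID"-style split: every branch holds exactly one sample.
id_split = [(i, label) for i, label in enumerate(["Yes", "No", "Yes", "No"])]
# A sensible binary split that separates the same labels perfectly.
binary_split = [("A", "Yes"), ("A", "Yes"), ("B", "No"), ("B", "No")]
```

Both splits achieve perfect purity (Information Gain = 1.0 bit), but the ID split's wide fan-out inflates its Split Information to 2.0 bits, halving its Gain Ratio relative to the binary split.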
2. The Curse of Overfitting
If left unconstrained, a Decision Tree algorithm will continue recursively splitting the data until every single leaf node is perfectly pure (contains only 1 sample type). In machine learning, this is called Overfitting. The tree perfectly memorizes the training data, including all the statistical noise, outliers, and errors, and completely fails to generalize to real-world data.
Pre-Pruning (Early Stopping)
Instead of letting the tree grow infinitely, hyperparameters are introduced to strictly halt construction early:
- max_depth: The tree is forced to stop splitting after N levels.
- min_samples_split: A node refuses to split if it contains fewer than N samples.
- min_samples_leaf: A split is rejected if any of the resulting child leaves would contain fewer than N samples.
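These guards can be sketched as a single check. The helper below is hypothetical; only the parameter names mirror Scikit-Learn's hyperparameters:

```python
def should_split(depth, child_sizes, *, max_depth=3,
                 min_samples_split=2, min_samples_leaf=1):
    """Return True only if a proposed split passes every pre-pruning guard."""
    n_samples = sum(child_sizes)
    if depth >= max_depth:                   # already N levels deep
        return False
    if n_samples < min_samples_split:        # node too small to split at all
        return False
    if any(s < min_samples_leaf for s in child_sizes):
        return False                         # a child leaf would be too small
    return True
```

A tree builder would call this before recursing; any failed guard turns the current node into a leaf.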
Post-Pruning (Cost Complexity Pruning)
The alternative is to let the tree overfit completely, and then algorithmically prune branches off from the bottom up. Cost Complexity Pruning introduces a penalty term `alpha` (α) that mathematically weighs the tree's accuracy against the total number of leaf nodes. Branches that do not provide enough accuracy improvement to justify their complexity are collapsed back into their parent.
3. Evolution: Ensemble Methods
Because a single Decision Tree is highly sensitive to the exact data it was trained on (high variance), trees are rarely used on their own in modern production systems. Instead, they serve as the foundational building blocks for Ensemble Methods.
Random Forests (Bagging)
Instead of training one deep tree, a Random Forest trains hundreds of trees. Each tree is trained on a random subset of the data (bootstrapping), and at every split, it is only allowed to consider a random subset of the features. When making a prediction, the hundreds of trees all vote, and the majority wins. This drastically reduces the variance and risk of overfitting compared to a single tree.
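A toy sketch of the bagging-plus-voting loop. The "tree" here is a deliberate stand-in that just predicts its bootstrap sample's majority label; everything else (names included) is illustrative:

```python
import random
from collections import Counter

def train_toy_tree(sample):
    """Stand-in for a real tree: always predicts its sample's majority label."""
    majority = Counter(label for _, label in sample).most_common(1)[0][0]
    return lambda x: majority

def random_forest(data, n_trees=101, seed=0):
    """data: (features, label) pairs. Returns a majority-vote predictor."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        # Bootstrapping: draw len(data) rows *with replacement*.
        sample = [rng.choice(data) for _ in data]
        trees.append(train_toy_tree(sample))
    def predict(x):
        votes = Counter(tree(x) for tree in trees)  # every tree votes
        return votes.most_common(1)[0][0]           # majority wins
    return predict
```

A real Random Forest additionally restricts each split to a random subset of features, which decorrelates the trees further.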
Gradient Boosting Trees (XGBoost)
Instead of training hundreds of trees independently, Boosting trains trees sequentially. The first tree makes predictions (and mistakes). The second tree is explicitly trained specifically to predict the residual errors of the first tree. The third tree attempts to correct the errors of the second tree. This sequential gradient descent approach frequently dominates structured tabular data competitions on Kaggle.
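The residual-fitting loop can be sketched with one-split regression "stumps" (all names here are illustrative; real libraries like XGBoost add regularization, shrinkage schedules, and second-order gradient information):

```python
def fit_stump(x, residuals):
    """One-split regression 'tree': pick the threshold minimizing squared error."""
    best = None
    for t in x:
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        if not left or not right:
            continue
        lm = sum(left) / len(left)      # each side predicts its mean residual
        rm = sum(right) / len(right)
        err = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

def boost(x, y, n_rounds=20, lr=0.5):
    """Sequentially fit each stump to the residual errors of the running model."""
    pred = [0.0] * len(y)
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)       # this round's error-corrector
        pred = [pi + lr * stump(xi) for xi, pi in zip(x, pred)]
    return pred
```

Each round shrinks the remaining error; the learning rate `lr` damps every correction so no single stump dominates.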
Further Reading
- Scikit-Learn Official Docs: Decision Trees - Excellent mathematical formulation of Gini, Entropy, and MSE within the CART implementation.
- A Visual Introduction to Machine Learning - A beautiful, animated scrollytelling guide documenting how decision boundaries divide multidimensional feature spaces in real-time.