Decision Trees: The Algorithm You Can Draw on Paper

Many machine learning algorithms behave like black boxes: you feed in data, something mathematical happens inside, and a prediction comes out. Decision trees are different. They are one of the few algorithms you can fully explain to a non-technical colleague, draw on a whiteboard, and still trust to make accurate predictions.

🎮 The 20 Questions Analogy

You've probably played 20 Questions: one person thinks of something, and others ask yes/no questions to narrow it down. "Is it alive? Is it bigger than a car? Does it live in water?" Each answer eliminates a huge swath of possibilities until the answer becomes obvious.

A decision tree works exactly like this. Given a new data point to classify, the tree asks a series of questions about its features, following the branches that match each answer, until it reaches a leaf — a final prediction.

[Figure: a decision tree for classifying animals. It first splits on 'has wings?', then on 'lives in water?', and ends in leaf nodes with animal names. Caption: A decision tree asks a series of questions about features, narrowing down to a prediction at each leaf node.]

🌿 Anatomy of a Tree

Before we get into how trees learn, let's name the parts:

Root node — the very top question; the most important feature
Internal nodes — questions at each branch point
Branches — the paths taken based on yes/no (or value-range) answers
Leaf nodes — the endpoints; each holds a final prediction

A single data point travels from root to leaf, answering one question at each node, until it reaches a prediction.
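
The root-to-leaf walk described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular library stores its trees: the `Node` and `Leaf` classes, the `predict` function, and the toy animal tree (including the animal names) are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    prediction: str  # the final answer stored at this endpoint


@dataclass
class Node:
    feature: str                  # the yes/no question asked at this node
    if_yes: Union["Node", Leaf]   # branch followed when the answer is True
    if_no: Union["Node", Leaf]    # branch followed when the answer is False


def predict(tree, example):
    """Walk from the root to a leaf, answering one question per node."""
    node = tree
    while isinstance(node, Node):
        node = node.if_yes if example[node.feature] else node.if_no
    return node.prediction


# Hypothetical toy tree: 'has_wings' at the root, then 'lives_in_water'.
animal_tree = Node(
    feature="has_wings",
    if_yes=Node("lives_in_water", if_yes=Leaf("penguin"), if_no=Leaf("eagle")),
    if_no=Node("lives_in_water", if_yes=Leaf("shark"), if_no=Leaf("dog")),
)

print(predict(animal_tree, {"has_wings": True, "lives_in_water": False}))  # eagle
```

Each prediction costs one comparison per level of the tree, which is why trees are fast at prediction time.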

📐 How a Decision Tree Learns

The clever part: how does the algorithm decide which question to ask at each node? It tries every possible split on every feature and picks the one that best separates the data.

Information Gain and Gini Impurity

Two common measures of "best separation":

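Gini impurity measures how mixed a group is: it is the probability that two examples drawn at random from the node (with replacement) belong to different classes. A perfectly pure node, where all examples belong to one class, has a Gini impurity of 0. An evenly mixed node has the maximum impurity, which is 0.5 when there are two classes. The algorithm prefers splits that produce the purest child nodes.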
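
As a rough sketch, Gini impurity can be computed as one minus the sum of the squared class proportions in a node. The function name `gini` and the toy labels are assumptions made for this example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


print(gini(["cat", "cat", "cat", "cat"]))  # 0.0 (pure node)
print(gini(["cat", "cat", "dog", "dog"]))  # 0.5 (evenly mixed, two classes)
```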

Information gain is similar: it measures how much a split reduces uncertainty (entropy) about the class label. Higher information gain = better split.
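
A minimal sketch of the same idea in code, assuming entropy is measured in bits (log base 2) and the split produces two child nodes. The names `entropy` and `information_gain` are illustrative, not taken from any library.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of the class labels in a node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


parent = ["yes", "yes", "no", "no"]
print(information_gain(parent, ["yes", "yes"], ["no", "no"]))  # 1.0 (perfect split)
print(information_gain(parent, ["yes", "no"], ["yes", "no"]))  # 0.0 (useless split)
```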

Both measures ask the same underlying question:

after splitting on this feature, how much more certain am I about the class?
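
The exhaustive search described at the start of this section can be sketched as follows, here scored with Gini impurity. This is a simplified illustration for numeric features only, with candidate thresholds placed midway between consecutive distinct values. The function names and the toy height/weight data are assumptions made for the example, and real implementations are considerably more optimised.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values()) if n else 0.0


def best_split(rows, labels):
    """Try every threshold on every feature; keep the lowest weighted Gini."""
    best = None  # (weighted_gini, feature_index, threshold)
    n = len(rows)
    for feature in range(len(rows[0])):
        values = sorted({row[feature] for row in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, feature, threshold)
    return best


# Hypothetical toy data: [height_cm, weight_kg] -> label
rows = [[20, 4], [25, 5], [60, 30], [70, 35]]
labels = ["cat", "cat", "dog", "dog"]
print(best_split(rows, labels))  # (0.0, 0, 42.5)
```

Both features can separate this toy data perfectly; the sketch reports feature 0 only because it is tried first and later ties do not replace the current best.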

🤯

The CART algorithm (Classification and Regression Trees), introduced in 1984 by Breiman, Friedman, Olshen, and Stone, is the foundation of most modern decision tree implementations. Although it is more than four decades old, it remains one of the most widely used ML algorithms.

✂️ Overfitting and Pruning

Left unconstrained, a decision tree will keep splitting until every leaf is pure, in the extreme giving each training example its own leaf. It can then score close to 100% accuracy on the training data while generalising poorly to new data. This is overfitting.

Imagine memorising every past exam question word-for-word instead of understanding the subject. You'd ace the past papers but fail the real exam.

Pruning counters this in two ways:

Pre-pruning (early stopping) — set limits during training: maximum depth, minimum samples per leaf, minimum information gain threshold. The tree stops growing when it hits these limits.
Post-pruning — grow the full tree, then trim back branches that don't improve performance on a validation set.
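
To make pre-pruning concrete, here is a small sketch of a tree builder with two of the limits mentioned above: a maximum depth and a minimum number of samples per leaf. It is an illustration under simplifying assumptions (a single numeric feature, Gini impurity as the split criterion, no minimum-gain threshold, no post-pruning). The names `build`, `count_leaves`, and the toy data are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values()) if n else 0.0


def build(points, depth=0, max_depth=3, min_samples_leaf=1):
    """Grow a tree over (x, label) pairs, stopping early at the given limits."""
    labels = [y for _, y in points]
    majority = Counter(labels).most_common(1)[0][0]
    # Stop when the node is pure or the depth limit is reached.
    if len(set(labels)) == 1 or depth >= max_depth:
        return majority
    best = None  # (weighted_gini, threshold, left_points, right_points)
    values = sorted({x for x, _ in points})
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [p for p in points if p[0] <= t]
        right = [p for p in points if p[0] > t]
        # Pre-pruning: reject splits that would create an undersized leaf.
        if min(len(left), len(right)) < min_samples_leaf:
            continue
        score = (len(left) * gini([y for _, y in left])
                 + len(right) * gini([y for _, y in right])) / len(points)
        if best is None or score < best[0]:
            best = (score, t, left, right)
    if best is None:  # no allowed split, so this node becomes a leaf
        return majority
    _, t, left, right = best
    return {
        "threshold": t,
        "left": build(left, depth + 1, max_depth, min_samples_leaf),
        "right": build(right, depth + 1, max_depth, min_samples_leaf),
    }


def count_leaves(tree):
    """A leaf is stored as a bare label, an internal node as a dict."""
    if not isinstance(tree, dict):
        return 1
    return count_leaves(tree["left"]) + count_leaves(tree["right"])


# Hypothetical noisy data: one numeric feature, two classes.
points = [(1, "a"), (2, "a"), (3, "b"), (4, "a"),
          (5, "b"), (6, "b"), (7, "a"), (8, "b")]

unconstrained = build(points, max_depth=100)
shallow = build(points, max_depth=1)
wide_leaves = build(points, max_depth=100, min_samples_leaf=3)
print(count_leaves(unconstrained), count_leaves(shallow), count_leaves(wide_leaves))
```

On this toy data the depth-limited tree and the tree with at least three samples per leaf each end up with two leaves, while the unconstrained tree keeps splitting until every leaf is pure.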

🤔

Think about it:

A decision tree with depth 1 (a single question) is called a "decision stump". It's extremely simple — almost certainly underfitting. A tree of depth 100 with one sample per leaf is overfitting. How would you decide where to stop?

🌲 From Trees to Forests

A single decision tree is powerful but brittle: small changes in the training data can produce very different trees. The solution: grow hundreds of trees, each trained on a random sample of the data and restricted to a random subset of the features, then combine their predictions by averaging (for regression) or majority vote (for classification).

This is a Random Forest — one of the most reliable and widely-used algorithms in all of machine learning. You'll cover it in depth in a later lesson. For now, remember: individual trees are interpretable, forests are robust.
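
As a preview, the idea can be sketched with the simplest possible trees. This is a deliberately reduced illustration, not a full Random Forest: each tree is a depth-1 stump trained on a bootstrap sample and a single randomly chosen feature, whereas real implementations grow deeper trees and typically pick a fresh random feature subset at every split. All function names and the toy data are assumptions made for this example.

```python
import random
from collections import Counter


def fit_stump(rows, labels, feature):
    """Fit a depth-1 tree on one feature by minimising training errors."""
    default = Counter(labels).most_common(1)[0][0]
    best = None  # (errors, threshold, left_label, right_label)
    values = sorted({r[feature] for r in rows})
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = Counter(y for r, y in zip(rows, labels) if r[feature] <= t).most_common(1)[0][0]
        right = Counter(y for r, y in zip(rows, labels) if r[feature] > t).most_common(1)[0][0]
        errors = sum((left if r[feature] <= t else right) != y
                     for r, y in zip(rows, labels))
        if best is None or errors < best[0]:
            best = (errors, t, left, right)
    if best is None:  # only one distinct value sampled: predict the majority class
        return lambda row: default
    _, t, left, right = best
    return lambda row: left if row[feature] <= t else right


def fit_forest(rows, labels, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample and one random feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        feature = rng.randrange(len(rows[0]))           # random feature choice
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], feature))
    return forest


def predict_forest(forest, row):
    """Combine the trees by majority vote."""
    votes = Counter(tree(row) for tree in forest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: [height_cm, weight_kg] -> label
rows = [[20, 4], [25, 5], [22, 6], [60, 30], [70, 35], [65, 28]]
labels = ["cat", "cat", "cat", "dog", "dog", "dog"]
forest = fit_forest(rows, labels)
print(predict_forest(forest, [23, 5]))
```

Any single stump here may be poor, for example if its bootstrap sample happens to contain only one class, but the vote across many stumps is much less sensitive to such accidents.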

✅ Strengths and ⚠️ Weaknesses

| Strengths | Weaknesses |
|---|---|
| Fully interpretable (can be visualised) | Prone to overfitting without pruning |
| No need to normalise or scale features | Small data changes = very different trees |
| Handles both numerical and categorical features | Biased towards features with more values |
| Works without feature engineering | Not great at capturing linear relationships |
| Fast to train and predict | Single trees often underperform ensembles |
