AI Sprouts • Intermediate • ⏱️ 25 min read

Decision Trees: The Algorithm You Can Draw on Paper

Many machine learning algorithms behave like black boxes: you feed in data, something mathematical happens inside, and a prediction comes out. Decision trees are different. They are one of the few algorithms you can fully explain to a non-technical colleague, draw on a whiteboard, and still trust to make accurate predictions.


🎮 The 20 Questions Analogy

You've probably played 20 Questions: one person thinks of something, and others ask yes/no questions to narrow it down. "Is it alive? Is it bigger than a car? Does it live in water?" Each answer eliminates a huge swath of possibilities until the answer becomes obvious.

A decision tree works in much the same way. Given a new data point to classify, the tree asks a series of questions about its features, following the branch that matches each answer, until it reaches a leaf: a final prediction.

Figure: a decision tree for classifying animals. It first splits on 'has wings?', then on 'lives in water?', leading to leaf nodes with animal names. A decision tree asks a series of questions about features, narrowing down to a prediction at each leaf node.

🌿 Anatomy of a Tree

Before we get into how trees learn, let's name the parts:

  • Root node: the very top question; the split that best separates the full training set
  • Internal nodes: questions at each branch point
  • Branches: the paths taken based on yes/no (or value-range) answers
  • Leaf nodes: the endpoints; each holds a final prediction

A single data point travels from root to leaf, answering one question at each node, until it reaches a prediction.
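
The root-to-leaf walk can be sketched in a few lines of Python. This is a minimal illustration, not a library implementation: the nested-dictionary layout and the `predict` helper are assumptions made for this example, the two questions come from the animal example described earlier, and the animal names at the leaves are hypothetical.

```python
# A tiny hand-built tree. The two questions follow the animal example;
# the animal names at the leaves are hypothetical placeholders.
tree = {
    "question": "has_wings",
    "yes": {
        "question": "lives_in_water",
        "yes": {"prediction": "penguin"},
        "no": {"prediction": "eagle"},
    },
    "no": {
        "question": "lives_in_water",
        "yes": {"prediction": "dolphin"},
        "no": {"prediction": "dog"},
    },
}


def predict(node, features):
    """Walk from the root to a leaf, answering one question per node."""
    while "prediction" not in node:
        answer = features[node["question"]]
        node = node["yes"] if answer else node["no"]
    return node["prediction"]


example = {"has_wings": True, "lives_in_water": False}
print(predict(tree, example))  # eagle
```

Each pass through the loop answers one internal node's question, so the number of iterations equals the depth of the leaf that is reached.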


How a Decision Tree Learns

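The clever part: how does the algorithm decide which question to ask at each node? It tries every candidate split on every feature and picks the one that best separates the training data reaching that node. The choice is greedy: each node is optimised on its own, without looking ahead to later splits.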

Information Gain and Gini Impurity

Two common measures of "best separation":

Gini impurity measures how mixed a group is. A perfectly pure node, where all examples belong to one class, has a Gini impurity of 0. An evenly mixed node has the maximum impurity: 0.5 for two classes, and 1 - 1/K for K classes in general. The algorithm prefers splits that produce the purest child nodes.

Information gain is similar: it measures how much a split reduces uncertainty (entropy) about the class label. Higher information gain = better split.
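
Both measures are short enough to compute by hand. The sketch below uses the standard definitions (Gini impurity as 1 minus the sum of squared class proportions, entropy as the negative sum of p·log2(p)); the function names and the cat/dog labels are illustrative choices, not part of any particular library.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


pure = ["cat", "cat", "cat", "cat"]
mixed = ["cat", "cat", "dog", "dog"]

print(gini(pure))      # 0.0
print(gini(mixed))     # 0.5
print(entropy(mixed))  # 1.0
print(information_gain(mixed, ["cat", "cat"], ["dog", "dog"]))  # 1.0
```

The last line shows a perfect split: the parent holds one full bit of uncertainty, both children are pure, so the split gains the whole bit.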

Both measures ask the same underlying question: after splitting on this feature, how much more certain am I about the class?
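
To make the "try every split" search concrete, here is a minimal sketch for numeric features, using size-weighted Gini impurity as the score. It is one plausible way to write the search, not the exact procedure of any library: the `best_split` name, the choice of midpoints between consecutive distinct values as candidate thresholds, and the toy weight/height data are all assumptions for this example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Try every threshold on every feature and return the split with the
    lowest size-weighted Gini impurity across the two child nodes."""
    best = None  # (weighted_impurity, feature_index, threshold)
    n = len(rows)
    for feature in range(len(rows[0])):
        values = sorted({row[feature] for row in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, feature, threshold)
    return best


# Hypothetical toy data: [weight_kg, height_cm] -> label
rows = [[4, 25], [5, 30], [6, 28], [30, 60], [35, 65], [28, 55]]
labels = ["cat", "cat", "cat", "dog", "dog", "dog"]

score, feature, threshold = best_split(rows, labels)
print(score, feature, threshold)  # 0.0 0 17.0
```

Here the question "is weight <= 17.0?" separates the two classes perfectly, so both children have zero impurity. Height would also separate them, but ties are resolved in favour of the first split found.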
🤯

The CART algorithm (Classification and Regression Trees), introduced in 1984 by Breiman, Friedman, Olshen, and Stone, is the foundation of most modern decision tree implementations. Despite being 40 years old, it remains one of the most widely used ML algorithms.


โœ‚๏ธ Overfitting and Pruning

Left unconstrained, a decision tree will keep growing until every leaf is pure, in the extreme giving each training example its own leaf. It can then reach 100% accuracy on the training data while generalising poorly to new data. This is overfitting.

Imagine memorising every past exam question word-for-word instead of understanding the subject. You'd ace the past papers but fail the real exam.

Two main remedies:

  1. Pre-pruning (early stopping): set limits during training, such as maximum depth, minimum samples per leaf, or a minimum information gain threshold. The tree stops growing when it hits these limits.

  2. Post-pruning: grow the full tree, then trim back branches that don't improve performance on a validation set.
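
Pre-pruning is easy to see in code. The sketch below grows a tree on a single numeric feature and accepts two limits, `max_depth` and `min_samples_leaf`; these names mirror parameters of scikit-learn's `DecisionTreeClassifier`, but the builder itself is a simplified illustration written for this lesson, and the noisy toy labels are made up.

```python
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def majority(labels):
    return Counter(labels).most_common(1)[0][0]


def build(xs, ys, depth=0, max_depth=None, min_samples_leaf=1):
    """Grow a tree on one numeric feature, honouring two pre-pruning limits."""
    depth_reached = max_depth is not None and depth >= max_depth
    if len(set(ys)) == 1 or depth_reached:
        return {"prediction": majority(ys)}
    best = None  # (weighted_impurity, threshold)
    values = sorted(set(xs))
    for low, high in zip(values, values[1:]):
        t = (low + high) / 2
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        # Pre-pruning: skip splits that would create a leaf that is too small.
        if min(len(left), len(right)) < min_samples_leaf:
            continue
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if best is None or score < best[0]:
            best = (score, t)
    if best is None:  # no admissible split left, so this node becomes a leaf
        return {"prediction": majority(ys)}
    t = best[1]
    left = [(x, y) for x, y in zip(xs, ys) if x <= t]
    right = [(x, y) for x, y in zip(xs, ys) if x > t]
    return {
        "threshold": t,
        "left": build([x for x, _ in left], [y for _, y in left],
                      depth + 1, max_depth, min_samples_leaf),
        "right": build([x for x, _ in right], [y for _, y in right],
                       depth + 1, max_depth, min_samples_leaf),
    }


def predict(node, x):
    while "prediction" not in node:
        node = node["left"] if x <= node["threshold"] else node["right"]
    return node["prediction"]


def depth_of(node):
    if "prediction" in node:
        return 0
    return 1 + max(depth_of(node["left"]), depth_of(node["right"]))


def accuracy(tree, xs, ys):
    return sum(predict(tree, x) == y for x, y in zip(xs, ys)) / len(ys)


# Hypothetical noisy data: the labels flip back and forth along the feature.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = ["a", "a", "b", "a", "b", "b", "a", "b"]

full = build(xs, ys)                                     # no limits
pruned = build(xs, ys, max_depth=2, min_samples_leaf=2)  # pre-pruned
print(depth_of(full), accuracy(full, xs, ys))
print(depth_of(pruned), accuracy(pruned, xs, ys))
```

The unconstrained tree memorises every flip in the labels and scores perfectly on the training data; the pre-pruned tree is shallower and accepts some training errors in exchange for a simpler rule. Post-pruning is not shown here; in scikit-learn it is available as cost-complexity pruning through the `ccp_alpha` parameter.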

🤔
Think about it:

A decision tree with depth 1 (a single question) is called a "decision stump". It's extremely simple and almost certainly underfitting. A tree of depth 100 with one sample per leaf is almost certainly overfitting. How would you decide where to stop?


🌲 From Trees to Forests

A single decision tree is powerful but brittle: small changes in the training data can produce very different trees. The solution: grow hundreds of trees, each trained on a random (bootstrap) sample of the data and restricted to a random subset of features, then combine their predictions by majority vote for classification or by averaging for regression.

This is a Random Forest, one of the most reliable and widely used algorithms in all of machine learning. You'll cover it in depth in a later lesson. For now, remember: individual trees are interpretable, forests are robust.
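
As a preview, the core idea fits in a short sketch. To stay brief it deliberately simplifies: each "tree" is only a decision stump, and each stump is given one randomly chosen feature rather than sampling features at every split as a full Random Forest does. The function names, the seed, and the toy data are all assumptions for illustration.

```python
import random
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def majority(labels):
    return Counter(labels).most_common(1)[0][0]


def fit_stump(rows, labels, feature):
    """Depth-1 tree on one feature: best threshold by weighted Gini."""
    values = sorted({row[feature] for row in rows})
    best = None  # (score, threshold, left_label, right_label)
    for low, high in zip(values, values[1:]):
        t = (low + high) / 2
        left = [y for r, y in zip(rows, labels) if r[feature] <= t]
        right = [y for r, y in zip(rows, labels) if r[feature] > t]
        score = len(left) * gini(left) + len(right) * gini(right)
        if best is None or score < best[0]:
            best = (score, t, majority(left), majority(right))
    if best is None:  # only one distinct value in the sample: cannot split
        label = majority(labels)
        return (feature, None, label, label)
    return (feature, best[1], best[2], best[3])


def predict_stump(stump, row):
    feature, t, left_label, right_label = stump
    if t is None:
        return left_label
    return left_label if row[feature] <= t else right_label


def fit_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Bootstrap sample: draw len(rows) examples with replacement.
        idx = [rng.randrange(len(rows)) for _ in rows]
        # Simplification: one random feature per stump.
        feature = rng.randrange(len(rows[0]))
        sample_rows = [rows[i] for i in idx]
        sample_labels = [labels[i] for i in idx]
        forest.append(fit_stump(sample_rows, sample_labels, feature))
    return forest


def predict_forest(forest, row):
    """Classification: majority vote across all trees."""
    return majority([predict_stump(stump, row) for stump in forest])


# Hypothetical toy data: [weight_kg, height_cm] -> label
rows = [[4, 25], [5, 30], [6, 28], [30, 60], [35, 65], [28, 55]]
labels = ["cat", "cat", "cat", "dog", "dog", "dog"]

forest = fit_forest(rows, labels)
print(predict_forest(forest, [5, 27]))
print(predict_forest(forest, [32, 62]))
```

An individual stump trained on an unlucky bootstrap sample may see only one class and predict it for everything, but the vote across many stumps smooths out such accidents. That is the robustness the forest buys at the cost of interpretability.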


✅ Strengths and ⚠️ Weaknesses

| Strengths | Weaknesses |
|---|---|
| Fully interpretable; can be visualised | Prone to overfitting without pruning |
| No need to normalise or scale features | Small data changes = very different trees |
| Handles both numerical and categorical features | Biased towards features with more values |
| Works without feature engineering | Not great at capturing linear relationships |
| Fast to train and predict | Single trees often underperform ensembles |