What is the terminology for removing branches from a decision tree to avoid overfitting the model?

Costs. Benefits. Probabilities. For data analysts (or just human beings!) these concepts are key to our daily decision-making. Whether you realize it or not, you’re constantly weighing up each one, from deciding which brand of detergent to buy to the best plan of action for your business. For those working in data analytics and machine learning, this thinking process can be formalized into an algorithm known as a ‘decision tree.’

But what exactly is a decision tree? This post provides a short introduction to the concept of decision trees, how they work, and how you can use them to sort complex data in a logical, visual way. Whether you’re a newly-qualified data analyst or just curious about the field, by the end of this post you should be well-placed to explore the concept in more depth.

We’ll cover the following:

1. What is a decision tree?
2. What are the different parts of a decision tree?
3. An example of a simple decision tree
4. Pros and cons of decision trees
5. What are decision trees used for?

1. What is a decision tree?

In its simplest form, a decision tree is a type of flowchart that shows a clear pathway to a decision. In terms of data analytics, it is a type of algorithm that includes conditional ‘control’ statements to classify data. A decision tree starts at a single point (or ‘node’) which then branches (or ‘splits’) in two or more directions. Each branch offers different possible outcomes, incorporating a variety of decisions and chance events until a final outcome is achieved. When shown visually, their appearance is tree-like…hence the name!

Decision trees are extremely useful for data analytics and machine learning because they break down complex data into more manageable parts. They’re often used in these fields for prediction analysis, data classification, and regression. Don’t worry if this all sounds a bit abstract—we’ll provide some examples below to help clear things up. First though, let’s look at the different aspects that make up a decision tree.

2. What are the different parts of a decision tree?

Decision trees can deal with complex data, which is part of what makes them useful. However, this doesn’t mean that they are difficult to understand. At their core, all decision trees ultimately consist of just three key parts, or ‘nodes’:

  • Decision nodes: Representing a decision (typically shown with a square)
  • Chance nodes: Representing probability or uncertainty (typically denoted by a circle)
  • End nodes: Representing an outcome (typically shown with a triangle)

Connecting these different nodes are what we call ‘branches’. Nodes and branches can be used over and over again in any number of combinations to create trees of various complexity. Let’s see how these parts look before we add any data.

Luckily, a lot of decision tree terminology follows the tree analogy, which makes it much easier to remember! Some other terms you might come across will include:

Root nodes

In the diagram above, the blue decision node is what we call a ‘root node.’ This is always the first node in the path. It is the node from which all other decision, chance, and end nodes eventually branch.

Leaf nodes

In the diagram above, the lilac end nodes are what we call ‘leaf nodes.’ These show the end of a decision path (or outcome). You can always identify a leaf node because it doesn’t split, or branch any further. Just like a real leaf!

Internal nodes

Between the root node and the leaf nodes, we can have any number of internal nodes. These can include decisions and chance nodes (for simplicity, this diagram only uses chance nodes). It’s easy to identify an internal node—each one has branches of its own while also connecting to a previous node.

Splitting

Branching, or ‘splitting,’ is what we call it when any node divides into two or more sub-nodes. These sub-nodes can be other internal nodes, or they can lead to an outcome (a leaf/end node).

Pruning

Sometimes decision trees can grow quite complex. In these cases, they can end up giving too much weight to irrelevant data. To avoid this problem, we can remove certain nodes using a process known as ‘pruning’. Pruning is exactly what it sounds like—if the tree grows branches we don’t need, we simply cut them off. Easy!
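In practice, decision-tree libraries usually expose pruning as a parameter rather than a manual step. Here is a minimal sketch of cost-complexity pruning with scikit-learn; the iris dataset and the ccp_alpha value are purely illustrative stand-ins, not part of the original example:

```python
# Minimal pruning sketch (illustrative dataset and parameters).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unpruned tree keeps splitting until every leaf is pure, so it can overfit.
unpruned = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Cost-complexity pruning: a larger ccp_alpha cuts off more branches.
pruned = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X_train, y_train)

print("unpruned leaves:", unpruned.get_n_leaves(), "test accuracy:", unpruned.score(X_test, y_test))
print("pruned leaves:  ", pruned.get_n_leaves(), "test accuracy:", pruned.score(X_test, y_test))
```

The pruned tree has fewer leaves; whether its test accuracy improves depends on the data, which is exactly the trade-off pruning is meant to manage.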

3. An example of a simple decision tree

Now that we’ve covered the basics, let’s see how a decision tree might look. We’ll keep it really simple. Let’s say that we’re trying to classify what options are available to us if we are hungry. We might show this as follows:

In this diagram, our different options are laid out in a clear, visual way. Decision nodes are navy blue, chance nodes are light blue, and end nodes are purple. It is easy for anybody to understand and to see the possible outcomes.

However, let’s not forget: our aim was to classify what to do in the event of being hungry. By including options for what to do in the event of not being hungry, we’ve overcomplicated our decision tree. Cluttering a tree in this way is a common problem, especially when dealing with large amounts of data. It often results in the algorithm extracting meaning from irrelevant information. This is known as overfitting. One option to fix overfitting is simply to prune the tree:

As you can see, the focus of our decision tree is now much clearer. By removing the irrelevant information (i.e. what to do if we’re not hungry) our outcomes are focused on the goal we’re aiming for. This is one example of a pitfall that decision trees can fall into, and how to get around it. However, there are several pros and cons for decision trees. Let’s touch on these next.

4. Pros and cons of decision trees

Used effectively, decision trees are very powerful tools. Nevertheless, like any algorithm, they’re not suited to every situation. Here are some key advantages and disadvantages of decision trees.

Advantages of decision trees

  • Good for interpreting data in a highly visual way.
  • Good for handling a combination of numerical and non-numerical data.
  • Easy to define rules, e.g. ‘yes, no, if, then, else…’
  • Requires minimal preparation or data cleaning before use.
  • Great way to choose between best, worst, and likely case scenarios.
  • Can be easily combined with other decision-making techniques.

Disadvantages of decision trees

  • Overfitting (where a model interprets meaning from irrelevant data) can become a problem if a decision tree’s design is too complex.
  • They are not well-suited to continuous variables (i.e. variables that can take any value within a range, rather than a fixed set of categories).
  • In predictive analysis, calculations can quickly grow cumbersome, especially when a decision path includes many chance variables.
  • When using an imbalanced dataset (i.e. where one class of data dominates over another) it is easy for outcomes to be biased in favor of the dominant class.
  • Generally, decision trees provide lower prediction accuracy compared to other predictive algorithms.

5. What are decision trees used for?

Despite their drawbacks, decision trees are still a powerful and popular tool. They’re commonly used by data analysts to carry out predictive analysis (e.g. to develop operations strategies in businesses). They’re also a popular tool for machine learning and artificial intelligence, where they’re used as training algorithms for supervised learning (i.e. categorizing data based on different tests, such as ‘yes’ or ‘no’ classifiers.)

Broadly, decision trees are used in a wide range of industries, to solve many types of problems. Because of their flexibility, they’re used in sectors from technology and health to financial planning. Examples include:

  • A technology business evaluating expansion opportunities based on analysis of past sales data.
  • A toy company deciding where to target its limited advertising budget, based on what demographic data suggests customers are likely to buy.
  • Banks and mortgage providers using historical data to predict how likely it is that a borrower will default on their payments.
  • Emergency room triage might use decision trees to prioritize patient care (based on factors such as age, gender, symptoms, etc.)
  • Automated telephone systems guiding you to the outcome you need, e.g. ‘For option A, press 1. For option B, press 2’, and so on.

As you can see, there are many uses for decision trees!

Decision trees are straightforward to understand, yet excellent for complex datasets. This makes them a highly versatile tool. Let’s summarize:

  • Decision trees are composed of three main parts—decision nodes (denoting choice), chance nodes (denoting probability), and end nodes (denoting outcomes).
  • Decision trees can be used to deal with complex datasets, and can be pruned if necessary to avoid overfitting.
  • Despite having many benefits, decision trees are not suited to all types of data, e.g. continuous variables or imbalanced datasets.
  • They are popular in data analytics and machine learning, with practical applications across sectors from health, to finance, and technology.

Now that you’ve learned the basics, you’re ready to explore this versatile algorithm in more depth. We’ll cover some useful applications of decision trees in more detail in future posts.


Decision Tree Algorithms - Part 1


1. Introduction

Decision trees are a non-parametric supervised learning approach that can be applied to both regression and classification problems. In keeping with the tree analogy, decision trees implement a sequential decision process. Starting from the root node, a feature is evaluated and one of the two branches (child nodes) is selected. Each node in the tree is essentially a decision rule. This procedure is repeated until a final leaf is reached, which normally represents the prediction target. Decision trees are also attractive models if we care about interpretability.

2. Various Decision Tree Algorithms

There are several well-known algorithms for creating decision trees:

  • ID3 (Iterative Dichotomiser 3) was developed in 1986 by Ross Quinlan. The algorithm creates a multiway tree, finding for each node (i.e. in a greedy manner) the categorical feature that will yield the largest information gain for categorical targets. Trees are grown to their maximum size and then a pruning step is usually applied to improve the ability of the tree to generalise to unseen data [1].
  • C4.5, developed in 1993 by Ross Quinlan, is the successor to ID3. It removed the restriction that features must be categorical by dynamically defining a discrete attribute (based on numerical variables) that partitions the continuous attribute values into a discrete set of intervals. C4.5 converts the trained trees (i.e. the output of the ID3 algorithm) into sets of if-then rules. The accuracy of each rule is then evaluated to determine the order in which they should be applied. Pruning is done by removing a rule’s precondition if the accuracy of the rule improves without it [1].
  • C5.0 is Quinlan’s latest version, released under a proprietary license. It uses less memory and builds smaller rulesets than C4.5 while being more accurate [1].
  • CART (Classification and Regression trees) is very similar to C4.5, but it differs in that it supports numerical target variables (regression) and does not compute rule sets. CART constructs binary trees using the feature and threshold that yield the largest information gain at each node [1].

scikit-learn uses an optimised version of the CART algorithm.
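To make this concrete, here is a minimal, illustrative sketch of training a CART-style tree with scikit-learn’s DecisionTreeClassifier; the iris dataset, depth limit, and printed rules are just stand-ins for whatever data you are working with:

```python
# Minimal sketch: scikit-learn's CART-based decision tree classifier.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
X, y = data.data, data.target

# criterion="gini" (the default) or "entropy" chooses the impurity measure.
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
clf.fit(X, y)

# Print the learned splits as human-readable if-then rules.
print(export_text(clf, feature_names=data.feature_names))
```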

3. Decision Tree Terminology

In keeping with the tree analogy, much of the terminology is borrowed from the parts of an actual tree [2].

  • Root node: the first node in the decision tree
  • Splitting: the process of dividing a node into two or more sub-nodes, starting from the root node
  • Node: the result of splitting, from the root node into sub-nodes and from sub-nodes into further sub-nodes
  • Leaf or terminal node: a node at the end of a branch, which cannot be split any further
  • Pruning: a technique for reducing the size of the decision tree by removing sub-nodes. The aim is to reduce complexity, improve predictive accuracy, and avoid overfitting
  • Branch / sub-tree: a subsection of the entire tree
  • Parent and child node: a node that is divided into sub-nodes is called the parent of those sub-nodes, whereas the sub-nodes are the children of the parent node

4. Decision Tree Intuition

Let’s consider the following example, where a decision tree is used to predict the salary of a baseball player:


We use the Hitters data set to predict a baseball player’s Salary (mean log salary) based on Years (the number of years that he has played in the major leagues) and Hits (the number of hits that he made in the previous year).

Based on the features, the decision tree model learns a series of splitting rules, starting at the top of the tree (root node).

  1. The root node splits on the rule Years < 4.5, sending players who satisfy it to the left branch. The players in the dataset with Years < 4.5 have a mean log salary of 5.107, so we predict a salary of e^5.107 thousands of dollars, i.e. $165,174, for these players.
  2. Players with Years >= 4.5 are assigned to the right branch, and that group is further subdivided: those with Hits < 117.5 have a mean log salary of 6.00.
  3. Players with Years >= 4.5 and Hits >= 117.5 have a mean log salary of 6.74.

In this case, we can see that the decision tree segments the players into three regions, and these regions determine the predicted salaries of the baseball players; the boundaries between the regions are the decision boundaries [3].

These three regions can be written as:

  • R1 = {X | Years < 4.5}
  • R2 = {X | Years >= 4.5, Hits < 117.5}
  • R3 = {X | Years >= 4.5, Hits >= 117.5}

This intuition suggests a process by which a decision tree splits on features to form regions that can predict the salary of baseball players. This process will be explained in more detail in the next article (Decision Tree Algorithms - Part 2).
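As a rough sketch of how this tree could be reproduced in code: the Hitters data is not bundled with scikit-learn, so the snippet below assumes it has been saved locally as a CSV (the filename and column names are assumptions based on the standard version of the dataset):

```python
# Regression-tree sketch for the Hitters example (assumes a local "hitters.csv"
# with Years, Hits, and Salary columns; rows with missing Salary are dropped).
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

df = pd.read_csv("hitters.csv").dropna(subset=["Salary"])
X = df[["Years", "Hits"]]
y = np.log(df["Salary"])  # the tree predicts mean log salary

# Limiting the tree to three leaves mirrors the coarse three-region split above.
tree = DecisionTreeRegressor(max_leaf_nodes=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["Years", "Hits"]))
```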

5. Splitting in Decision Trees

In order to split the nodes on the most informative features, we start at the tree root and split the data on the feature that results in the largest information gain (IG). In other words, the objective function is to maximize the information gain at each split, which we define as follows:

$$IG(D_p, f) = I(D_p) - \sum_{j=1}^{m} \frac{N_j}{N_p} I(D_j)$$

Here, f is the feature on which the split is performed, D_p and D_j are the datasets of the parent and the j-th child node, I is our impurity measure, N_p is the total number of samples at the parent node, and N_j is the number of samples in the j-th child node.

As we can see, the information gain is simply the difference between the impurity of the parent node and the weighted sum of the child node impurities: the lower the impurity of the child nodes, the larger the information gain. However, for simplicity and to reduce the combinatorial search space, most libraries (including scikit-learn) implement binary decision trees. This means that each parent node is split into two child nodes, D_left and D_right:

$$IG(D_p, f) = I(D_p) - \frac{N_{left}}{N_p} I(D_{left}) - \frac{N_{right}}{N_p} I(D_{right})$$

The three impurity measures, or splitting criteria, that are commonly used in binary decision trees are the Gini impurity (I_G), entropy (I_H), and the misclassification error (I_E) [4].
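To make the formula concrete, here is a small sketch (not from the original article) that computes the information gain of a binary split for any impurity function passed in; the Gini, entropy, and misclassification measures defined below can all be plugged in:

```python
import numpy as np

def class_proportions(labels):
    """Proportions of each class among the labels of one node."""
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

def information_gain(parent, left, right, impurity):
    """IG(D_p, f) = I(D_p) - N_left/N_p * I(D_left) - N_right/N_p * I(D_right)."""
    n = len(parent)
    return (impurity(class_proportions(parent))
            - len(left) / n * impurity(class_proportions(left))
            - len(right) / n * impurity(class_proportions(right)))
```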

5.1 Gini Impurity

According to Wikipedia [5],

Used by the CART (classification and regression tree) algorithm for classification trees, Gini impurity is a measure of how often a randomly chosen element from the set would be incorrectly labeled if it was randomly labeled according to the distribution of labels in the subset.

Mathematically, we can write the Gini impurity as follows:

$$I_G = 1 - \sum_{j} p_j^2$$

where j indexes the classes present in the node and p_j is the proportion of samples belonging to class j in the node.
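As a quick sketch, here is the same formula in Python, taking the vector of class proportions p (this helper is illustrative, not from the original article):

```python
import numpy as np

def gini(p):
    """Gini impurity: 1 - sum_j p_j^2, given the class proportions p."""
    p = np.asarray(p)
    return 1.0 - np.sum(p ** 2)

print(gini([0.5, 0.5]))  # 0.5: maximum impurity for two classes
print(gini([1.0, 0.0]))  # 0.0: a perfectly pure node
```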

Let’s run a simple simulation with the Heart Disease dataset, which has 303 rows and 13 attributes. The target consists of 138 samples with the value 0 and 165 samples with the value 1.


In order to build a decision tree from the dataset and to determine which split is best, we need a way to measure and compare the Gini impurity of each attribute. The attribute with the lowest Gini impurity on the first iteration will become the root node. For a two-class target, the Gini impurity of a node can be written as:

$$I_G = 1 - \left(p_0^2 + p_1^2\right)$$

In this simulation, we only use the sex, fbs (fasting blood sugar), exang (exercise-induced angina), and target attributes.

How to measure the Gini impurity of the sex attribute


Gini Impurity — Left Node


Gini Impurity — Right Node


Now that we have measured the Gini impurity for both leaf nodes, we can calculate the total Gini impurity as a weighted average. The left node represents 138 patients, while the right node represents 165 patients.

Total Gini Impurity — Leaf Node


How to measure the Gini impurity of the fbs (fasting blood sugar) attribute


Gini Impurity — Left Node


Gini Impurity — Right Node


Total Gini Impurity — Leaf Node


How to measure the Gini impurity of the exang (exercise-induced angina) attribute


Gini Impurity — Left Node


Gini Impurity — Right Node


Total Gini Impurity — Leaf Node


Fbs (fasting blood sugar) has the lowest total Gini impurity, so we will use it as the root node.
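The exact numbers in the original walkthrough were shown in images, but the calculation itself is easy to reproduce. The sketch below assumes the Heart Disease data has been loaded into a pandas DataFrame from a local file (the filename "heart.csv" is an assumption) with sex, fbs, exang, and target columns:

```python
import pandas as pd

def gini_of_node(labels):
    """Gini impurity of one node, computed from its class labels."""
    p = labels.value_counts(normalize=True)
    return 1.0 - (p ** 2).sum()

def weighted_gini(df, attribute, target="target"):
    """Total (weighted-average) Gini impurity after splitting on one attribute."""
    total = 0.0
    for _, group in df.groupby(attribute):
        total += len(group) / len(df) * gini_of_node(group[target])
    return total

df = pd.read_csv("heart.csv")  # assumed local copy of the Heart Disease data
for attribute in ["sex", "fbs", "exang"]:
    print(attribute, round(weighted_gini(df, attribute), 4))
# The attribute with the lowest weighted Gini impurity becomes the root node.
```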

5.2 Entropy

Entropy is used by the ID3, C4.5, and C5.0 tree-generation algorithms. Information gain is based on the concept of entropy, which is defined as [5]:

$$I_H = -\sum_{j} p_j \log_2 p_j$$

where j indexes the classes present in the node and p_j is the proportion of samples belonging to class j in the node.
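As before, here is a small illustrative helper (not from the original article) that computes this from the class proportions:

```python
import numpy as np

def entropy(p):
    """Entropy: -sum_j p_j * log2(p_j), given the class proportions p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # treat 0 * log2(0) as 0
    return -np.sum(p * np.log2(p))

print(entropy([0.5, 0.5]))  # 1.0: maximum for two classes
print(entropy([1.0, 0.0]))  # 0.0: a perfectly pure node
```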

In the same case, with the same dataset, we need a way to measure and compare the entropy of each attribute. The attribute that yields the largest information gain (equivalently, the lowest weighted entropy of its child nodes) on the first iteration will become the root node.

We need to calculate the entropy of the target attribute first.


How to measure the entropy of the sex attribute


Entropy — Sex = 0


Entropy — Sex = 1


Now that we have measured the entropy for both leaf nodes, we take the weighted average again to calculate the total entropy value.

Entropy — Sex


How to measure the entropy of the fbs attribute


Entropy — Fbs = 0


Entropy — Fbs = 1


Entropy — Fbs


How to measure the entropy of the exang attribute


Entropy — Exang = 0


Entropy — Exang = 1


Entropy — Exang


Fbs (fasting blood sugar) yields the largest information gain (the lowest weighted entropy), so we will use it as the root node. This is precisely the same result we got from the Gini impurity.
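The entropy-based comparison can be reproduced with a sketch analogous to the Gini one above (using the same assumed heart.csv file and columns):

```python
import numpy as np
import pandas as pd

def entropy_of_node(labels):
    """Entropy of one node, computed from its class labels."""
    p = labels.value_counts(normalize=True).to_numpy()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def information_gain_for(df, attribute, target="target"):
    """Parent entropy minus the weighted average entropy of the child nodes."""
    parent = entropy_of_node(df[target])
    children = sum(
        len(group) / len(df) * entropy_of_node(group[target])
        for _, group in df.groupby(attribute)
    )
    return parent - children

df = pd.read_csv("heart.csv")  # assumed local copy of the Heart Disease data
for attribute in ["sex", "fbs", "exang"]:
    print(attribute, round(information_gain_for(df, attribute), 4))
# The attribute with the largest information gain becomes the root node.
```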

5.3 Misclassification Impurity

Another impurity measure is the misclassification impurity. Mathematically, we can write it as follows:

$$I_E = 1 - \max_{j} p_j$$

In terms of quality performance, this index is not the best choice because it’s not particularly sensitive to different probability distributions (which can easily drive the selection to a subdivision using Gini or entropy)[6].
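For completeness, here is a small illustrative comparison of the three measures on the same node (the class proportions below are made up for illustration):

```python
import numpy as np

def misclassification_error(p):
    """Misclassification impurity: 1 - max_j p_j, given the class proportions p."""
    return 1.0 - np.max(np.asarray(p))

p = [0.7, 0.3]
print(misclassification_error(p))           # 0.3
print(1.0 - sum(pi ** 2 for pi in p))       # Gini impurity: 0.42
print(-sum(pi * np.log2(pi) for pi in p))   # entropy: about 0.881
```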

6. Advantages and Disadvantages of Trees

Advantages

  1. Trees are very easy to explain to people. In fact, they are even easier to explain than linear regression!
  2. Some people believe that decision trees more closely mirror human decision-making than do the regression and classification approaches seen in previous chapters.
  3. Trees can be displayed graphically, and are easily interpreted even by a non-expert (especially if they are small).
  4. Trees can easily handle qualitative predictors without the need to create dummy variables [3].

Disadvantages

  1. Unfortunately, trees generally do not have the same level of predictive accuracy as some of the other regression and classification approaches [3].

Continue Learning — Splitting Process in Decision Trees

*The scikit-learn package implements CART as its default decision tree algorithm.

About Me

I’m a data scientist focused on machine learning and deep learning. You can reach me on Medium and LinkedIn.

My Website : https://komuternak.com/

Reference