Costs. Benefits. Probabilities. For data analysts (or just human beings!) these concepts are key to our daily decision-making. Whether you realize it or not, you’re constantly weighing up each one: from deciding which brand of detergent to buy, to the best plan of action for your business. For those working in data analytics and machine learning, we can formalize this thinking process into an algorithm known as a ‘decision tree.’
But what exactly is a decision tree? This post provides a short introduction to the concept of decision trees, how they work, and how you can use them to sort complex data in a logical, visual way. Whether you’re a newly-qualified data analyst or just curious about the field, by the end of this post you should be well-placed to explore the concept in more depth. We’ll cover the following:

1. What is a decision tree?

In its simplest form, a decision tree is a type of flowchart that shows a clear pathway to a decision. In terms of data analytics, it is a type of algorithm that uses conditional ‘control’ statements to classify data. A decision tree starts at a single point (or ‘node’) which then branches (or ‘splits’) in two or more directions. Each branch offers different possible outcomes, incorporating a variety of decisions and chance events until a final outcome is achieved. When shown visually, their appearance is tree-like…hence the name!

Decision trees are extremely useful for data analytics and machine learning because they break down complex data into more manageable parts. They’re often used in these fields for prediction analysis, data classification, and regression. Don’t worry if this all sounds a bit abstract—we’ll provide some examples below to help clear things up. First though, let’s look at the different aspects that make up a decision tree.

2. What are the different parts of a decision tree?

Decision trees can deal with complex data, which is part of what makes them useful. However, this doesn’t mean that they are difficult to understand. At their core, all decision trees ultimately consist of just three key parts, or ‘nodes’: decision nodes (where a choice is made), chance nodes (where an uncertain outcome occurs), and end nodes (where a final outcome is reached).
Connecting these different nodes are what we call ‘branches’. Nodes and branches can be used over and over again in any number of combinations to create trees of varying complexity. Let’s see how these parts look before we add any data. Luckily, a lot of decision tree terminology follows the tree analogy, which makes it much easier to remember! Some other terms you might come across include:

Root nodes

In the diagram above, the blue decision node is what we call a ‘root node.’ This is always the first node in the path. It is the node from which all other decision, chance, and end nodes eventually branch.

Leaf nodes

In the diagram above, the lilac end nodes are what we call ‘leaf nodes.’ These show the end of a decision path (or outcome). You can always identify a leaf node because it doesn’t split or branch any further. Just like a real leaf!

Internal nodes

Between the root node and the leaf nodes, we can have any number of internal nodes. These can include decision and chance nodes (for simplicity, this diagram only uses chance nodes). It’s easy to identify an internal node—each one has branches of its own while also connecting to a previous node.

Splitting

Branching or ‘splitting’ is what we call it when any node divides into two or more sub-nodes. These sub-nodes can be other internal nodes, or they can lead to an outcome (a leaf/end node).

Pruning

Sometimes decision trees can grow quite complex. In these cases, they can end up giving too much weight to irrelevant data. To avoid this problem, we can remove certain nodes using a process known as ‘pruning’. Pruning is exactly what it sounds like—if the tree grows branches we don’t need, we simply cut them off. Easy!

3. An example of a simple decision tree

Now that we’ve covered the basics, let’s see how a decision tree might look. We’ll keep it really simple. Let’s say that we’re trying to classify what options are available to us if we are hungry. We might show this as follows:

In this diagram, our different options are laid out in a clear, visual way. Decision nodes are navy blue, chance nodes are light blue, and end nodes are purple. It is easy for anybody to understand and to see the possible outcomes. However, let’s not forget: our aim was to classify what to do in the event of being hungry. By including options for what to do in the event of not being hungry, we’ve overcomplicated our decision tree. Cluttering a tree in this way is a common problem, especially when dealing with large amounts of data. It often results in the algorithm extracting meaning from irrelevant information. This is known as overfitting. One option to fix overfitting is simply to prune the tree:

As you can see, the focus of our decision tree is now much clearer. By removing the irrelevant information (i.e. what to do if we’re not hungry), our outcomes are focused on the goal we’re aiming for. This is one example of a pitfall that decision trees can fall into, and how to get around it. However, there are several pros and cons for decision trees. Let’s touch on these next.

4. Pros and cons of decision trees

Used effectively, decision trees are very powerful tools. Nevertheless, like any algorithm, they’re not suited to every situation. Here are some key advantages and disadvantages of decision trees.

Advantages of decision trees
Disadvantages of decision trees
5. What are decision trees used for?

Despite their drawbacks, decision trees are still a powerful and popular tool. They’re commonly used by data analysts to carry out predictive analysis (e.g. to develop operations strategies in businesses). They’re also a popular tool for machine learning and artificial intelligence, where they’re used as training algorithms for supervised learning (i.e. categorizing data based on different tests, such as ‘yes’ or ‘no’ classifiers; see the short sketch below). Broadly, decision trees are used in a wide range of industries to solve many types of problems; because of their flexibility, examples include applications in sectors from technology and health to financial planning.
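To make the idea of a ‘yes’/‘no’ classifier concrete, here is a minimal sketch using scikit-learn’s DecisionTreeClassifier. The tiny “hungry or not” dataset, its feature names, and its labels are invented for illustration only; they are not from this article.

```python
# A minimal decision-tree classifier sketch (illustrative toy data, not from the article).
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy features: [is_hungry (1/0), has_food_at_home (1/0)]
X = [
    [1, 1],  # hungry, food at home
    [1, 0],  # hungry, no food at home
    [0, 1],  # not hungry
    [0, 0],  # not hungry
]
# Toy labels: what we decide to do
y = ["cook", "order takeout", "do nothing", "do nothing"]

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# Inspect the learned yes/no rules as text
print(export_text(tree, feature_names=["is_hungry", "has_food_at_home"]))
print(tree.predict([[1, 1]]))  # -> ['cook']
```

The same pattern scales from toy examples like this one to the industry applications mentioned above.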
As you can see, there are many uses for decision trees! Decision trees are straightforward to understand, yet excellent for complex datasets, which makes them a highly versatile tool. Let’s summarize: a decision tree is a flowchart-like algorithm that splits data from a root node, through internal decision and chance nodes, down to leaf nodes representing outcomes; pruning keeps the tree focused and helps avoid overfitting; and decision trees are widely used for classification, regression, and predictive analysis.
Now you’ve learned the basics, you’re ready to explore this versatile algorithm in more depth. We’ll cover some useful applications of decision trees in more detail in future posts. New to data analytics? Get a hands-on introduction to the field with this free data analytics short course, and check out some more tools of the trade.

Decision Tree Algorithms - Part 1

Decision trees are a non-parametric supervised learning approach that can be applied to both regression and classification problems. In keeping with the tree analogy, decision trees implement a sequential decision process. Starting from the root node, a feature is evaluated and one of the branches is selected; each node in the tree is basically a decision rule. This procedure is repeated until a final leaf is reached, which normally represents the prediction target. Decision trees are also attractive models if we care about interpretability. There are several algorithms for creating decision trees; scikit-learn uses an optimised version of the CART algorithm. In keeping with the tree analogy, much of the terminology is borrowed from trees [2].

Let’s consider an example where a decision tree is used to predict a baseball player’s salary. We use the Hitters data set to predict a player’s Salary (mean log salary) based on Years (the number of years he has played in the major leagues) and Hits (the number of hits he made in the previous year). Based on these features, the decision tree model learns a series of splitting rules, starting at the top of the tree (the root node). In this case, the decision tree segments the predictor space into three regions, and these regions determine the predicted salaries of baseball players; the region boundaries act as decision boundaries [3]. Each of the three regions can be written as a simple threshold rule on Years and Hits. The process by which a decision tree splits on features to form regions that predict a player’s salary will be explained in more detail in the next article (Decision Tree Algorithms - Part 2).

In order to split the nodes on the most informative features, we start at the tree root and split the data on the feature that results in the largest information gain (IG). The objective function is to maximize the information gain at each split, which we can define as:

IG(Dp, f) = I(Dp) - Σj (Nj / Np) · I(Dj)

where f is the feature to perform the split, Dp and Dj are the data sets of the parent and the j-th child node, I is our impurity measure, Np is the total number of samples at the parent node, and Nj is the number of samples in the j-th child node. As we can see, the information gain is simply the difference between the impurity of the parent node and the weighted sum of the child node impurities: the lower the impurity of the child nodes, the larger the information gain. However, for simplicity and to reduce the combinatorial search space, most libraries (including scikit-learn) implement binary decision trees. This means that each parent node is split into exactly two child nodes, D_left and D_right:

IG(Dp, f) = I(Dp) - (N_left / Np) · I(D_left) - (N_right / Np) · I(D_right)
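As an illustration of the information-gain calculation described above, here is a small sketch. The helper functions and the example split counts are hypothetical and are not taken from the article; the snippet simply shows how IG is computed for one binary split using Gini impurity.

```python
# Sketch: information gain for a binary split (hypothetical numbers, not from the article).
import numpy as np

def gini(p):
    """Gini impurity for a vector of class proportions p."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def information_gain(parent_counts, left_counts, right_counts, impurity=gini):
    """IG = I(parent) - (N_left/N_p) * I(left) - (N_right/N_p) * I(right)."""
    parent = np.asarray(parent_counts, dtype=float)
    left = np.asarray(left_counts, dtype=float)
    right = np.asarray(right_counts, dtype=float)
    n_p, n_l, n_r = parent.sum(), left.sum(), right.sum()
    return (impurity(parent / n_p)
            - (n_l / n_p) * impurity(left / n_l)
            - (n_r / n_p) * impurity(right / n_r))

# Hypothetical split: parent node has 40 samples of class 0 and 40 of class 1.
# The candidate feature sends [30, 10] to the left child and [10, 30] to the right.
print(information_gain([40, 40], [30, 10], [10, 30]))  # ~0.125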
5. Impurity Measure

Most implementations, including scikit-learn, build binary decision trees, and the three impurity measures (or splitting criteria) commonly used in binary decision trees are Gini impurity (I_G), entropy (I_H), and misclassification error (I_E) [4].

5.1 Gini Impurity

Used by the CART (classification and regression tree) algorithm for classification trees, Gini impurity is, according to Wikipedia [5], a measure of how often a randomly chosen element from the set would be incorrectly labeled if it were randomly labeled according to the distribution of labels in the subset. Mathematically, we can write Gini impurity as:

I_G = 1 - Σj (p_j)²

where j runs over the classes present in the node and p_j is the proportion of samples of class j in the node.

Let’s run a simple simulation with the Heart Disease data set, which has 303 rows and 13 attributes. The target consists of 138 values of 0 and 165 values of 1. In order to build a decision tree from this dataset and determine which split is best, we need a way to measure and compare the Gini impurity of each attribute: the attribute with the lowest weighted Gini impurity on the first iteration will become the root node. In this simulation, we only use the sex, fbs (fasting blood sugar), exang (exercise induced angina), and target attributes.

To measure the Gini impurity for the Sex attribute, we first compute the Gini impurity of the left node and of the right node. Having measured the Gini impurity for both leaf nodes, we calculate the total Gini impurity as a weighted average; the left node represents 138 patients while the right node represents 165 patients. The same procedure is repeated for the Fbs (fasting blood sugar) and Exang (exercise induced angina) attributes: compute the Gini impurity of each leaf node, then take the weighted total. Fbs (fasting blood sugar) has the lowest total Gini impurity, so we’ll use it as the root node.

5.2 Entropy

Used by the ID3, C4.5 and C5.0 tree-generation algorithms, information gain is based on the concept of entropy. The entropy measure is defined as [5]:

I_H = - Σj p_j · log2(p_j)

where j runs over the classes present in the node and p_j is the proportion of samples of class j in the node.

For the same case and the same data set, we need a way to measure and compare the entropy for each attribute; the attribute whose split yields the highest information gain (the largest drop in entropy) on the first iteration will become the root node, so we need to calculate the entropy of the Target attribute first. To measure entropy for the Sex attribute, we compute the entropy of the Sex = 0 node and the Sex = 1 node, then take the weighted average again to obtain the total entropy for Sex. The same procedure is repeated for the Fbs attribute (Fbs = 0 and Fbs = 1 nodes) and the Exang attribute (Exang = 0 and Exang = 1 nodes). Fbs (fasting blood sugar) gives the highest information gain, so we will use it as the root node: precisely the same result we got from Gini impurity.

5.3 Misclassification Impurity

Another impurity measure is the misclassification impurity. Mathematically, we can write the misclassification impurity as:

I_E = 1 - max_j(p_j)

In terms of quality of performance, this index is not the best choice because it is not particularly sensitive to different probability distributions (whereas Gini or entropy can more easily drive the selection towards a particular subdivision) [6].

6. Advantages and Disadvantages of Trees

Advantages
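To make the comparison of splitting criteria above concrete, here is a short sketch that computes the weighted Gini impurity, entropy, and misclassification error for one candidate binary split. The class counts used are hypothetical and are not the actual values from the Heart Disease simulation.

```python
# Sketch: comparing impurity measures for one binary split (hypothetical counts).
import numpy as np

def gini(p):
    return 1.0 - np.sum(np.asarray(p, dtype=float) ** 2)

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # skip empty classes to avoid log2(0)
    return -np.sum(p * np.log2(p))

def misclassification(p):
    return 1.0 - np.max(np.asarray(p, dtype=float))

def weighted_impurity(left_counts, right_counts, impurity):
    """Weighted average impurity of the two child nodes of a binary split."""
    left = np.asarray(left_counts, dtype=float)
    right = np.asarray(right_counts, dtype=float)
    n = left.sum() + right.sum()
    return ((left.sum() / n) * impurity(left / left.sum())
            + (right.sum() / n) * impurity(right / right.sum()))

# Hypothetical split of 303 patients by some attribute:
# the left child holds [100, 50] patients of class 0/1, the right child [38, 115].
left, right = [100, 50], [38, 115]
for name, fn in [("gini", gini), ("entropy", entropy), ("misclassification", misclassification)]:
    print(name, round(weighted_impurity(left, right, fn), 4))
```

The attribute whose split gives the lowest weighted impurity (equivalently, the highest information gain) would be chosen for the root node.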
Disadvantages
Continue Learning — Splitting Process in Decision Trees

*The scikit-learn package implements CART as its default decision tree algorithm.

About Me

I’m a Data Scientist focused on Machine Learning and Deep Learning. You can reach me on Medium and LinkedIn. My website: https://komuternak.com/

References