**Gradient Boosting to the Xtreme – Part I**

by **Rob Chin** | January 23, 2017

A key element of Rubikloud’s philosophy around software is that machine learning should be embedded into business software, not necessarily to replace human intuition, but rather, to augment and enhance it. The reasons for why we believe that to be true, you can refer to Kerry’s posts here:

HOW TO USE MACHINE LEARNING TO FURTHER RETAIL ANALYTIC CAPACITY

AN INSIDER’S VIEW OF RETAIL: PART ONE

From a data science perspective, the challenges that we face at Rubikloud on a day to day basis heavily involve modelling user behaviour. A simplified example as a retailer would be: If I gave an offer to a customer, what is the likelihood that the customer will enter the store? Use the offer? Buy a particular product? Typically, examples like this can be modelled as a classification problem where given attributes, or features, of a customer, predict a certain kind of behavior. There are a wide range of machine learning algorithms to solve this problem. In this blog series, I will aim to discuss one particular machine learning algorithm, gradient boosted trees, and in particular discuss an implementation called extreme gradient boost [1] or XGBoost [2], which has become a favourite among data science competitions.

The series will be split into 3 blog posts on our journey to understanding XGBoost with each component forming the basis of our understanding for the next topic, which will culminate in our final post on the XGBoost algorithm!

**A Decision Tree Classifier**

The decision tree classifier can be described as one of the simplest yet widely used techniques available to a data scientist when it comes to supervised learning problems. One of the reasons why decision trees are so popular is that they are both easy to understand and interpret. Let’s take a look at a simple example to see why:

A tree is a structure that recursively splits the data based on some criteria (the features). The data is typically in tabular data as shown in Table 1. Taking an example of physical attributes of people we can construct a simple dataset that easily lends itself to a decision tree approach, where the outcome we would like to predict is the gender of that particular person.

Intuitively, the decision tree simply takes the training data (the root of the tree), and iteratively splits it into two (or more) parts (branches of the tree) based on some attribute of the data such as larger or smaller than 150 lbs. For each split of the data, the tree recursively splits the data again until there is no possible split left (all the data points are identical) or some other stopping condition such as a maximum depth. The last level of the tree is called a leaf, and contains the subset of data points that are left over after each split. The final constructed tree is the trained model.

The choice of the sequence of questions to split the data is dictated by the improvement of the particular metric you are trying to optimize (such as a proxy for accuracy). There are many variations on how to train a decision tree based on concepts such as concept of information gain but the details beyond the scope of this blog. Instead, we’ll stick with a simple example to convey the intuition.

Once the tree is constructed, you can use it for prediction taking a data point and simply traversing the tree from the root, and following the branch that your data point lies in. The final data in the final leaf you end with then becomes your prediction.

Gender | Height (feet) | Weight (lbs) | Foot Size |
---|---|---|---|

male | 6’0 | 180 | 12 |

male | 5’11 | 190 | 11 |

male | 5’7 | 170 | 8 |

female | 5’5 | 130 | 4 |

female | 5’9 | 150 | 9 |

male | 5’11 | 165 | 10 |

female | 5’10 | 180 | 9 |

*Table 1: An example of tabular data format showing physical attributes of several people.*

The columns of data themselves should ideally be uncorrelated i.e. one column varies independently of the other columns e.g. weight and hair colour.

**Figure 1: An example of a decision tree model where each layer corresponds to a column (feature).**

As a simple numerical example, we can create a decision tree shown in Figure 1 based on the data in Table 1. The tree would typically be more balanced than the one in Figure 1, but will do for now as a conceptual platform.

The decision tree in Figure 1 starts off at the root (at the top) with all the data present, from which we ask a set of questions i.e. Is the height greater than 5 ft 11 inches?, in the hope that we can improve our confidence of what the resulting outcome will be. The branching questions we ask of the data, reduces the number of data points at each stage, until we reach the leaf node at which point we have reduced the data to the maximal depth we want, or there is no more information to be gained by splitting the data further (reach a point where the data points at that node are either all female or all male). Each branching question has two answers (Yes/No) for Figure 1 the left edge corresponds to a Yes answer, whilst the other edge denotes a “No” answer.

With the trained tree from Figure 1 we can try it out with some new data, let’s assume we have a person with the following attributes in Table 2

Gender | Height (feet) | Weight (lbs) | Foot Size |
---|---|---|---|

x | 5’7 | 190 | 10 |

*Table 2: An example of a new person set of attributes to predict the gender from.*

When trying to classify a new set of data with the Decision tree we always start at the root node and step through the tree until we reach a leaf node with the prediction.

**Figure 2: Showing the process of classification with a trained decision tree**

Figure 2 highlights how we step through a decision tree to obtain a decision. The blue lines is the decision outcome at each branching question. Following the data through we can see that resultant leaf node is a pure one, which makes the final decision process easy, and we can predict that this “new” person is Male.

Increasing the number of dimensions of the data will lead to an increase in the depth of the tree (how many splits we have to make). Although superficially it seems like a good thing, growing the depth of a tree can lead to trees that predict data that it has already seen really well (i.e. memorizing the training data), but poor prediction on unseen data (i.e. poor generalization and predictions for new data points). The result is that predicting a new data point prediction could fail miserably. This type of failure is usually related to a problem called *overfitting* to the data — a recurring problem of every data scientist.

Another problem that could arise with a single decision tree is the equal data scenario. To show this case, let’s change the foot size from 10 to 9 from our example “new” person and we follow the tree through for this modified “new” person we find that the leaf node is now given by (male = 1, female = 1) how can we decide which one to go with? We can not safely predict an outcome based purely on this node anymore as each decision is just as likely to be true based on data. This is a problem.

A Decision tree algorithm can utilise a technique called pruning which aims to combat overfitting by reducing trees/removing branching question nodes, from either a node down or a leaf up approach. Such pruning techniques are based on heuristics, with the simplest version uses the most popular class as a substitute at that branching question node. By checking the overall accuracy of the model for any degradation in the predictions, we will decide if we should keep the change.

In the next post in this series of blog posts, we’ll move onto another simple method, beyond the pruning of data, to reduce overfitting using random forests.

If the world of applied machine learning gets you excited, Rubikloud is always looking for ambitious and curious data scientists. So please reach out: http://rubikloud.wpengine.com/careers/

**references**

[1] Tianqi Chen and Carlos Guestrin. XGBoost: A Scalable Tree Boosting System. http://dmlc.cs.washington.edu/data/pdf/XGBoostArxiv.pdf

[2] Jerome H. Friedman Greedy Function Approximation: A Gradient Boosting Machine. IMS 1999 Reitz Lecture 4