Hello, world! We are Alexander Hinton and Chris Dulhanty, two University of Guelph students who have recently gone down the machine learning rabbit hole through enrollment in Professor Graham Taylor’s Introduction to Machine Learning class at the U of G last fall. The course provided us with a fantastic overview of the world of machine learning (ML) algorithms and techniques. In addition, we had the opportunity to get our hands dirty in a real-world machine learning problem via a unique academic-industry partnership, fostered between our professor and Rubikloud. The final project for our course was a retail prediction problem, made possible through access to an anonymized, proprietary dataset of transactional data, from a major health and beauty retailer, hosted by Rubikloud on the RubiOne platform. This blog documents our experience with the project: the research problem, our approach and results, and lessons learned along the way!
Rubikloud provided a rich dataset of anonymized transactional data for one million customers enrolled in a major health and beauty retailer’s loyalty program for the time period January 1st, 2015 to December 31st, 2016. Data on 18,198,302 customer purchases was provided; each entry in the dataset represented a single product purchased within an order from a customer, along with the date, quantity, price and contents of the transaction. Product data included price, brand, and four levels of hierarchical categorization, and customer data included the date of registration in the retailer’s loyalty program. Five top brands at the retailer were identified that represented 25.3% of all transactions for the one million customer cohort.
The problem was defined as a customer-level purchase prediction problem, that is, to predict which customers were most likely to buy each of the five top brands in the 30-day period following the last date in the dataset. The output of our model would ultimately be a score between 0 and 1 for each customer, for each of the five top brands, representing the likelihood of a purchase.
A baseline model was created by Rubikloud Chief Data Scientist Brian Keng and made available to the class. On November 9, 2017, Rubikloud Chief Engineer Adrian Petrescu made his way down the 401 to stop in at our class, provide a demo of the RubiOne platform and walk us through the baseline model. The model was based on the popular recency, frequency, monetary (RFM) method of assessing customer value and engagement. Customers were scored in these three areas by aggregating their purchasing habits over the two-year period, and were binned into groups from one to ten on each of these three characteristics. Predictions were then made by summing an individual customer’s scores and dividing by 30.
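The scoring step of such a baseline can be sketched in a few lines. The customer IDs and bin values below are made up for illustration; the only part taken from the description above is the final step of summing the three 1–10 bin scores and dividing by 30:

```python
# Toy RFM bin scores (1-10 on each axis) for three hypothetical customers.
customers = {
    "cust_a": {"r": 9, "f": 7, "m": 8},    # recent, frequent, high spend
    "cust_b": {"r": 3, "f": 2, "m": 4},    # lapsed, infrequent, low spend
    "cust_c": {"r": 10, "f": 10, "m": 10}, # best possible on every axis
}

def rfm_score(bins):
    """Baseline prediction: sum of the three 1-10 bin scores, scaled into (0, 1]."""
    return (bins["r"] + bins["f"] + bins["m"]) / 30.0

for cid, bins in customers.items():
    print(cid, rfm_score(bins))
```

A customer binned at 10 on all three axes scores 1.0; one binned at 1 on all three scores 0.1, so every customer receives a nonzero purchase score.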
Due to the class imbalance present in the dataset, classification accuracy would not have been a suitable metric for evaluating predictions, as the naive strategy of predicting that no one would purchase anything would achieve accuracy in excess of 98%! Therefore, the area under the receiver operating characteristic (ROC) curve was selected by Rubikloud as the metric to compare models; the ROC curve plots the true positive rate against the false positive rate across classification thresholds. While the baseline RFM model was very simple in nature, it provided a reasonably high mean ROC of 0.7240 across the five brands. Rubikloud’s challenge for our class was to use ML techniques to construct a model which would provide a better mean ROC than their baseline model… which was no easy task, as their simple model had already exceeded the expectations of their own Chief Data Scientist! Clearly, we would need a plan.
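The area under the ROC curve has a handy interpretation: it is the probability that a randomly chosen buyer is scored above a randomly chosen non-buyer. A small pure-Python sketch of that rank-based formulation (the labels and scores below are made-up examples, not project data):

```python
def auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) formulation:
    the probability that a random positive is scored above a random negative."""
    pairs = sorted(zip(scores, labels))
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1                       # group tied scores together
        avg_rank = (i + 1 + j) / 2.0     # average of ranks i+1 .. j
        for k in range(i, j):
            if pairs[k][1] == 1:
                rank_sum_pos += avg_rank
        i = j
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

A score of 0.5 is chance-level ranking, and 1.0 means every buyer was scored above every non-buyer, which is why it is robust to the 98/2 class imbalance that breaks plain accuracy.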
Making a Plan
After some preliminary research into the field of marketing, it was apparent that the RFM features extracted in the baseline model are widely used by retailers to segment their customers. While the technique of binning customers, making predictions and using these predictions to make business decisions is common and effective in the marketing world, it does not have any inherent learning aspect to it, and thus was ripe for improvement with ML.
With this in mind, our plan of action became clear – we would apply machine learning techniques to the RFM model, add a couple of simple engineered features from the data, and create a retail prediction model that retailers could implement easily, with limited overhead in further data processing.
We compared three different ML models in our quest for the best: logistic regression (LR), a deep neural network (DNN) and a long short-term memory (LSTM) network. LR is the workhorse of classification algorithms in ML, relatively simple and very fast to implement. In our review of related literature, the algorithm provided strong results on similar problems, so we were curious whether the KISS (keep it simple, stupid) methodology would be the way to go. DNNs extend LR to nonlinear decision boundaries through a hierarchy of progressively richer layers of features. Having a model that could capture nonlinear patterns was important, since we had the intuition that purchasing habits can be both predictable and sporadic. The LSTM network was our sexiest model, and also the most specific to the problem at hand. An LSTM network is a type of recurrent neural network (RNN) that carries information from previous time steps forward into future predictions – well suited to predicting future purchase habits from historical data.
Four features were extracted from the transaction data for each of the five brands, for each customer in the time-period January 1st, 2015 to December 31st, 2016:
- Recency: number of days from the final day in the time-period to the last purchase of the customer
- Frequency: count of unique transactions a customer made in the time-period
- Monetary: total spend on products purchased in the time-period
- Amount: count of total products a customer purchased in the time-period
One additional feature was also extracted for each customer – the number of days since registration, which we call Duration. Each customer was therefore represented by four features for each of the five brands, plus one additional feature, for a total of 21. We refer to these features as RFMAD.
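The per-brand R, F, M and A features above can be sketched with a pandas groupby. The table below is a fabricated stand-in for the real data, and the column names (`customer_id`, `order_id`, `date`, `quantity`, `price`) are our assumptions, not the dataset’s actual schema:

```python
import pandas as pd

# Hypothetical slice of the transaction table for a single brand.
tx = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "order_id":    [10, 10, 11, 12, 13],
    "date": pd.to_datetime(["2016-05-01", "2016-05-01", "2016-11-20",
                            "2016-03-03", "2016-12-30"]),
    "quantity": [2, 1, 1, 3, 1],
    "price": [4.0, 9.5, 9.5, 3.0, 12.0],
})
end = pd.Timestamp("2016-12-31")  # final day of the two-year window

rfma = tx.assign(spend=tx["quantity"] * tx["price"]).groupby("customer_id").agg(
    recency=("date", lambda d: (end - d.max()).days),  # days since last purchase
    frequency=("order_id", "nunique"),                 # unique transactions
    monetary=("spend", "sum"),                         # total spend
    amount=("quantity", "sum"),                        # total products bought
)
print(rfma)
```

Duration would be computed the same way from each customer’s registration date, and repeating the aggregation per brand yields the 21-column RFMAD matrix.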
Experiments and Results
The hyperparameters of our models were tuned using five-fold cross-validation. For the LR model, regularization strength was chosen via a grid search. For the DNN and LSTM, the learning rate, number of hidden units and number of hidden layers were selected by an initial coarse random search, followed by a finer Bayesian optimization using the SigOpt API. Many hours were spent tuning these hyperparameters. As the progress bar slowly inched forward, we contemplated the big questions of life… “Who am I”, “What is my purpose”, and most notably, “Can I afford a new GPU next semester?”.
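For the LR case, the cross-validated grid search over regularization strength can be sketched with scikit-learn. The synthetic data below merely mimics the shape of the problem (21 features, heavy class imbalance); the `C` grid is illustrative, not the values we actually searched:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the RFMAD feature matrix: 21 features, ~10% positives.
X, y = make_classification(n_samples=500, n_features=21,
                           weights=[0.9, 0.1], random_state=0)

# Five-fold CV over the inverse regularization strength C, scored by ROC AUC.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="roc_auc",
    cv=5,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 4))
```

The DNN and LSTM searches followed the same cross-validation pattern, just with more dimensions (learning rate, hidden units, hidden layers) and the coarse-random-then-Bayesian strategy described above.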
In the LR and DNN models, features were calculated annually; for the LSTM, monthly. Features were fed into each model as inputs, with the target output being a multi-label prediction of the five brand-level, binary labels of purchases in the subsequent month.
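The LSTM wiring can be sketched in PyTorch. This is a minimal illustration of the sequence-in, five-probabilities-out shape of the problem; the layer sizes, sequence length, and class name are our own placeholders, not the tuned architecture from the project:

```python
import torch
import torch.nn as nn

class PurchaseLSTM(nn.Module):
    """Monthly RFMAD feature sequences in, five per-brand purchase probabilities out."""
    def __init__(self, n_features=21, hidden=64, n_brands=5):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_brands)

    def forward(self, x):                    # x: (batch, months, features)
        out, _ = self.lstm(x)
        logits = self.head(out[:, -1, :])    # last time step summarizes the sequence
        return torch.sigmoid(logits)         # multi-label: independent probabilities

model = PurchaseLSTM()
batch = torch.randn(8, 24, 21)  # 8 customers, 24 monthly RFMAD feature vectors
probs = model(batch)
print(probs.shape)              # one probability per customer per brand
```

A sigmoid output per brand (rather than a softmax across brands) reflects the multi-label setup: a customer can buy any subset of the five brands in the following month.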
The benchmark RFM model provided by Rubikloud achieved a mean ROC of 0.7240 across the five brands. Our RFMAD features, combined with the ML algorithms, were able to significantly improve on this performance. The optimized LR achieved a mean ROC of 0.7522, the DNN achieved a mean ROC of 0.7528, and the LSTM was the top performer, achieving a mean ROC of 0.7563.
Assuming the distributions of ROC scores from the cross-validation procedure were approximately normal for each individual model, we conducted t-tests to determine whether differences in mean ROC across models were significant. All models were found to perform significantly better than the baseline RFM model, with the LSTM model significantly better than the LR and DNN at the 10% significance level.
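The comparison amounts to a two-sample t-test on the per-fold scores of each pair of models. The fold scores below are fabricated for illustration (the real per-fold numbers were not published in this post):

```python
from scipy import stats

# Illustrative per-fold mean ROC scores from five-fold CV (not the real values).
lr_scores   = [0.751, 0.753, 0.749, 0.754, 0.752]
lstm_scores = [0.755, 0.757, 0.754, 0.758, 0.756]

# Two-sample t-test on fold-level scores; the difference is significant
# at the 10% level if p < 0.10.
t, p = stats.ttest_ind(lstm_scores, lr_scores)
print(f"t = {t:.3f}, p = {p:.4f}")
```

With only five folds per model the test has little power, which is one reason differences this small clear the 10% level but not necessarily stricter thresholds.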
Although the evidence was not striking, the LSTM model was the best classifier of the three. This was not surprising, given the temporal nature of the dataset and the proven success of LSTM models in many applications with temporal components. Retailers who are already tracking customer segmentation variables such as RFM can make demonstrable improvements in their predictive power by including machine learning models in their retail purchase forecasts. For optimal results, an LSTM network is recommended. However, given the large difference in total training time between the LSTM network and the LR classifier, and the only marginal improvement in performance, it would be very reasonable to opt for a logistic regression classifier instead.
Having the opportunity to work on a real-world problem and dataset was an invaluable experience. In much of our earlier coursework, datasets were small, assignments more defined, and guidance was provided all along the way. The path taken on this project was much more of our own vision – we were provided a large dataset and a problem to approach, but the freedom to attack it however we saw fit. We experienced much more of the data science pipeline than we had in the classroom. We wrestled with Pandas, spent hours whiteboarding our ideas, tinkered with code, read man page after man page, and after hours of work thought “oh! But what if we included xyz” … and we did it all over again!
We are grateful to Rubikloud for access to their platform and their data, and allowing our class to have this experience. We hope it is just the first of many projects for us in the world of Data Science!