Preventing Customer Churn, Part 2. Building the ML Model


Table of contents

  1. Background
  2. Setup
  3. Data Exploration
  4. Model Training
  5. Feature Importance
  6. Model Hosting
  7. Model Evaluation
  8. Updated model
  9. Summary

Background

_This notebook has been adapted from Amazon SageMaker Examples - Customer Churn. That notebook was in turn adapted from an AWS blog post and AWS notebook._

In this version of the notebook, in addition to building the predictive model, we explore two questions: which factors affect churn, and how do we create a targeted incentive that we (as the provider) think is most likely to reduce churn, at minimum cost to us? As another enhancement, we use Amazon SageMaker Hyperparameter Tuning to find better tuning parameters for our XGBoost model.

Losing customers is costly for any business. Identifying unhappy customers early on gives the business a chance to offer them incentives to stay. This notebook describes using machine learning (ML) for the automated identification of unhappy customers, also known as customer churn prediction. ML models rarely give perfect predictions though, so this notebook is also about how to incorporate the relative costs of prediction mistakes when determining the financial outcome of using ML.

We use an example of churn that is familiar to all of us – leaving a mobile phone operator. Seems like I can always find fault with my provider du jour! And if my provider knows that I’m thinking of leaving, it can offer timely incentives – I can always use a phone upgrade or perhaps have a new feature activated – and I might just stick around. Incentives are often much more cost effective than losing and reacquiring a customer.


Setup

This notebook was created and tested on an ml.m4.xlarge notebook instance.

Import the libraries:

In [1]:
import os
import sys
import boto3
import re
import sagemaker

# To get the container for training
from sagemaker.amazon.amazon_estimator import get_image_uri
# To run predictions against the model 
from sagemaker.predictor import csv_serializer
from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner

# Data manipulations:
import pandas as pd
import numpy as np

# Data Plotting
import matplotlib.pyplot as plt
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

%matplotlib inline

# To un-pickle the model after training
import pickle as pkl

# To work with trees from xgboost
import json

# To mix code and markdown
from IPython.display import Markdown

For training we use a container with XGBoost version 0.90-1. To successfully un-pickle the model artifact later, we need to install the matching version of the xgboost library, because pip installs the newest version of the module by default.

In [2]:
# upgrade pip
!{sys.executable} -m pip install --upgrade pip
# as we are training the model using version 0.90-1 of the container 
# we need to install the version of xgboost that matches even if it is not the latest one.
!{sys.executable} -m pip install xgboost==0.90
!{sys.executable} -m pip install graphviz
# graphviz for plotting an interactive graph of relevant features
import graphviz
Requirement already up-to-date: pip in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (20.0.2)
Requirement already satisfied: xgboost==0.90 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (0.90)
Requirement already satisfied: scipy in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from xgboost==0.90) (1.1.0)
Requirement already satisfied: numpy in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from xgboost==0.90) (1.14.3)
Requirement already satisfied: graphviz in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (0.13.2)
In [3]:
import xgboost
# verify module version
print(f"Installed version of xgboost library is {xgboost.__version__}")
# for creating a feature tree
from xgboost import plot_tree
# for calculating feature importance
from xgboost import plot_importance
Installed version of xgboost library is 0.90

For this use case, the S3 bucket and appropriate IAM roles were created for you when the AWS CloudFormation template was launched. The bucket name was saved in a parameter file called "cloudformation_values.py" during creation of the notebook instance, along with the DB secret name and ML endpoint name.

In [4]:
import cloudformation_values as cfvalues
bucket = cfvalues.S3BUCKET
endpoint_name = cfvalues.ENDPOINT
# AWS Secrets stores our database credentials. 
db_secret_name = cfvalues.DBSECRET
In [5]:
accountid = boto3.client('sts').get_caller_identity()['Account']
role = sagemaker.get_execution_role()
sess = sagemaker.Session()

# provide a prefix to be attached to the output files in the bucket
prefix = 'sagemaker/xgboost-churn'
In [6]:
# import custom functions for data and feature exploration
import utilities

Global variables:

In [7]:
# from ColorBrewer
plot_color = "#4daf4a"

Data Exploration

Mobile operators have historical records on which customers ultimately ended up churning and which continued using the service. We can use this historical information to construct an ML model of one mobile operator’s churn using a process called training. After training the model, we can pass the profile information of an arbitrary customer (the same profile information that we used to train the model) to the model, and have the model predict whether this customer is going to churn. Of course, we expect the model to make mistakes–after all, predicting the future is tricky business! But I’ll also show how to deal with prediction errors.

The dataset we use is publicly available and was mentioned in the book Discovering Knowledge in Data by Daniel T. Larose. It is attributed by the author to the University of California Irvine Repository of Machine Learning Datasets.

Loading the data from Amazon S3

In the Part 1 notebook for this use case we demonstrated unloading the production data from Amazon Aurora to an S3 bucket. Here we will read the unloaded data from the S3 bucket into a Pandas dataframe.

In this case, there's a small amount of data and it was unloaded as a single file with a header, so the column names exist as part of the data. We'll explore this data and split it into test, train and validation sets locally.

With large amounts of data, Amazon Aurora creates multiple files. In that case, we could explore just one part to get a basic understanding of the data, and use the multiple part files as a 'natural' split into training, validation, and test sets. Those files would likely not include the column headers, so we would need to add them manually, as shown by the two commented-out lines below.
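
If the data had been unloaded as multiple part files, one way to read and combine them might look like the sketch below. This is illustrative only and not used in the rest of this notebook; it assumes the column-name list churn_cols from the commented-out lines in the next cell.

# Sketch (not used in this notebook): read every Aurora part file under a prefix and
# concatenate the parts into a single dataframe.
s3 = boto3.client('s3')
part_keys = [obj['Key'] for obj in
             s3.list_objects_v2(Bucket = bucket, Prefix = prefix + '/aurora/')['Contents']]
all_parts = [pd.read_csv('s3://{}/{}'.format(bucket, key), header = None, names = churn_cols)
             for key in part_keys]
churn_from_parts = pd.concat(all_parts, ignore_index = True)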

In [8]:
data_key = 'aurora/churn_data.part_00000'

data_location = 's3://{}/{}'.format(bucket, prefix + '/' + data_key)

# If we needed to add column names manually, we'd need to get the column names from another metadata 
# source (or from our friendly DBA).
# churn_cols = ['state','acc_length','area_code','phone','int_plan','vmail_plan','vmail_msg',
#  'day_mins','day_calls','day_charge','eve_mins','eve_calls','eve_charge','night_mins',
#  'night_calls','night_charge','int_mins','int_calls','int_charge','cust_service_calls','churn']
# Use this format if the names were passed manually.
# churn = pd.read_csv(data_location, index_col = 0, header = None, names = churn_cols) 
churn = pd.read_csv(data_location)

Preview the first few rows:

In [9]:
pd.set_option('display.max_columns', 25)
churn.head()
Out[9]:
state acc_length area_code phone int_plan vmail_plan vmail_msg day_mins day_calls day_charge eve_mins eve_calls eve_charge night_mins night_calls night_charge int_mins int_calls int_charge cust_service_calls churn
0 KS 128 415 382-4657 no yes 25 265.1 110 45.07 197.4 99 16.78 244.7 91 11.01 10.0 3 2.70 1 False.
1 OH 107 415 371-7191 no yes 26 161.6 123 27.47 195.5 103 16.62 254.4 103 11.45 13.7 3 3.70 1 False.
2 NJ 137 415 358-1921 no no 0 243.4 114 41.38 121.2 110 10.30 162.6 104 7.32 12.2 5 3.29 0 False.
3 OH 84 408 375-9999 yes no 0 299.4 71 50.90 61.9 88 5.26 196.9 89 8.86 6.6 7 1.78 2 False.
4 OK 75 415 330-6626 yes no 0 166.7 113 28.34 148.3 122 12.61 186.9 121 8.41 10.1 3 2.73 3 False.

By modern standards, it’s a relatively small dataset, with only 3,333 records, where each record uses 21 attributes to describe the profile of a customer of an unknown US mobile operator. The attributes are:

  • state: the US state in which the customer resides, indicated by a two-letter abbreviation; for example, OH or NJ
  • acc_length: the number of days that this account has been active
  • area_code: the three-digit area code of the corresponding customer’s phone number
  • phone: the remaining seven-digit phone number
  • int_plan: whether the customer has an international calling plan: yes/no
  • vmail_plan: whether the customer has a voice mail feature: yes/no
  • vmail_msg: presumably the average number of voice mail messages per month
  • day_mins: the total number of calling minutes used during the day
  • day_calls: the total number of calls placed during the day
  • day_charge: the billed cost of daytime calls
  • eve_mins, eve_calls, eve_charge: the minutes, number of calls, and billed cost for calls placed during the evening
  • night_mins, night_calls, night_charge: the minutes, number of calls, and billed cost for calls placed during the night
  • int_mins, int_calls, int_charge: the minutes, number of calls, and billed cost for international calls
  • cust_service_calls: the number of calls placed to Customer Service
  • churn: whether the customer left the service: true/false

The last attribute, churn, is known as the target attribute–the attribute that we want the ML model to predict. Because the target attribute is binary, our model will be performing binary prediction, also known as binary classification.

Exploring the Data

Let's begin exploring the data.

_This section is identical to the original notebook, Amazon SageMaker Examples - Customer Churn. While data exploration is an important topic, it's not the focus of this walkthrough. Therefore it's been removed in the interests of brevity, but the actions taken based on the analysis have been kept (i.e., columns kept/removed). Please refer to the original notebook for this section._

In [10]:
churn = churn.drop('phone', axis = 1)
churn['area_code'] = churn['area_code'].astype(object)

Let's remove one feature from each of the highly correlated pairs:

  1. day_charge from the pair with day_mins;
  2. night_charge from the pair with night_mins;
  3. eve_charge from the pair with eve_mins;
  4. int_mins from the pair with int_charge.
In [11]:
churn = churn.drop(['day_charge', 'eve_charge', 'night_charge', 'int_mins'], axis = 1)

Now that we've cleaned up our dataset, let's determine which algorithm to use. As mentioned above, there appear to be some variables where both high and low (but not intermediate) values are predictive of churn. In order to accommodate this in an algorithm like linear regression, we'd need to generate polynomial (or bucketed) terms. Instead, let's attempt to model this problem using gradient boosted trees. Amazon SageMaker provides an XGBoost container that we can use to train in a managed, distributed setting, and then host as a real-time prediction endpoint. XGBoost uses gradient boosted trees which naturally account for non-linear relationships between features and the target variable, as well as accommodating complex interactions between features.

Amazon SageMaker XGBoost can train on data in either CSV or LibSVM format. For this example, we'll stick with CSV. According to the documentation, the data should:

  • Have the predictor variable in the first column
  • Not have a header row

Since the trained model will not retain the feature names, when we explore the results later we will need to assign them from the original data (excluding the target variable).

But first, let's convert our categorical features into numeric features, since the algorithm accepts only numeric features. Then, we place the outcome as the first column.

In [12]:
model_data = pd.get_dummies(churn)
model_data = pd.concat([model_data['churn_True.'], 
                        model_data.drop(['churn_False.', 'churn_True.'], axis = 1)], axis = 1)

And now let's split the data into training, validation, and test sets. This will help prevent us from overfitting the model, and allow us to test the model's accuracy on data it hasn't already seen.

Note that different splits of the data may create slightly different results. In addition, on different runs against the same data, XGBoost may choose different combinations of features and trees that give similar model performance.

In [13]:
train_data, validation_data, test_data = np.split(model_data.sample(frac = 1, random_state = 1729), 
                                                  [int(0.7 * len(model_data)), int(0.9 * len(model_data))])
train_data.to_csv('train.csv', header = False, index = False)
validation_data.to_csv('validation.csv', header = False, index = False)
test_data.to_csv('test.csv', header = False, index = False)
In [14]:
train_data.head()
Out[14]:
churn_True. acc_length vmail_msg day_mins day_calls eve_mins eve_calls night_mins night_calls int_calls int_charge cust_service_calls ... state_VT state_WA state_WI state_WV state_WY area_code_408 area_code_415 area_code_510 int_plan_no int_plan_yes vmail_plan_no vmail_plan_yes
1095 0 106 0 274.4 120 198.6 82 160.8 62 3 1.62 1 ... 0 0 0 0 0 0 0 1 1 0 1 0
608 0 28 0 187.8 94 248.6 86 208.8 124 5 2.86 0 ... 0 0 0 0 1 0 1 0 1 0 1 0
2908 1 148 0 279.3 104 201.6 87 280.8 99 2 2.13 2 ... 0 0 0 0 0 0 1 0 1 0 1 0
943 0 132 0 191.9 107 206.9 127 272.0 88 2 3.40 1 ... 0 0 0 0 0 0 0 1 1 0 1 0
693 0 92 29 155.4 110 188.5 104 254.9 118 4 2.16 3 ... 0 0 0 0 0 0 0 1 1 0 0 1

5 rows × 70 columns

Now we'll upload these files to S3.

In [15]:
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train/train.csv')).upload_file('train.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'validation/validation.csv')).upload_file('validation.csv')

Model Training with XGBoost using HPO

Now we'll train our XGBoost model, using HPO (hyperparameter optimization) to explore a number of hyperparameters, looking for the best model.

First we'll need to specify the location of the XGBoost algorithm container. XGBoost can be used in two ways: either as a built-in algorithm or as a framework (see the documentation). Since we are using the built-in container, it is important to install the corresponding version of the xgboost library for the analyses below (see the Feature Importance section). At the time of writing this notebook, two versions of the container were available: 0.90-1 and 0.90-2. All training here was performed with version 0.90-1.

Because we're always cautious and curious, we'll check the sizes of the data splits, to make sure we have a meaningful number of records in each.

In [16]:
print('Model data shape', model_data.shape)
print('Train data:', train_data.shape)
print('Validation data:', validation_data.shape)
print('Test data:', test_data.shape)
Model data shape (3333, 70)
Train data: (2333, 70)
Validation data: (666, 70)
Test data: (334, 70)
In [17]:
container = get_image_uri(boto3.Session().region_name, 'xgboost', repo_version = '0.90-1')
print(container)
246618743249.dkr.ecr.us-west-2.amazonaws.com/sagemaker-xgboost:0.90-1-cpu-py3

Then, because we're training with the CSV file format, we'll create s3_inputs that our training function can use as a pointer to the files in S3.

In [18]:
s3_input_train = sagemaker.s3_input(s3_data = 's3://{}/{}/train'.format(bucket, prefix),
                               content_type = 'csv')
s3_input_validation = sagemaker.s3_input(s3_data = 's3://{}/{}/validation/'.format(bucket, prefix),
                                    content_type = 'csv')
In [19]:
print('s3://{}/{}/train'.format(bucket, prefix))
print('s3://{}/{}/validation/'.format(bucket, prefix))
s3://111111111111-s3-bucket-for-model-training/sagemaker/xgboost-churn/train
s3://111111111111-s3-bucket-for-model-training/sagemaker/xgboost-churn/validation/

Now, we can specify a few parameters like what type of training instances we'd like to use and how many, as well as our XGBoost hyperparameters. A few key hyperparameters are:

  • max_depth controls how deep each tree within the algorithm can be built. Deeper trees can lead to better fit, but are more computationally expensive and can lead to overfitting. There is typically some trade-off in model performance that needs to be explored between a large number of shallow trees and a smaller number of deeper trees.
  • subsample controls sampling of the training data. This technique can help reduce overfitting, but setting it too low can also starve the model of data.
  • num_round controls the number of boosting rounds. This essentially determines how many subsequent models are trained using the residuals of previous iterations. More rounds should produce a better fit on the training data, but can be computationally expensive or lead to overfitting.
  • eta controls how aggressive each round of boosting is, by scaling (shrinking) the contribution of each new tree. Smaller values lead to more conservative boosting.
  • gamma controls how aggressively trees are grown. Larger values lead to more conservative models.

More detail on XGBoost's hyperparameters can be found in the Amazon SageMaker Developer Guide.

Ideally, a data scientist would perform a k-fold cross-validation procedure to identify the set of parameters that produces the best-performing model, but for brevity we are not doing that here.
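
For reference, a minimal local sketch of such a cross-validation, using the xgboost library installed earlier rather than the SageMaker container, might look like the following (the parameter values simply mirror the initial hyperparameters set below and are illustrative, not tuned):

# Sketch: local 5-fold cross-validation with the xgboost library; not part of the
# SageMaker training flow used in this notebook.
dtrain = xgboost.DMatrix(train_data.iloc[:, 1:].values, label = train_data.iloc[:, 0].values)
cv_results = xgboost.cv(params = {'objective': 'binary:logistic', 'max_depth': 5, 'eta': 0.2,
                                  'gamma': 4, 'min_child_weight': 6, 'subsample': 0.8,
                                  'eval_metric': 'auc'},
                        dtrain = dtrain,
                        num_boost_round = 100,
                        nfold = 5,
                        early_stopping_rounds = 10,
                        seed = 1729)
# Mean and standard deviation of train and test AUC for the last boosting round kept.
print(cv_results.tail(1))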

We set initial hyperparameters, then define ranges of hyperparameters that we'd like to explore. The tuning parameters and the XGBoost hyperparameters are both described further in the SageMaker documentation.

In [20]:
sess = sagemaker.Session()

xgb = sagemaker.estimator.Estimator(container,
                               role,
                               train_instance_count=1,
                               train_instance_type = 'ml.m4.xlarge',
                               output_path = 's3://{}/{}/output'.format(bucket, prefix),
                               sagemaker_session = sess)
xgb.set_hyperparameters(max_depth = 5,
                        eta = 0.2,
                        gamma = 4,
                        min_child_weight = 6,
                        subsample = 0.8,    
                        silent = 0,
                        objective = 'binary:logistic',
                        num_round = 100)
In [21]:
hyperparameter_ranges = {'eta': ContinuousParameter(0, 1),
                        'min_child_weight': ContinuousParameter(1, 10),
                        'alpha': ContinuousParameter(0, 2),
                        'max_depth': IntegerParameter(1, 10),
                        'num_round': IntegerParameter(10,100)}

# The built-in XGBoost algorithm emits two predefined metrics: validation:auc and train:auc
objective_metric_name = 'validation:auc'       

tuner = HyperparameterTuner(xgb,
                            objective_metric_name,
                            hyperparameter_ranges,
                            max_jobs = 30,
                            max_parallel_jobs = 3,
                            early_stopping_type = 'Auto')
In [22]:
tuner.fit({'train': s3_input_train, 'validation': s3_input_validation}, include_cls_metadata=False)
In [23]:
tuning_job_name = tuner.latest_tuning_job.job_name
tuning_job_name
Out[23]:
'sagemaker-xgboost-200416-0049'

The next cell waits while the set of HPO jobs runs; this takes approximately 30 minutes.

In [24]:
tuner.wait()
.......................................................................................................................................................................................................................................................................................................................................................................................................................!
In [26]:
# run this cell to check current status of hyperparameter tuning job
region = boto3.Session().region_name
sm_client = boto3.Session().client('sagemaker')

tuning_job_result = sm_client.describe_hyper_parameter_tuning_job(HyperParameterTuningJobName = tuning_job_name)

status = tuning_job_result['HyperParameterTuningJobStatus']
if status != 'Completed':
    print('Reminder: the tuning job has not been completed.')
    
job_count = tuning_job_result['TrainingJobStatusCounters']['Completed']
print(f"{job_count} training jobs have completed")
    
is_minimize = (tuning_job_result['HyperParameterTuningJobConfig']['HyperParameterTuningJobObjective']['Type'] != 'Maximize')
objective_name = tuning_job_result['HyperParameterTuningJobConfig']['HyperParameterTuningJobObjective']['MetricName']
28 training jobs have completed

Let's take a look at the hyperparameters used and the metrics returned by the best training job.

In [27]:
from pprint import pprint
if tuning_job_result.get('BestTrainingJob',None):
    print("Best model found so far:")
    pprint(tuning_job_result['BestTrainingJob'])
else:
    print("No training jobs have reported results yet.")
Best model found so far:
{'CreationTime': datetime.datetime(2020, 4, 16, 0, 56, 46, tzinfo=tzlocal()),
 'FinalHyperParameterTuningJobObjectiveMetric': {'MetricName': 'validation:auc',
                                                 'Value': 0.8895599842071533},
 'ObjectiveStatus': 'Succeeded',
 'TrainingEndTime': datetime.datetime(2020, 4, 16, 0, 59, 46, tzinfo=tzlocal()),
 'TrainingJobArn': 'arn:aws:sagemaker:us-west-2:111111111111:training-job/sagemaker-xgboost-200416-0049-009-635392a2',
 'TrainingJobName': 'sagemaker-xgboost-200416-0049-009-635392a2',
 'TrainingJobStatus': 'Completed',
 'TrainingStartTime': datetime.datetime(2020, 4, 16, 0, 58, 41, tzinfo=tzlocal()),
 'TunedHyperParameters': {'alpha': '0.07181080493333837',
                          'eta': '0.49241435238740405',
                          'max_depth': '9',
                          'min_child_weight': '7.280230839548615',
                          'num_round': '11'}}
In [28]:
best_training_job = tuning_job_result['BestTrainingJob']['TrainingJobName']

response = sm_client.describe_training_job(
    TrainingJobName=best_training_job
)
print('Hyperparameters:')
pprint(response['HyperParameters'], indent = 4)
print('Metrics:')
pprint(response['FinalMetricDataList'], indent = 4)
themodel = response['ModelArtifacts']['S3ModelArtifacts']
print()
print('Model location:',themodel)
Hyperparameters:
{   '_tuning_objective_metric': 'validation:auc',
    'alpha': '0.07181080493333837',
    'eta': '0.49241435238740405',
    'gamma': '4',
    'max_depth': '9',
    'min_child_weight': '7.280230839548615',
    'num_round': '11',
    'objective': 'binary:logistic',
    'silent': '0',
    'subsample': '0.8'}
Metrics:
[   {   'MetricName': 'validation:auc',
        'Timestamp': datetime.datetime(1970, 1, 19, 8, 49, 58, 775000, tzinfo=tzlocal()),
        'Value': 0.8895599842071533},
    {   'MetricName': 'train:auc',
        'Timestamp': datetime.datetime(1970, 1, 19, 8, 49, 58, 775000, tzinfo=tzlocal()),
        'Value': 0.9602640271186829},
    {   'MetricName': 'ObjectiveMetric',
        'Timestamp': datetime.datetime(1970, 1, 19, 8, 49, 58, 775000, tzinfo=tzlocal()),
        'Value': 0.8895599842071533}]

Model location: s3://111111111111-s3-bucket-for-model-training/sagemaker/xgboost-churn/output/sagemaker-xgboost-200416-0049-009-635392a2/output/model.tar.gz
In [29]:
# extract indices of the training and validation AUCs:
train_res = utilities.get_auc_from_metrics(response, "train:auc")
val_res = utilities.get_auc_from_metrics(response, "validation:auc")
In [30]:
Markdown(f"""
The AUC looks pretty good. In our sample run, we saw a training AUC of 
{round(response['FinalMetricDataList'][train_res]['Value'], 3)}, and a validation
AUC of {round(response['FinalMetricDataList'][val_res]['Value'], 3)}.
""")
Out[30]:

The AUC looks pretty good. In our sample run, we saw a training AUC of 0.96, and a validation AUC of 0.89.

Note, due to randomized elements of the XGBoost algorithm and of SageMaker hyperparameter optimization, your results may differ slightly.

At this point, you can use the Analyze HPO Tuning Job sample notebook to further study the effectiveness of the hyperparameter tuning job.
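
For a quick look without leaving this notebook, the SageMaker SDK also provides HyperparameterTuningJobAnalytics; here is a minimal sketch that pulls the per-job results into a dataframe (the column name FinalObjectiveValue is what the SDK returned at the time of writing):

# Sketch: collect the results of every training job in the tuning job into a dataframe.
tuner_analytics = sagemaker.HyperparameterTuningJobAnalytics(tuning_job_name, sagemaker_session = sess)
jobs_df = tuner_analytics.dataframe()
# Sort by the objective metric (validation:auc) to see the best jobs first.
jobs_df.sort_values('FinalObjectiveValue', ascending = False).head()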


Feature Importance

Now that we have a model, we'd like to understand what factors are of highest impact in our model. By understanding these factors, we hope to be able to intervene in order to prevent customer churn.

To analyze the model, we will copy the model artifact saved to the S3 bucket during training onto the notebook instance, and then load it using the load function from the pickle module. The themodel variable (captured from describe_training_job above) provides the location (URI) of the *.tar.gz file in the S3 bucket.

To successfully un-pickle the file, ensure that the installed version of the xgboost library matches the version of the SageMaker XGBoost container used above. (See the first few code blocks of this notebook.)

In [31]:
# copy the model locally to the notebook
!aws s3 cp $themodel .
# unpack the file, this will produce file "xgboost-model"
!tar -zxvf model.tar.gz -C .
download: s3://111111111111-s3-bucket-for-model-training/sagemaker/xgboost-churn/output/sagemaker-xgboost-200416-0049-009-635392a2/output/model.tar.gz to ./model.tar.gz
xgboost-model
In [32]:
model_booster = pkl.load(open("xgboost-model", 'rb'))

model_booster is an object of class xgboost.core.Booster. It has many methods available (see the documentation). However, because the feature names were not available during training, we will first assign them here to replace the non-descriptive default labels ("f0", "f1", ...). Make sure to drop the first column name from the data, as it's the outcome variable (churn).

In [33]:
model_booster.feature_names = list(model_data.columns)[1:]

Let's take a look at the most important features found by this model.

Here we present bar plots with features ranked by their importance. There are five types of feature importance available in the xgboost library: gain, weight, cover, total gain, and total cover. See the API reference for descriptions. For each bar plot we show only the top features.

From the XGBoost tutorial the following explanations are provided for the measures of feature importance:

Gain (a.k.a. Gini feature importance) is the improvement in accuracy brought by a feature to the branches it is on. The idea is that before adding a new split on a feature X to the branch there was some wrongly classified elements, after adding the split on this feature, there are two new branches, and each of these branches is more accurate (one branch saying if your observation is on this branch then it should be classified as 1, and the other branch saying the exact opposite).

Cover measures the relative quantity of observations concerned by a feature.

Frequency (a.k.a. weight) is a simpler way to measure the Gain. It just counts the number of times a feature is used in all generated trees. You should not use it (unless you know why you want to use it).
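
If you prefer raw numbers to plots, the Booster API also exposes these scores directly; here is a minimal sketch using the feature names we assigned above:

# Sketch: raw importance scores straight from the Booster, without plotting.
gain_scores = model_booster.get_score(importance_type = 'gain')
# Show the five features with the highest gain.
for feature, score in sorted(gain_scores.items(), key = lambda kv: kv[1], reverse = True)[:5]:
    print(f"{feature}: {score:.2f}")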
In [34]:
# We defined a custom function plot_feature_importance that plots up to 15 features ranked by their
# importance using either gain, cover or weight. The function relies on xgboost's plot_importance method.

# gain: the average gain across all splits the feature is used in.
# weight: the number of times a feature is used to split the data across all trees.
# cover: the average coverage across all splits the feature is used in.
# total_gain: the total gain across all splits the feature is used in.
# total_cover: the total coverage across all splits the feature is used in.
for metric in ['gain', 'cover', 'weight']:    
    utilities.plot_feature_importance(model_booster, metric)

In our runs, three features frequently dominate: day_mins, cust_service_calls, and international plan usage.

Note, due to randomized elements of the algorithm, your results may differ.

While we can see that these features are the most important, the measure does not tell us the direction of the effect: for example, is a higher number of day_mins predictive of churn, or a lower number? Is int_plan_no (a Boolean value) predictive when True, or when False? For that, additional exploration and analysis is needed. Often, the business will have a good intuition that can be validated against the data.
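
As a small illustration of that kind of exploration (a quick directional check, not a rigorous analysis), we can compare summary statistics between churners and non-churners in the original churn dataframe:

# Sketch: compare churners and non-churners on two of the top-ranked features.
churn.groupby('churn')[['day_mins', 'cust_service_calls']].mean()
# Share of churners among customers with and without an international plan.
pd.crosstab(churn['int_plan'], churn['churn'], normalize = 'index')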

We'd like to better understand what the model is doing. To get some better insight, let's plot some of the model trees.

We'll start with the first tree. In XGBoost, each additional tree is added to maximize the gain, so looking at the first tree should give us some insight into the data.

In [35]:
tree_to_plot = 0  
the_tree = plot_tree(model_booster, num_trees=tree_to_plot, rankdir='LR')
fig_size = plt.gcf().get_size_inches()      #Get current size
sizefactor = 8     #Set a zoom factor
# Modify the current size by the factor
plt.gcf().set_size_inches(sizefactor * fig_size) 

# The plots can be hard to read (known issue). So, separately save it to a PNG, which makes for easier viewing.
fig = plt.gcf()
fig.savefig('tree' + str(tree_to_plot)+'.png')
plt.show()

Interesting! In our runs, we saw the potential to use these trees to suggest or understand different kinds of churners. We can easily plot additional trees, or pull a list of the key splits, such as the top few splits of the various trees. Showing this kind of information to Marketing may give them some ideas for different "churn profiles" amongst customer segments, leading to different targeted incentive programs.

In our runs, we frequently see this first tree splitting cust_service_calls at 3.5. Let's do a little exploration.

In [39]:
churn['csc_4ormore'] = churn.apply(lambda x: 1 if x['cust_service_calls'] >= 4 else 0, axis=1)
pd.crosstab(churn['csc_4ormore'], churn['churn'], normalize='index')
Out[39]:
churn False. True.
csc_4ormore
0 0.887476 0.112524
1 0.483146 0.516854

A little data exploration supports the idea that making four or more calls to Customer Service markedly shifts the percentage of churners. We'll likely want to intervene before a customer makes that fourth call.

We can also draw additional trees. Let's get a sense of the number and size of the other trees in this model. In past XGBoost projects, we've seen that in some cases - primarily where the data has little information, or where the model is not converging well - some, or even most, of the trees may contain only a single leaf, and so add no decision power to the model. Let's see whether that's affecting this model.

In our utilities module, we define functions to calculate the depth of each tree and to count the number of trees at each depth.
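
For readers without the utilities module at hand, a depth calculation over the JSON dump might look roughly like the sketch below (illustrative only; the actual utilities implementation may differ):

# Sketch of a tree-depth helper; the real utilities functions may be implemented differently.
def tree_depth(node):
    # A node without children is a leaf; count it as depth 0.
    if 'children' not in node:
        return 0
    return 1 + max(tree_depth(child) for child in node['children'])

# Example: depth of the first tree in the model.
first_tree = json.loads(model_booster.get_dump(dump_format = 'json')[0])
print(tree_depth(first_tree))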

In [40]:
# Extract all trees in json format
tree_dump = model_booster.get_dump(dump_format = 'json')

# Get depth for each tree
all_depths = utilities.get_depths_as_list(tree_dump)

# Calculate the distribution of tree depths for plotting
unique_tree_count = utilities.calculate_list_unique_elements(all_depths)
In [41]:
print("There are: " + str(len(tree_dump)) + " trees in this forest.")
plt.grid(b = True,  which = 'major', axis = 'y', color = "lightgrey")
tree_depth_barplot = plt.bar(unique_tree_count.keys(), 
                      unique_tree_count.values(), 
                      color = "#4daf4a")
yticks = plt.yticks(np.arange(0, max(unique_tree_count.values())+2, step = 5))
ylabel = plt.ylabel('Trees')
xlabel = plt.xlabel("Depth")
title = plt.title("Distribution of tree depths for all trees in the model")
plt.show()
There are: 11 trees in this forest.

In our test run, several trees contain only a leaf, so it seems some further optimization is possible. But for now we'll move on.

We're really interested in how our features have been used across the trees, to get a sense of which of our features influence the model.

In our utilities module, we define additional functions that find, for each feature, the trees in which it is used to create a split.
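
Again for illustration only (the actual utilities implementation may differ), counting how many trees use each feature in at least one split could look like this:

# Sketch: for each feature, count the trees that use it in at least one split.
def features_in_tree(node, found):
    # Split nodes carry a 'split' key with the feature name; leaves do not.
    if 'split' in node:
        found.add(node['split'])
        for child in node.get('children', []):
            features_in_tree(child, found)
    return found

feature_counts_sketch = {name: 0 for name in model_booster.feature_names}
for tree_json in tree_dump:
    for feature in features_in_tree(json.loads(tree_json), set()):
        feature_counts_sketch[feature] += 1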

In [42]:
feature_counts = utilities.count_trees_with_features(tree_dump, model_booster.feature_names)
# convert it to pandas series object
ser1 = pd.Series(feature_counts)
In [43]:
figsize = plt.figure(figsize = [12,6])
plt.grid(b = True,  which = 'major', axis = 'both', color = "lightgrey")
feature_splits_scatterplot = plt.scatter(ser1.sort_values().index, 
                      ser1.sort_values().values, 
                      color = "#4daf4a")
plt.yticks(np.arange(0, max(ser1.sort_values().values) + 5, step = 5))
plt.xticks(rotation = 90, fontsize = 9)
plt.ylabel('Trees')
plt.xlabel("Features")
plt.title("How many times is each feature used to create a split?")
plt.show();

In the figure above, we observe that the majority of features are not used in any tree split. These features add no value to the model. Here are the features that are used at least once:

In [44]:
ser1[ser1 != 0].sort_values(ascending=False)
Out[44]:
day_mins              10
cust_service_calls     9
eve_mins               9
int_plan_no            7
int_charge             7
night_mins             6
vmail_msg              6
int_calls              5
day_calls              2
night_calls            1
eve_calls              1
dtype: int64
In [45]:
Markdown(f"""
Here, we see opportunity! Whereas the training data had 70 features after converting the categorical
variables to one-hot vectors, only {ser1[ser1 != 0].size} features are being used by the model. 
Perhaps we can simplify the model, which will simplify the calls from the production database as well. 
We'll try that after we've finished evaluating this model.
""")
Out[45]:

Here, we see opportunity! Whereas the training data had 70 features after converting the categorical variables to one-hot vectors, only 11 features are being used by the model. Perhaps we can simplify the model, which will simplify the calls from the production database as well. We'll try that after we've finished evaluating this model.


Model Hosting

Now that we've trained the algorithm, let's create a model and deploy it to a hosted endpoint.

Note that we're using a predefined endpoint_name here, for simplicity. The AWS CloudFormation template setup specified this endpoint_name in an IAM role, and set the Aurora DB cluster group parameter 'aws_default_sagemaker_role' to this IAM role. This combination of settings gives Aurora permission to call the Amazon SageMaker endpoint that we'll be creating here.
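
If you'd like to confirm what, if anything, is currently deployed under that name before updating it, a quick check via the SageMaker API might look like this (a sketch; the endpoint may simply not exist yet):

# Sketch: check whether an endpoint with this exact name already exists.
existing = [ep for ep in boto3.client('sagemaker').list_endpoints(NameContains = endpoint_name)['Endpoints']
            if ep['EndpointName'] == endpoint_name]
if existing:
    print('Endpoint', endpoint_name, 'already exists with status:', existing[0]['EndpointStatus'])
else:
    print('No endpoint named', endpoint_name, 'found; deploy will create it.')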

In [46]:
xgb_predictor = tuner.deploy(initial_instance_count = 1,
                             instance_type = 'ml.m4.xlarge',
                             endpoint_name = endpoint_name,
                             update_endpoint = True)
2020-04-16 00:59:46 Starting - Preparing the instances for training
2020-04-16 00:59:46 Downloading - Downloading input data
2020-04-16 00:59:46 Training - Training image download completed. Training in progress.
2020-04-16 00:59:46 Uploading - Uploading generated training model
2020-04-16 00:59:46 Completed - Training job completedINFO:sagemaker-containers:Imported framework sagemaker_xgboost_container.training
INFO:sagemaker-containers:Failed to parse hyperparameter _tuning_objective_metric value validation:auc to Json.
Returning the value itself
INFO:sagemaker-containers:Failed to parse hyperparameter objective value binary:logistic to Json.
Returning the value itself
INFO:sagemaker-containers:No GPUs detected (normal if no gpus installed)
INFO:sagemaker_xgboost_container.training:Running XGBoost Sagemaker in algorithm mode
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Determined delimiter of CSV input is ','
[00:59:35] 2333x69 matrix with 160977 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
INFO:root:Determined delimiter of CSV input is ','
[00:59:35] 666x69 matrix with 45954 entries loaded from /opt/ml/input/data/validation?format=csv&label_column=0&delimiter=,
INFO:root:Single node training.
INFO:root:Setting up HPO optimized metric to be : auc
INFO:root:Train matrix has 2333 rows
INFO:root:Validation matrix has 666 rows
[0]#011train-auc:0.906742#011validation-auc:0.847646
[1]#011train-auc:0.916985#011validation-auc:0.861682
[2]#011train-auc:0.918123#011validation-auc:0.864372
[3]#011train-auc:0.92949#011validation-auc:0.878123
[4]#011train-auc:0.931562#011validation-auc:0.88033
[5]#011train-auc:0.932719#011validation-auc:0.878295
[6]#011train-auc:0.946275#011validation-auc:0.881957
[7]#011train-auc:0.950941#011validation-auc:0.883363
[8]#011train-auc:0.953178#011validation-auc:0.886478
[9]#011train-auc:0.956034#011validation-auc:0.892601
[10]#011train-auc:0.960264#011validation-auc:0.88956
Training seconds: 65
Billable seconds: 65
-------------!

The following command may take several minutes to respond, while we wait for the endpoint to be created.

In [47]:
xgb_predictor.endpoint
Out[47]:
'ml14-SageMaker-Endpoint'

Model Evaluation

Now that we have a hosted endpoint running, we can make real-time predictions from our model very easily, simply by making an HTTP POST request. But first we'll need to set up serializers and deserializers for passing our test_data NumPy arrays to the model behind the endpoint.

In [48]:
xgb_predictor.content_type = 'text/csv'
xgb_predictor.serializer = csv_serializer
xgb_predictor.deserializer = None

Now, we'll use a simple function to:

  1. Loop over our test dataset
  2. Split it into mini-batches of rows
  3. Convert those mini-batches to CSV string payloads
  4. Retrieve mini-batch predictions by invoking the XGBoost endpoint
  5. Collect predictions and convert from the CSV output our model provides into a NumPy array
In [49]:
def predict(data, predictor_object, rows = 200):
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = ''
    for array in split_array:
        predictions = ','.join([predictions, predictor_object.predict(array).decode('utf-8')])
    return np.fromstring(predictions[1:], sep=',')

predictions = predict(test_data.values[:, 1:], xgb_predictor)

There are many ways to compare the performance of a machine learning model, but let's start simply by comparing actual and predicted values. In this case, we're simply predicting whether the customer churned (1) or not (0), which produces a simple confusion matrix.

In [50]:
confusion_matrix = pd.crosstab(index = test_data.iloc[:, 0], 
                               columns = np.round(predictions), 
                               rownames = ['actual'],
                               colnames = ['predictions'])
confusion_matrix
Out[50]:
predictions 0.0 1.0
actual
0 280 6
1 12 36
In [51]:
Markdown(f"""
_Note, due to randomized elements of the algorithm, your results may differ slightly._

In this run, of the 48 churners, we've correctly predicted {confusion_matrix.iloc[1][1]} of them (true positives).
And, we incorrectly predicted that {confusion_matrix.iloc[0][1]} customers would churn who then ended up not 
doing so (false positives).  There are also {confusion_matrix.iloc[1][0]} customers who ended up churning, 
that we predicted would not (false negatives).
""")
Out[51]:

Note, due to randomized elements of the algorithm, your results may differ slightly.

In this run, of the 48 churners, we've correctly predicted 36 of them (true positives). And, we incorrectly predicted that 6 customers would churn who then ended up not doing so (false positives). There are also 12 customers who ended up churning, that we predicted would not (false negatives).

To further evaluate the model, we'll calculate the accuracy, recall and precision.

In [52]:
# Adapted from here: https://towardsdatascience.com/xgboost-in-amazon-sagemaker-28e5e354dbcd
from sklearn.metrics import roc_auc_score, accuracy_score, precision_score, recall_score

thresh = 0.5
y_pred = predictions
y_true = test_data['churn_True.']

def collect_eval_metrics(true_values, predicted_values, threshold):
    metric_df = pd.DataFrame({"auc":[round(roc_auc_score(true_values, predicted_values), 4)],
                       "accuracy":[round(accuracy_score(true_values,(predicted_values > threshold)) ,4)],
                       "recall":[round(recall_score(true_values, (predicted_values > threshold)), 4)],
                       "precision":[round(precision_score(true_values, (predicted_values > threshold)),4)]})
    return metric_df

collect_eval_metrics(y_true, y_pred, thresh)
Out[52]:
auc accuracy recall precision
0 0.9362 0.9461 0.75 0.8571

In the Feature Importance section above, we discovered that only a subset of the features is used in the trees. If we focus on only these features, will our new model perform worse than, or similarly to, the full model? We will repeat our model training exercise in the next section.


Updated model

We'd like to use the list of features from our model run above, along with our target column (churn_True.), to define the list of features to retain for training. In this case, to reduce the potential for errors in following this blog post, we'll freeze the list of features to match those from one of our runs.

In [55]:
#cols_list = ['churn_True.'] + ser1[ser1 != 0].index.tolist()
cols_list = ['churn_True.'] + ['acc_length', 'vmail_msg', 'day_mins', 'day_calls', 'eve_mins', 'night_mins', 
             'night_calls', 'int_calls', 'int_charge', 'cust_service_calls', 'int_plan_no']
model_data_updated = model_data[cols_list]
In [56]:
model_data_updated.head(n = 10)
model_data_updated.shape
Out[56]:
churn_True. acc_length vmail_msg day_mins day_calls eve_mins night_mins night_calls int_calls int_charge cust_service_calls int_plan_no
0 0 128 25 265.1 110 197.4 244.7 91 3 2.70 1 1
1 0 107 26 161.6 123 195.5 254.4 103 3 3.70 1 1
2 0 137 0 243.4 114 121.2 162.6 104 5 3.29 0 1
3 0 84 0 299.4 71 61.9 196.9 89 7 1.78 2 0
4 0 75 0 166.7 113 148.3 186.9 121 3 2.73 3 0
5 0 118 0 223.4 98 220.6 203.9 118 6 1.70 0 0
6 0 121 24 218.2 88 348.5 212.6 118 7 2.03 3 1
7 0 147 0 157.0 79 103.1 211.8 96 6 1.92 0 0
8 0 117 0 184.5 97 351.6 215.8 90 4 2.35 1 1
9 0 141 37 258.6 84 222.0 326.4 97 5 3.02 0 0
Out[56]:
(3333, 12)

Next, we train the model using the same parameters but with a reduced number of features. This setup is exactly the same as for the full model above.

In [57]:
train_data_updated = train_data[cols_list]
validation_data_updated = validation_data[cols_list]
test_data_updated = test_data[cols_list]

train_data_updated.to_csv('train_updated.csv', header = False, index = False)
validation_data_updated.to_csv('validation_updated.csv', header = False, index = False)
In [58]:
print('Updated model data shape', model_data_updated.shape)
print('Updated train data:', train_data_updated.shape)
print('Updated validation data:', validation_data_updated.shape)
print('Updated test data:', test_data_updated.shape)
Updated model data shape (3333, 12)
Updated train data: (2333, 12)
Updated validation data: (666, 12)
Updated test data: (334, 12)
In [59]:
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train_updated/train_updated.csv')).upload_file('train_updated.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'validation_updated/validation_updated.csv')).upload_file('validation_updated.csv')
In [60]:
s3_input_train_updated = sagemaker.s3_input(s3_data = 's3://{}/{}/train_updated/'.format(bucket, prefix),
                               content_type='csv')
s3_input_validation_updated = sagemaker.s3_input(s3_data = 's3://{}/{}/validation_updated/'.format(bucket, prefix),
                                    content_type='csv')

xgb_updated = sagemaker.estimator.Estimator(container,
                                    role, 
                                    train_instance_count = 1, 
                                    train_instance_type = 'ml.m4.xlarge',
                                    output_path = 's3://{}/{}/output'.format(bucket, prefix),
                                    sagemaker_session = sess)
xgb_updated.set_hyperparameters(max_depth = 5,
                                eta = 0.2,
                                gamma = 4,
                                min_child_weight = 6,
                                subsample = 0.8,
                                silent = 0,
                                objective = 'binary:logistic',
                                num_round = 100)
In [61]:
# We'll use the same hyperparameter_ranges and objective_metric_name as above
tuner_updated = HyperparameterTuner(xgb,
                            objective_metric_name,
                            hyperparameter_ranges,
                            max_jobs = 30,
                            max_parallel_jobs = 3,
                            early_stopping_type = 'Auto'
                           )
In [62]:
tuner_updated.fit({'train': s3_input_train_updated, 'validation': s3_input_validation_updated}, include_cls_metadata=False)
In [63]:
tuning_updated_job_name = tuner_updated.latest_tuning_job.job_name
tuning_updated_job_name
Out[63]:
'sagemaker-xgboost-200416-0153'
In [64]:
tuner_updated.wait()
......................................................................................................................................................................................................................................................................................................................................................................................................................................!

The set of HPO jobs takes approximately 30 minutes to run.

Once the tuning job has completed, you can run the following cells.

In [65]:
# run this cell to check current status of hyperparameter tuning job

tuning_job_updated_result = sm_client.describe_hyper_parameter_tuning_job(HyperParameterTuningJobName = 
                                                                            tuning_updated_job_name)
 
status = tuning_job_updated_result['HyperParameterTuningJobStatus']
if status != 'Completed':
    print('Reminder: the tuning job has not been completed.')
    
job_count = tuning_job_updated_result['TrainingJobStatusCounters']['Completed']
print("%d training jobs have completed" % job_count)
    
is_minimize = (tuning_job_updated_result['HyperParameterTuningJobConfig']['HyperParameterTuningJobObjective']['Type'] != 'Maximize')
objective_name = tuning_job_updated_result['HyperParameterTuningJobConfig']['HyperParameterTuningJobObjective']['MetricName']
30 training jobs have completed

Once again, let's review the best training job's results.

In [66]:
if tuning_job_updated_result.get('BestTrainingJob',None):
    print("Best model found so far:")
    pprint(tuning_job_updated_result['BestTrainingJob'])
else:
    print("No training jobs have reported results yet.")
Best model found so far:
{'CreationTime': datetime.datetime(2020, 4, 16, 2, 18, 24, tzinfo=tzlocal()),
 'FinalHyperParameterTuningJobObjectiveMetric': {'MetricName': 'validation:auc',
                                                 'Value': 0.8950129747390747},
 'ObjectiveStatus': 'Succeeded',
 'TrainingEndTime': datetime.datetime(2020, 4, 16, 2, 21, 26, tzinfo=tzlocal()),
 'TrainingJobArn': 'arn:aws:sagemaker:us-west-2:111111111111:training-job/sagemaker-xgboost-200416-0153-024-f834c59e',
 'TrainingJobName': 'sagemaker-xgboost-200416-0153-024-f834c59e',
 'TrainingJobStatus': 'Completed',
 'TrainingStartTime': datetime.datetime(2020, 4, 16, 2, 20, 24, tzinfo=tzlocal()),
 'TunedHyperParameters': {'alpha': '1.600141741795399',
                          'eta': '0.9498811387431696',
                          'max_depth': '3',
                          'min_child_weight': '3.7719230462705378',
                          'num_round': '56'}}

Let's take a look at the hyperparameters used and metrics returned by the best training job.

In [67]:
best_training_job_updated = tuning_job_updated_result['BestTrainingJob']['TrainingJobName']

response = sm_client.describe_training_job(
    TrainingJobName = best_training_job_updated
)
print('Metrics:')
pprint(response['FinalMetricDataList'],indent=4)
themodel = response['ModelArtifacts']['S3ModelArtifacts']
print('Model location:',themodel)
Metrics:
[   {   'MetricName': 'validation:auc',
        'Timestamp': datetime.datetime(1970, 1, 19, 8, 50, 3, 676000, tzinfo=tzlocal()),
        'Value': 0.8950129747390747},
    {   'MetricName': 'train:auc',
        'Timestamp': datetime.datetime(1970, 1, 19, 8, 50, 3, 676000, tzinfo=tzlocal()),
        'Value': 0.9682490229606628},
    {   'MetricName': 'ObjectiveMetric',
        'Timestamp': datetime.datetime(1970, 1, 19, 8, 50, 3, 676000, tzinfo=tzlocal()),
        'Value': 0.8950129747390747}]
Model location: s3://111111111111-s3-bucket-for-model-training/sagemaker/xgboost-churn/output/sagemaker-xgboost-200416-0153-024-f834c59e/output/model.tar.gz
In [68]:
Markdown(f"""
In this particular run, in a model with only {ser1[ser1 != 0].size} 
features our training and validation AUC values are close to but better than the run with 70 features.
""")
Out[68]:

In this particular run, in a model with only 11 features our training and validation AUC values are close to but better than the run with 70 features.

Host the updated model

Now, we deploy the updated model to the same endpoint_name we used previously.

In [69]:
xgb_predictor_updated = tuner_updated.deploy(initial_instance_count = 1,
                                             instance_type = 'ml.m4.2xlarge',
                                             endpoint_name = endpoint_name,
                                             update_endpoint = True)
2020-04-16 02:21:26 Starting - Preparing the instances for training
2020-04-16 02:21:26 Downloading - Downloading input data
2020-04-16 02:21:26 Training - Training image download completed. Training in progress.
2020-04-16 02:21:26 Uploading - Uploading generated training model
2020-04-16 02:21:26 Completed - Training job completedINFO:sagemaker-containers:Imported framework sagemaker_xgboost_container.training
INFO:sagemaker-containers:Failed to parse hyperparameter _tuning_objective_metric value validation:auc to Json.
Returning the value itself
INFO:sagemaker-containers:Failed to parse hyperparameter objective value binary:logistic to Json.
Returning the value itself
INFO:sagemaker-containers:No GPUs detected (normal if no gpus installed)
INFO:sagemaker_xgboost_container.training:Running XGBoost Sagemaker in algorithm mode
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Determined delimiter of CSV input is ','
[02:21:16] 2333x11 matrix with 25663 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
INFO:root:Determined delimiter of CSV input is ','
[02:21:16] 666x11 matrix with 7326 entries loaded from /opt/ml/input/data/validation?format=csv&label_column=0&delimiter=,
INFO:root:Single node training.
INFO:root:Setting up HPO optimized metric to be : auc
INFO:root:Train matrix has 2333 rows
INFO:root:Validation matrix has 666 rows
[0]#011train-auc:0.846269#011validation-auc:0.786838
[1]#011train-auc:0.903795#011validation-auc:0.848128
[2]#011train-auc:0.925178#011validation-auc:0.861347
[3]#011train-auc:0.933536#011validation-auc:0.868018
[4]#011train-auc:0.933536#011validation-auc:0.868018
[5]#011train-auc:0.94208#011validation-auc:0.876357
[6]#011train-auc:0.942367#011validation-auc:0.877518
[7]#011train-auc:0.948978#011validation-auc:0.888694
[8]#011train-auc:0.95148#011validation-auc:0.89498
[9]#011train-auc:0.95148#011validation-auc:0.89498
[10]#011train-auc:0.95148#011validation-auc:0.89498
[11]#011train-auc:0.95148#011validation-auc:0.89498
[12]#011train-auc:0.95148#011validation-auc:0.89498
[13]#011train-auc:0.955478#011validation-auc:0.893623
[14]#011train-auc:0.957692#011validation-auc:0.895078
[15]#011train-auc:0.957692#011validation-auc:0.895078
[16]#011train-auc:0.958562#011validation-auc:0.887508
[17]#011train-auc:0.958562#011validation-auc:0.887508
[18]#011train-auc:0.958562#011validation-auc:0.887508
[19]#011train-auc:0.958562#011validation-auc:0.887508
[20]#011train-auc:0.958081#011validation-auc:0.887745
[21]#011train-auc:0.958081#011validation-auc:0.887745
[22]#011train-auc:0.960038#011validation-auc:0.881066
[23]#011train-auc:0.960038#011validation-auc:0.881066
[24]#011train-auc:0.959638#011validation-auc:0.897163
[25]#011train-auc:0.965398#011validation-auc:0.893697
[26]#011train-auc:0.965398#011validation-auc:0.893697
[27]#011train-auc:0.965398#011validation-auc:0.893697
[28]#011train-auc:0.965398#011validation-auc:0.893697
[29]#011train-auc:0.965398#011validation-auc:0.893697
[30]#011train-auc:0.965398#011validation-auc:0.893697
[31]#011train-auc:0.965398#011validation-auc:0.893697
[32]#011train-auc:0.968249#011validation-auc:0.895013
[33]#011train-auc:0.968249#011validation-auc:0.895013
[34]#011train-auc:0.968249#011validation-auc:0.895013
[35]#011train-auc:0.968249#011validation-auc:0.895013
[36]#011train-auc:0.968249#011validation-auc:0.895013
[37]#011train-auc:0.968249#011validation-auc:0.895013
[38]#011train-auc:0.968249#011validation-auc:0.895013
[39]#011train-auc:0.968249#011validation-auc:0.895013
[40]#011train-auc:0.968249#011validation-auc:0.895013
[41]#011train-auc:0.968249#011validation-auc:0.895013
[42]#011train-auc:0.968249#011validation-auc:0.895013
[43]#011train-auc:0.968249#011validation-auc:0.895013
[44]#011train-auc:0.968249#011validation-auc:0.895013
[45]#011train-auc:0.968249#011validation-auc:0.895013
[46]#011train-auc:0.968249#011validation-auc:0.895013
[47]#011train-auc:0.968249#011validation-auc:0.895013
[48]#011train-auc:0.968249#011validation-auc:0.895013
[49]#011train-auc:0.968249#011validation-auc:0.895013
[50]#011train-auc:0.968249#011validation-auc:0.895013
[51]#011train-auc:0.968249#011validation-auc:0.895013
[52]#011train-auc:0.968249#011validation-auc:0.895013
[53]#011train-auc:0.968249#011validation-auc:0.895013
[54]#011train-auc:0.968249#011validation-auc:0.895013
[55]#011train-auc:0.968249#011validation-auc:0.895013
Training seconds: 62
Billable seconds: 62
-------------!

Evaluate and compare models

In [70]:
xgb_predictor_updated.content_type = 'text/csv'
xgb_predictor_updated.serializer = csv_serializer
xgb_predictor_updated.deserializer = None

predictions_updated = predict(test_data_updated.values[:, 1:], xgb_predictor_updated)
pd.crosstab(index = test_data_updated.iloc[:, 0], 
            columns = np.round(predictions_updated), 
            rownames = ['actual'],
            colnames = ['predictions'])
Out[70]:
predictions 0.0 1.0
actual
0 281 5
1 14 34

For our run, the new model performs comparably to the original (slightly better precision, slightly lower recall in this case). And, because we're using far fewer features, we'll need to pass less data with each prediction request.

Let's look at some metrics for our two models.

In [72]:
thresh = 0.5
y_pred_updated = predictions_updated
y_true_updated = test_data_updated['churn_True.']

print('Updated model:')
collect_eval_metrics(y_true_updated, y_pred_updated, thresh)
print('Original model:')
collect_eval_metrics(y_true, y_pred, thresh)
Updated model:
Out[72]:
auc accuracy recall precision
0 0.9215 0.9431 0.7083 0.8718
Original model:
Out[72]:
auc accuracy recall precision
0 0.9362 0.9461 0.75 0.8571

In our runs, the two models had very similar results, with slight shifts: sometimes one had better precision but worse recall than the other, and the accuracies were very close.

Another way to compare the two models is via a ROC curve. That will show how the AUC (area under the curve) differs between the two models.

In [73]:
from sklearn import metrics
fpr1, tpr1, _ = metrics.roc_curve(y_true, y_pred)
fpr2, tpr2, _ = metrics.roc_curve(y_true_updated, y_pred_updated)
auc_title = plt.title("ROC Curve")
auc_full_model = plt.plot(fpr1, tpr1,
                          color = 'blue',
                          label = "full model")
auc_updated_model = plt.plot(fpr2, tpr2,
                             color = 'green',
                             label = "updated model")
auc_legend = plt.legend(loc = 'lower right')
random_guess = plt.plot([0,1],[0,1],'r--')
xlim = plt.xlim([-0.1,1.1])
ylim = plt.ylim([-0.1,1.1])
ylabel = plt.ylabel('True Positive Rate')
xlabel = plt.xlabel('False Positive Rate')
plt.show()
In [74]:
Markdown(f"""
The updated model provides results very close to the original model; on some runs slightly better, on others, 
slightly worse. 

As the difference between the two models is below our margin of error, and the updated model requires 
far less data to be passed in the call ({ser1[ser1 != 0].size} values rather than 70), we will use the 
updated model.

Lastly, here's the list of columns used for the final model. This list will be used to create the 
SQL function that calls the model for inference. Creating that will be the task of the Part 3 notebook.
""")
Out[74]:

The updated model provides results very close to the original model; on some runs slightly better, on others, slightly worse.

As the difference between the two models is below our margin of error, and the updated model requires far less data to be passed in the call (11 values rather than 70), we will use the updated model.

Lastly, here's the list of columns used for the final model. This list will be used to create the SQL function that calls the model for inference. Creating that will be the task of the Part 3 notebook.

In [68]:
print(test_data_updated.columns.tolist()[1:])
['acc_length', 'vmail_msg', 'day_mins', 'day_calls', 'eve_mins', 'night_mins', 'night_calls', 'int_calls', 'int_charge', 'cust_service_calls', 'int_plan_no']

Assessing business impact

We can also assess the model performance by looking at the prediction scores and refining the threshold used to decide if someone is a churner. While it’s usual to treat this as a binary classification (‘1’ or ‘0’), in fact, the real world is less binary: people become “likely to churn” for some time before they actually churn. Loss of “brand loyalty” occurs some time before someone actually buys from a competitor.

Let's begin by looking at the predicted churners.

In [75]:
pd.crosstab(index=test_data_updated.iloc[:, 0], columns=np.round(predictions_updated), rownames=['actual'], colnames=['predictions'])
Out[75]:
predictions 0.0 1.0
actual
0 281 5
1 14 34

Note, due to randomized elements of the algorithm, your results may differ slightly.

Of the 48 churners, we've correctly predicted 34 of them (true positives). We incorrectly predicted that 5 customers would churn who then ended up staying (false positives). And 14 customers ended up churning whom we predicted would not (false negatives).

An important point here is that because of the np.round() function above we are using a simple threshold (or cutoff) of 0.5. Our predictions from xgboost come out as continuous values between 0 and 1 and we force them into the binary classes that we began with. However, because a customer that churns is expected to cost the company more than proactively trying to retain a customer who we think might churn, we should consider adjusting this cutoff. That will almost certainly increase the number of false positives, but it can also be expected to increase the number of true positives and reduce the number of false negatives.
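As a minimal sketch of this idea (reusing predictions_updated and test_data_updated from the cells above; the helper name confusion_at_cutoff is ours and is not part of the notebook's code), we can count the four outcomes at any cutoff we like:

# Hypothetical helper: count TN/FP/FN/TP at an arbitrary cutoff.
# Assumes `predictions_updated` holds churn scores in [0, 1] and the first
# column of `test_data_updated` holds the actual churn labels (0/1).
import numpy as np

def confusion_at_cutoff(y_true, scores, cutoff):
    y_hat = np.where(scores > cutoff, 1, 0)
    return {
        'tn': int(np.sum((y_true == 0) & (y_hat == 0))),
        'fp': int(np.sum((y_true == 0) & (y_hat == 1))),
        'fn': int(np.sum((y_true == 1) & (y_hat == 0))),
        'tp': int(np.sum((y_true == 1) & (y_hat == 1))),
    }

# Example: compare the default 0.5 cutoff with a lower one
y_actual = test_data_updated.iloc[:, 0].values
print(confusion_at_cutoff(y_actual, predictions_updated, 0.5))
print(confusion_at_cutoff(y_actual, predictions_updated, 0.3))

A helper like this makes it easy to sweep candidate cutoffs, which is exactly what we do in the cost analysis later in this notebook.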

To get a rough intuition here, let's look at the continuous values of our predictions.

In [71]:
predict_hist = plt.hist(predictions_updated, color = "#4daf4a")
plt.xlabel('Prediction')
plt.ylabel('Number of predictions')
plt.title('Prediction Score Distribution')
plt.show()
[Figure: histogram of prediction scores, titled 'Prediction Score Distribution']

The continuous valued predictions coming from our model tend to skew toward 0 or 1, but there is sufficient mass between 0.1 and 0.9 that adjusting the cutoff should indeed shift a number of customers' predictions. For example...

In [76]:
pd.crosstab(index=test_data.iloc[:, 0], columns=np.where(predictions > 0.3, 1, 0))
Out[76]:
col_0 0 1
churn_True.
0 272 14
1 9 39

We can see that changing the cutoff from 0.5 to 0.3 results in 5 more true positives, 9 more false positives, and 5 fewer false negatives. The absolute numbers are small, but a meaningful share of customers shift category just because of a change to the cutoff. Was this the right decision? We may end up retaining 5 extra customers, but we also unnecessarily incentivize 9 more customers who would have stayed. Determining the optimal cutoff is a key step in properly applying machine learning in a real-world setting. Let's discuss this more broadly and then apply a specific, hypothetical solution for our current problem.

Relative cost of errors

Any practical binary classification problem is likely to produce a similarly sensitive cutoff. That by itself isn’t a problem. After all, if the scores for two classes are really easy to separate, the problem probably isn’t very hard to begin with and might even be solvable with simple rules instead of ML.

More important, if I put an ML model into production, there are costs associated with the model erroneously assigning false positives and false negatives. I also need to look at similar costs associated with correct predictions of true positives and true negatives. Because the choice of the cutoff affects all four of these statistics, I need to consider the relative costs to the business for each of these four outcomes for each prediction.

Assigning costs

What are the costs for our problem of mobile operator churn? The costs, of course, depend on the specific actions that the business takes. Let's make some assumptions here.

First, we'll assign the true negatives the cost of $0. Our model essentially correctly identified a happy customer in this case, and we won’t offer them an incentive. An alternative is to assign the true negatives the actual value of the customer's spend, as this is the customer's contribution to our overall revenue.

False negatives are the most problematic, because they incorrectly predict that a churning customer will stay. We lose the customer and will have to pay all the costs of acquiring a replacement customer, including foregone revenue, advertising costs, administrative costs, point of sale costs, and likely a phone hardware subsidy. Our marketing department should be able to give us a value to use here for the overhead, and we have the actual customer spend for each customer in our dataset.

In the meantime, a quick search on the Internet suggests that such costs typically run in the hundreds of dollars, so for now let's assume $500. This is the cost we'll use for each false negative.

Finally, for customers that our model identifies as churning, we'll be giving them an incentive. At this point Marketing has not yet told us the incentives they'd like to use, so let's assume a retention incentive in the amount of $50. This is the cost we'll apply to both true positive and false positive outcomes. In the case of false positives (the customer is happy, but the model mistakenly predicted churn), we will “waste” the concession. We probably could have spent those dollars more effectively, but it's possible we increased the loyalty of an already loyal customer, so that’s not so bad.

Once we have the new incentive programs from Marketing, we can rerun this analysis, using the actual customer spend and the planned incentives to calculate the cost of each prediction. That will allow us to provide an economic model of the effect of the proposed incentive program when it's put into production along with the predictions from this model.
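As a hedged sketch of what that rerun might look like (the monthly_spend input and the dollar figures are placeholders until Marketing provides real numbers; none of these names exist in the current dataset), the cost of a set of predictions could be computed per customer rather than with flat averages:

# Hypothetical per-customer cost calculation; column names and dollar figures are placeholders.
import numpy as np

def per_customer_cost(y_true, y_pred, monthly_spend,
                      replacement_overhead=500.0, incentive=50.0,
                      months_of_lost_revenue=12):
    """Total cost of a set of churn predictions under a hypothetical cost model."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    monthly_spend = np.asarray(monthly_spend, dtype=float)

    lost = (y_true == 1) & (y_pred == 0)      # false negatives: customer leaves unnoticed
    offered = (y_pred == 1)                   # true + false positives: we pay the incentive

    # Replacement overhead plus (for illustration) a year of foregone revenue per lost customer
    cost = np.sum(lost * (replacement_overhead + months_of_lost_revenue * monthly_spend))
    cost += np.sum(offered * incentive)
    return float(cost)

# Example call (customer_spend_array is a hypothetical per-customer spend vector):
# per_customer_cost(test_data_updated.iloc[:, 0].values,
#                   np.where(predictions_updated > 0.5, 1, 0),
#                   monthly_spend=customer_spend_array)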

Finding the optimal threshold

It’s clear that false negatives are substantially more costly than false positives. Instead of simply minimizing the number of misclassified customers, we should be minimizing a cost function that looks like this:

cost_of_replacing_customer * FN(C) + customer_value * TN(C) + incentive_offered * FP(C) + incentive_offered * TP(C)

FN(C) is the number of false negatives as a function of the cutoff, C, and similarly for TN, FP, and TP. We need to find the cutoff, C, where the result of the expression is smallest.

For even better outcomes, we could offer a range of incentives to customers that meet different criteria. For example, it's worth more to the business to prevent a high-spend customer from churning than a low-spend customer. We could also target the "grey area" of customers that have less loyalty and could be swayed by another company's advertising. This blog and this one provide some examples and ideas on how this can be accomplished; a rough sketch follows below.
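As a purely illustrative sketch of tiered incentives (the tier boundaries and amounts below are made up, not values from Marketing):

# Hypothetical tiered-incentive rule: thresholds and amounts are placeholders.
# Assumes a churn score in [0, 1] and a monthly spend figure per customer.
def choose_incentive(churn_score, monthly_spend):
    if churn_score < 0.3:
        return 0.0                          # low risk: no offer
    if churn_score < 0.7:
        return 25.0                         # "grey area": a modest offer
    # high risk: scale the offer with the customer's value to us, capped at $100
    return min(100.0, 0.5 * monthly_spend)

print(choose_incentive(0.2, 60.0))   # 0.0
print(choose_incentive(0.5, 60.0))   # 25.0
print(choose_incentive(0.9, 60.0))   # 30.0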

For now, we'll start by using the same values for all customers, to give us a starting point for discussion with the business. With these estimates, the cost function becomes:

$500 * FN(C) + $0 * TN(C) + $50 * FP(C) + $50 * TP(C)

A straightforward way to do this is to run a simulation over a large number of possible cutoffs. The loop below tests cutoffs from 0.01 to 0.99 in steps of 0.01.

In [86]:
import matplotlib.ticker as ticker
cutoffs = np.arange(0.01, 1, 0.01)
costs = []
# Per-prediction costs from the assumptions above
fn = 500   # false negative: lost customer we must replace
tn = 0     # true negative: happy customer, no action taken
fp = 50    # false positive: incentive "wasted" on a loyal customer
tp = 50    # true positive: incentive offered to a likely churner
for c in cutoffs:
    # crosstab rows are actual (0, 1) and columns are predicted (0, 1),
    # so the matching cost matrix is [[tn, fp], [fn, tp]]
    costs.append(np.sum(np.sum(np.array([[tn, fp], [fn, tp]]) * 
                               pd.crosstab(index=test_data_updated.iloc[:, 0], 
                                           columns=np.where(predictions_updated > c, 1, 0)))))

costs = np.array(costs)
fig, ax = plt.subplots(1, 1)
plt.plot(cutoffs, costs)
fmt = '${x:,.0f}'
tick = ticker.StrMethodFormatter(fmt)
ax.yaxis.set_major_formatter(tick) 
plt.xlabel('Threshold')
plt.ylabel('Cost')
plt.title('Threshold versus Cost')
plt.show()
print('Cost is minimized near a cutoff of:', cutoffs[np.argmin(costs)], 'for a cost of: $', np.min(costs))
[Figure: line plot of 'Threshold versus Cost' (Threshold on the x-axis, Cost on the y-axis)]
Cost is minimized near a cutoff of: 0.12 for a cost of: $ 5950

The above chart shows how picking a threshold too low results in costs skyrocketing as all customers are given a retention incentive. Meanwhile, setting the threshold too high (e.g., 0.7 or above) results in too many lost customers, which ultimately grows to be nearly as costly. In between, there is a large "grey" area, where perhaps some more nuanced incentives would create better outcomes.

The overall cost can be minimized at $5,950 by setting the cutoff to 0.12, which is substantially better than the $20k+ I would expect to lose by not taking any action (doing nothing means losing all 48 churners at roughly $500 each, about $24,000 on this test set).


Summary

Now, we have a working model, with an endpoint we can call that will tell us whether a specific customer is predicted to churn.

A number of techniques can provide deeper explanations of this model, and of specific predictions; popular examples include SHAP and LIME.
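As a hedged sketch only (assuming the trained XGBoost booster is available locally as booster, the training features as a pandas DataFrame train_features, and that the shap package has been installed; neither name is defined in this notebook under those exact names), SHAP values could be computed like this:

# Hedged sketch: `booster` and `train_features` are assumed names, not defined above.
# shap would need to be installed first, e.g.:
# !{sys.executable} -m pip install shap
import shap

explainer = shap.TreeExplainer(booster)            # TreeExplainer works with an xgboost Booster
shap_values = explainer.shap_values(train_features)

# Global view: which features push predictions toward churn, and in which direction
shap.summary_plot(shap_values, train_features)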

There are also a number of opportunities to further improve this model. For example, by evaluating the model based on the economic cost of losing versus keeping a customer, we can shift the model's error rates between false positives and false negatives. We can also optimize how much we're willing to spend in order to keep a customer. See the blogs referenced earlier for ideas.

Other means of extending it include:

  • Some customers who receive retention incentives will still churn. Including a probability of churning despite receiving an incentive in our cost function would provide a better ROI on our retention programs.
  • Customers who switch to a lower-priced plan or who deactivate a paid feature represent different kinds of churn that could be modeled separately.
  • Modeling the evolution of customer behavior. If usage is dropping and the number of calls placed to Customer Service is increasing, you are more likely to experience churn than if the trend is the opposite. A customer profile should incorporate behavior trends (see the sketch after this list).
  • Actual training data and monetary cost assignments could be more complex.
  • Multiple models for each type of churn could be needed.
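As a rough sketch of the trend-modeling bullet above (the monthly usage table and its columns are hypothetical; nothing like them exists in the current dataset), month-over-month deltas could be built with pandas and joined onto the feature set:

# Hypothetical monthly usage history; the table and columns are illustrative only.
import pandas as pd

monthly = pd.DataFrame({
    'cust_id':            [1, 1, 1, 2, 2, 2],
    'month':              [1, 2, 3, 1, 2, 3],
    'day_mins':           [210.0, 180.0, 150.0, 190.0, 195.0, 200.0],
    'cust_service_calls': [0, 2, 4, 1, 1, 0],
})

# Month-over-month deltas per customer: falling usage plus rising service calls
# is the trend described in the bullet above as a churn signal.
monthly = monthly.sort_values(['cust_id', 'month'])
monthly['day_mins_delta'] = monthly.groupby('cust_id')['day_mins'].diff()
monthly['service_calls_delta'] = monthly.groupby('cust_id')['cust_service_calls'].diff()

# Latest month's trend per customer, ready to join onto the training features
trend_features = monthly.groupby('cust_id').tail(1)[['cust_id', 'day_mins_delta', 'service_calls_delta']]
print(trend_features)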

Regardless of the additional complexity, the principles described in this notebook are likely to apply.

Next Steps

Now, move on to the next (and last) notebook - Part 3. In Part 3, you'll connect the updated endpoint you just created to Aurora, and call it from your 'production workflow'.


Clean-up (AFTER you've run the Part 3 notebook)

If you're ready to be done with this use case, please run the cell below. This will remove the hosted endpoint you created and avoid any charges from a stray instance being left on.

However, if you remove the endpoint before you've completed Part 3, you will not be able to make inferences against it!

sagemaker.Session().delete_endpoint(xgb_predictor.endpoint)
sagemaker.Session().delete_endpoint(xgb_predictor_updated.endpoint)