Model development: customer churn prediction

The purpose of this notebook is to reproducibly and defensibly document the development of a model to predict customer churn at Two Providores. EDA and production are not documented in this notebook.

Target feature definition

Two Providores is a business-to-business food distributor. Because the company records all transactions against customer accounts, the start of the customer relationship is clearly defined. For completeness and relevance, this model includes only those customers who started purchasing from the business after 1 January 2008.

The end of the customer relationship, by contrast, is not always so clearly defined. In this model, customer 'churn' is defined probabilistically: when the interval between the last invoice and the present date is greater than the 99th percentile of the corresponding exponential probability distribution function.

It is worth noting here that instances of 'involuntary' customer churn—for example where a business closed, defaulted on payment, or became insolvent—have been removed from the dataset on which this model is trained.

Feature set selection

The possible feature set for model training was significantly restricted due to longstanding organisational practices at Two Providores of deleting data associated with old suppliers, products, customers, and customer contacts. One attribute that remains—customer pricing group—is strongly correlated with the time at which the customer relationship started, and has been excluded from the model to prevent overfitting on the training dataset.

Thus, for this model, the feature set consisted of:

  • qualitative customer attributes, namely business category
  • temporal customer attributes, namely customer age
  • quantitative transaction attributes, namely transaction value and number of items
  • temporal transaction attributes, namely transaction frequency

For the temporal and quantitative attributes described above, these features were measured at a range of different intervals (at the start, on average, and most recently) for each customer, and the most relevant intervals where selected inductively from various model results (more below).

Data sources

The model is trained on a rectangular dataset built in several steps in order to reduce computational expenses:

  • customer ID and metadata (viz. business category) is pulled from SQL Server and cached in the allCustomerData dataframe
  • transaction history is pulled from SQL Server and cached in the salesData dataframe
  • transaction history is aggregated and cached in the invoices dataframe
  • the customers dataframe is built from the allCustomerData dataframe and successive transformations of data from the invoices dataframe
In [1]:
import warnings
warnings.filterwarnings('ignore')
In [2]:
# SQL pulls

import pyodbc
import pandas as pd

connectionString = """DRIVER={SQL Server};
                      SERVER=****;
                      DATABASE=****;
                      UID=****;
                      PWD=****"""

allCustomerDataQuery = """SELECT RTRIM(Customers.UniqueID) AS CustomerID
                            , RTRIM(Customers.CustomerTitle) AS CustomerTitle
                            , RTRIM(Customers.ZCCategory) AS BusinessCategory
                            FROM Customers;"""
conn = pyodbc.connect(connectionString)
allCustomerData = pd.read_sql_query(allCustomerDataQuery, conn)
conn.close()

salesDataQuery = """SET NOCOUNT ON;

                    DECLARE @Customers table(
                      CustomerID char(17)
                    );
                    
                    INSERT INTO @Customers
                    SELECT AccountID
                    FROM (
                      SELECT DISTINCT RTRIM(TransHeaders.AccountID) AS AccountID
                        , RTRIM(Customers.CustomerTitle) AS [Customer title]
                        , MIN(TransHeaders.TradingDate) OVER (PARTITION BY TransHeaders.AccountID) AS [First sale]
                        , MAX(TransHeaders.TradingDate) OVER (PARTITION BY TransHeaders.AccountID) AS [Last sale]
                      FROM TransHeaders
                        INNER JOIN Customers ON Customers.UniqueID = TransHeaders.AccountID
                      WHERE TransHeaders.TransactionType IN ('CI', 'CX')
                        AND Customers.ZCCategory != 'Staff'
                        AND (Customers.ZReasonInactive IS NULL
                          OR Customers.ZReasonInactive = 'Customer stopped buying')
                    ) AS DistinctCustomers
                    WHERE [First sale] >= CONVERT(date, '2008-01-01')
                    ORDER BY AccountID;
                    
                    SELECT RTRIM(TransHeaders.AccountID) AS CustomerID
                      , RTRIM(TransDetails.ItemAcID) AS ProductID
                      , TransDetails.LocalGross * -1 AS LocalGross
                      , RTRIM(TransHeaders.TransactionID) AS TransactionID
                      , TransHeaders.TradingDate
                    FROM TransDetails
                      INNER JOIN TransHeaders ON TransHeaders.TransactionID = TransDetails.TransactionID
                        AND TransHeaders.CurrentVersionNo = TransDetails.CurrentVersionNo
                    WHERE TransHeaders.TransactionType IN ('CI', 'CX')
                      AND TransHeaders.AccountID IN (
                        SELECT CustomerID FROM @Customers
                      );"""
conn = pyodbc.connect(connectionString)
salesData = pd.read_sql_query(salesDataQuery, conn)
conn.close()
In [3]:
# transaction data aggregated and cached in invoices dataframe 

import numpy as np

indexLevels = (
        salesData[['CustomerID', 'TradingDate']]
        .drop_duplicates()
        .sort_values(['CustomerID', 'TradingDate'])
        )
invoices = (
        pd.DataFrame(index=[indexLevels['CustomerID'], indexLevels['TradingDate']])
        )
invoices['LocalGross'] = (
        salesData.sort_values('TradingDate')
        .groupby(['CustomerID', 'TradingDate'])['LocalGross']
        .aggregate(np.sum)
        )
invoices['noSKUs'] = (
        salesData.sort_values('TradingDate')
        .groupby(['CustomerID', 'TradingDate'])['ProductID']
        .nunique()
        )
invoices = invoices.reset_index(level=1)
invoices['DaysBetweenInvoices'] = (
        (invoices['TradingDate'] - invoices.groupby(invoices.index)['TradingDate'].shift(+1))
        .dt.days
        )
In [4]:
# customers dataframe built from allCustomerData and invoices dataframes

customers = (
        pd.DataFrame(index=
            salesData['CustomerID']
            .drop_duplicates()
            .sort_values()
            .values)
        .merge(allCustomerData, how='inner', right_on='CustomerID', left_index=True)
        .set_index('CustomerID')
        )
customers['firstSale'] = (
        invoices.groupby(invoices.index)['TradingDate']
        .aggregate(np.min)
        )
customers['lastSale'] = (
        invoices.groupby(invoices.index)['TradingDate']
        .aggregate(np.max)
        )
customers['daysLifetime'] = (
        ((customers['lastSale'] - customers['firstSale']).dt.days) + 1
        )
customers['noDaysInvoiced'] = (
        invoices.groupby(invoices.index)['TradingDate']
        .nunique()
        )

Data transformation

As described above, this model was developed iteratively:

  • from hypotheses about customer churn behaviour, numerous possible feature were identified
  • from raw data, an excess of possible features was produced agnostically by reproducing summary statistics over a sequence of parameters
  • from these features, the most important features were selected inductively from model testing

To facilitate the production of an excess of features from raw data, the following two functions are used extensively to produce sequences of summary statistics.

In [5]:
def averageLastN(n, column):
    return (invoices
            .groupby(invoices.index)
            .apply(pd.DataFrame.nlargest, n, 'TradingDate')[column]
            .groupby(level=0)
            .aggregate(np.nanmean))
def averageFirstN(n, column):
    return (invoices
            .groupby(invoices.index)
            .apply(pd.DataFrame.nsmallest, n, 'TradingDate')[column]
            .groupby(level=0)
            .aggregate(np.nanmean))
In [6]:
# customer ordering frequency and temporal changes in frequency
customers['avgDaysBetweenInvoices'] = (
         invoices.groupby(invoices.index)['DaysBetweenInvoices']
         .aggregate(np.mean)
        )
customers['avgDaysBetweenInvoices'] = (
        customers.groupby('BusinessCategory')['avgDaysBetweenInvoices']
        .transform(lambda x: x.fillna(x.mean()))
        )
customers['SDDaysBetweenInvoices'] = (
        invoices.groupby(invoices.index)['DaysBetweenInvoices']
        .aggregate(np.std)
        )
customers['intervalLast1'] = averageLastN(1, 'DaysBetweenInvoices')
customers['intervalLast2'] = averageLastN(2, 'DaysBetweenInvoices')
customers['intervalLast3'] = averageLastN(3, 'DaysBetweenInvoices')
customers['intervalLast4'] = averageLastN(4, 'DaysBetweenInvoices')
customers['intervalLast5'] = averageLastN(5, 'DaysBetweenInvoices')
customers['intervalLast6'] = averageLastN(6, 'DaysBetweenInvoices')
customers['intervalChangeL1'] = customers['intervalLast1'] / customers['avgDaysBetweenInvoices']
customers['intervalChangeL2'] = customers['intervalLast2'] / customers['avgDaysBetweenInvoices']
customers['intervalChangeL3'] = customers['intervalLast3'] / customers['avgDaysBetweenInvoices']
customers['intervalChangeL4'] = customers['intervalLast4'] / customers['avgDaysBetweenInvoices']
customers['intervalChangeL5'] = customers['intervalLast5'] / customers['avgDaysBetweenInvoices']
customers['intervalChangeL6'] = customers['intervalLast6'] / customers['avgDaysBetweenInvoices']
customers['intervalFirst2'] = averageFirstN(2, 'DaysBetweenInvoices')
customers['intervalFirst3'] = averageFirstN(3, 'DaysBetweenInvoices')
customers['intervalFirst4'] = averageFirstN(4, 'DaysBetweenInvoices')
customers['intervalFirst5'] = averageFirstN(5, 'DaysBetweenInvoices')
customers['intervalFirst6'] = averageFirstN(6, 'DaysBetweenInvoices')
customers['intervalChangeF2'] = customers['intervalFirst2'] / customers['avgDaysBetweenInvoices']
customers['intervalChangeF3'] = customers['intervalFirst3'] / customers['avgDaysBetweenInvoices']
customers['intervalChangeF4'] = customers['intervalFirst4'] / customers['avgDaysBetweenInvoices']
customers['intervalChangeF5'] = customers['intervalFirst5'] / customers['avgDaysBetweenInvoices']
customers['intervalChangeF6'] = customers['intervalFirst6'] / customers['avgDaysBetweenInvoices']
In [7]:
# customer basket size and temporal changes in size
customers['avgNoSkusPerDay'] = (
        invoices.groupby(invoices.index)['noSKUs']
        .aggregate(np.average)
        )
customers['SDNoSkus'] = (
        invoices.groupby(invoices.index)['noSKUs']
        .aggregate(np.std)
        )
customers['skusLast1'] = averageLastN(1, 'noSKUs')
customers['skusLast2'] = averageLastN(2, 'noSKUs')
customers['skusLast3'] = averageLastN(3, 'noSKUs')
customers['skusLast4'] = averageLastN(4, 'noSKUs')
customers['skusLast5'] = averageLastN(5, 'noSKUs')
customers['skusLast6'] = averageLastN(6, 'noSKUs')
customers['skusChangeL1'] = customers['skusLast1'] / customers['avgNoSkusPerDay']
customers['skusChangeL2'] = customers['skusLast2'] / customers['avgNoSkusPerDay']
customers['skusChangeL3'] = customers['skusLast3'] / customers['avgNoSkusPerDay']
customers['skusChangeL4'] = customers['skusLast4'] / customers['avgNoSkusPerDay']
customers['skusChangeL5'] = customers['skusLast5'] / customers['avgNoSkusPerDay']
customers['skusChangeL6'] = customers['skusLast6'] / customers['avgNoSkusPerDay']
customers['skusFirst1'] = averageFirstN(1, 'noSKUs')
customers['skusFirst2'] = averageFirstN(2, 'noSKUs')
customers['skusFirst3'] = averageFirstN(3, 'noSKUs')
customers['skusFirst4'] = averageFirstN(4, 'noSKUs')
customers['skusFirst5'] = averageFirstN(5, 'noSKUs')
customers['skusFirst6'] = averageFirstN(6, 'noSKUs')
customers['skusChangeF1'] = customers['skusFirst1'] / customers['avgNoSkusPerDay']
customers['skusChangeF2'] = customers['skusFirst2'] / customers['avgNoSkusPerDay']
customers['skusChangeF3'] = customers['skusFirst3'] / customers['avgNoSkusPerDay']
customers['skusChangeF4'] = customers['skusFirst4'] / customers['avgNoSkusPerDay']
customers['skusChangeF5'] = customers['skusFirst5'] / customers['avgNoSkusPerDay']
customers['skusChangeF6'] = customers['skusFirst6'] / customers['avgNoSkusPerDay']
In [8]:
# customer basket value and temporal changes in value
customers['avgValuePerDay'] = (
        invoices.groupby(invoices.index)['LocalGross']
        .aggregate(np.average)
        )
customers['SDValue'] = (
        invoices.groupby(invoices.index)['LocalGross']
        .aggregate(np.std)
        )
customers['valueLast1'] = averageLastN(1, 'LocalGross')
customers['valueLast2'] = averageLastN(2, 'LocalGross')
customers['valueLast3'] = averageLastN(3, 'LocalGross')
customers['valueLast4'] = averageLastN(4, 'LocalGross')
customers['valueLast5'] = averageLastN(5, 'LocalGross')
customers['valueLast6'] = averageLastN(6, 'LocalGross')
customers['valueChangeL1'] = customers['valueLast1'] / customers['avgValuePerDay']
customers['valueChangeL2'] = customers['valueLast2'] / customers['avgValuePerDay']
customers['valueChangeL3'] = customers['valueLast3'] / customers['avgValuePerDay']
customers['valueChangeL4'] = customers['valueLast4'] / customers['avgValuePerDay']
customers['valueChangeL5'] = customers['valueLast5'] / customers['avgValuePerDay']
customers['valueChangeL6'] = customers['valueLast6'] / customers['avgValuePerDay']
customers['valueFirst1'] = averageFirstN(1, 'LocalGross')
customers['valueFirst2'] = averageFirstN(2, 'LocalGross')
customers['valueFirst3'] = averageFirstN(3, 'LocalGross')
customers['valueFirst4'] = averageFirstN(4, 'LocalGross')
customers['valueFirst5'] = averageFirstN(5, 'LocalGross')
customers['valueFirst6'] = averageFirstN(6, 'LocalGross')
customers['valueChangeF1'] = customers['valueFirst1'] / customers['avgValuePerDay']
customers['valueChangeF2'] = customers['valueFirst2'] / customers['avgValuePerDay']
customers['valueChangeF3'] = customers['valueFirst3'] / customers['avgValuePerDay']
customers['valueChangeF4'] = customers['valueFirst4'] / customers['avgValuePerDay']
customers['valueChangeF5'] = customers['valueFirst5'] / customers['avgValuePerDay']
customers['valueChangeF6'] = customers['valueFirst6'] / customers['avgValuePerDay']

Target feature definition

The target feature, 'churn', is defined probabilistically: when the interval between the last invoice and today is greater than the 99th percentile of the corresponding exponential probability distribution function.

In [9]:
customers['churned'] = pd.Timestamp.today() > (
        customers['lastSale'] + pd.to_timedelta(((-1 * customers['avgDaysBetweenInvoices']) * np.log(0.01)), unit='d')
        )

Subsample definition

Given the business's operating context, the model is restricted to predicting churn amongst customers who have received more than two orders.

In [10]:
modelData = customers[customers['noDaysInvoiced'] > 2][[
        'churned',
        'daysLifetime', 'noDaysInvoiced',
        'avgDaysBetweenInvoices',
        'intervalLast1', 'intervalLast2', 'intervalLast3',
        'intervalLast4', 'intervalLast5', 'intervalLast6',
        'intervalFirst2', 'intervalFirst3',
        'intervalFirst4', 'intervalFirst5', 'intervalFirst6',
        'intervalChangeL1', 'intervalChangeL2', 'intervalChangeL3',
        'intervalChangeL4', 'intervalChangeL5', 'intervalChangeL6',
        'intervalChangeF2', 'intervalChangeF3',
        'intervalChangeF4', 'intervalChangeF5','intervalChangeF6',
        'avgNoSkusPerDay',
        'skusLast1', 'skusLast2', 'skusLast3',
        'skusLast3', 'skusLast4', 'skusLast6',
        'skusFirst1', 'skusFirst2', 'skusFirst3',
        'skusFirst3', 'skusFirst4', 'skusFirst6',
        'skusChangeL1', 'skusChangeL2', 'skusChangeL3',
        'skusChangeL4', 'skusChangeL5', 'skusChangeL6',
        'skusChangeF1', 'skusChangeF2', 'skusChangeF3',
        'skusChangeF4', 'skusChangeF5', 'skusChangeF6',
        'avgValuePerDay',
        'valueLast1', 'valueLast2', 'valueLast3',
        'valueLast4', 'valueLast5', 'valueLast6',
        'valueFirst1', 'valueFirst2', 'valueFirst3',
        'valueFirst4', 'valueFirst5', 'valueFirst6',
        'valueChangeL1', 'valueChangeL2', 'valueChangeL3',
        'valueChangeL4', 'valueChangeL5', 'valueChangeL6',
        'valueChangeF1', 'valueChangeF2', 'valueChangeF3',
        'valueChangeF4', 'valueChangeF5', 'valueChangeF6',
        'BusinessCategory']]

Data preprocessing

In [11]:
modelData = modelData.fillna(1) # intervals greater than n (average / average = 1)
dummyData = pd.get_dummies(modelData)
data_features = dummyData.drop('churned', axis=1)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
        data_features, dummyData['churned'].values, random_state=0)

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Parameter tuning and model selection

In [12]:
from sklearn.model_selection import GridSearchCV
In [13]:
from sklearn.svm import SVC
param_grid = {'C' : [2, 3, 4],
              'gamma' : [0.001, 0.01, 0.1]}
grid_search = GridSearchCV(SVC(kernel='rbf', random_state=0), param_grid, cv=5,
                           return_train_score=True)
grid_search.fit(X_train_scaled, y_train)
print("Test set score: {:.2f}".format(grid_search.score(X_test_scaled, y_test)))
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))

svm = (
    SVC(kernel='rbf', C=3, gamma=0.01, random_state=0)
    .fit(X_train_scaled, y_train)
    )
Test set score: 0.73
Best parameters: {'C': 3, 'gamma': 0.01}
Best cross-validation score: 0.72
In [14]:
# random forest
from sklearn.ensemble import RandomForestClassifier
param_grid = {'n_estimators' : [4000, 5000, 6000]}
grid_search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5,
                           return_train_score=True)
grid_search.fit(X_train_scaled, y_train)
print("Test set score: {:.2f}".format(grid_search.score(X_test_scaled, y_test)))
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))

forest = (
    RandomForestClassifier(n_estimators=5000, random_state=0)
    .fit(X_train_scaled, y_train)
    )
Test set score: 0.74
Best parameters: {'n_estimators': 5000}
Best cross-validation score: 0.74
In [15]:
# gradient boosted regression tree
from sklearn.ensemble import GradientBoostingClassifier
param_grid = {'n_estimators' : [400, 600, 800],
              'max_depth' : [2, 3, 4],
              'learning_rate' : [0.100, 0.125, 0.150]}
grid_search = GridSearchCV(GradientBoostingClassifier(random_state=0), param_grid, cv=5,
                           return_train_score=True)
grid_search.fit(X_train_scaled, y_train)
print("Test set score: {:.2f}".format(grid_search.score(X_test_scaled, y_test)))
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))

gbrt = (
    GradientBoostingClassifier(n_estimators=400, max_depth=2, learning_rate=0.1, random_state=0)
    .fit(X_train_scaled, y_train)
    )
Test set score: 0.74
Best parameters: {'learning_rate': 0.1, 'max_depth': 2, 'n_estimators': 400}
Best cross-validation score: 0.74
In [16]:
# compare area under receiver operating characteristics curves
from sklearn.metrics import roc_auc_score
svm_auc = roc_auc_score(y_test, svm.decision_function(X_test_scaled))
forest_auc = roc_auc_score(y_test, forest.predict_proba(X_test_scaled)[:, 1])
gbrt_auc = roc_auc_score(y_test, gbrt.predict_proba(X_test_scaled)[:, 1])
print("AUC for Support Vector Machine: {:.3f}".format(svm_auc))
print("AUC for Random Forest: {:.3f}".format(forest_auc))
print("AUC for Gradient Bossted Regression Tree: {:.3f}".format(gbrt_auc))
AUC for Support Vector Machine: 0.746
AUC for Random Forest: 0.758
AUC for Gradient Bossted Regression Tree: 0.762

Feature importances

Three models were used to inductively select features from the excess of features tested:

  • lasso regression
  • random forest
  • gradient boosted regression tree

Using the results of these models, a smaller feature could be extracted for:

  • reduced computational expense
  • accelerated training time
  • reduced risk of over-fitting on the training dataset
In [17]:
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.6, max_iter=100000, random_state=0).fit(X_train, y_train)

FeatureImportancesLasso = (
        pd.DataFrame(data={'Absolute coefficient' : abs(lasso.coef_[lasso.coef_ != 0])},
                     index=data_features.columns.values[lasso.coef_ != 0])
        ).sort_values('Absolute coefficient', ascending=False)
FeatureImportancesLasso
Out[17]:
Absolute coefficient
avgDaysBetweenInvoices 0.001525
intervalLast1 0.000505
valueLast3 0.000221
valueLast6 0.000158
daysLifetime 0.000156
avgValuePerDay 0.000105
intervalFirst6 0.000060
valueFirst6 0.000029
valueFirst1 0.000028
noDaysInvoiced 0.000012
valueFirst3 0.000009
valueFirst2 0.000006
In [18]:
FeatureImportancesForest = pd.DataFrame(data={'Importance' : forest.feature_importances_},
             index=data_features.columns.values).sort_values('Importance', ascending=False)
FeatureImportancesForest.head(15)
Out[18]:
Importance
daysLifetime 0.036734
valueLast1 0.036690
valueLast2 0.028670
BusinessCategory_ 0.025009
valueLast3 0.021892
valueChangeL2 0.020484
valueLast6 0.018640
valueLast4 0.018122
valueChangeL3 0.016776
valueChangeL1 0.016649
intervalChangeL1 0.016649
valueLast5 0.016203
intervalChangeL5 0.015332
intervalChangeL2 0.015105
noDaysInvoiced 0.014767
In [19]:
FeatureImportancesGBRT = pd.DataFrame(data={'Importance' : gbrt.feature_importances_},
             index=data_features.columns.values).sort_values('Importance', ascending=False)
FeatureImportancesGBRT.head(15)
Out[19]:
Importance
valueLast1 0.098200
daysLifetime 0.089764
BusinessCategory_ 0.082288
valueLast2 0.038418
intervalChangeL6 0.030150
intervalChangeL1 0.029289
valueFirst2 0.024022
valueLast6 0.022742
valueChangeL3 0.022385
skusChangeL5 0.022331
avgDaysBetweenInvoices 0.022030
valueChangeL2 0.017458
intervalChangeL2 0.017151
skusChangeF2 0.016963
intervalChangeF2 0.016738

Validation with selected features

Using the results above, the random forest model was re-trained on a significantly smaller feature set—46 vs. 102 features, including one-hot encoded features—without a loss in accuracy.

In [20]:
modelData = customers[customers['noDaysInvoiced'] > 2][[
        'churned',
        'daysLifetime', 'noDaysInvoiced',
        'avgDaysBetweenInvoices',
        'intervalLast1',
        'intervalFirst3', 'intervalFirst6',
        'intervalChangeL1', 'intervalChangeL6',
        'avgNoSkusPerDay',
        'skusChangeF1', 'skusChangeL1',
        'avgValuePerDay',
        'valueLast1', 'valueLast3', 'valueLast6',
        'valueFirst1', 'valueFirst6',
        'valueChangeL1', 'valueChangeL3',
        'BusinessCategory']]
modelData = modelData.fillna(1)
dummyData = pd.get_dummies(modelData)
data_features = dummyData.drop('churned', axis=1)
X_train, X_test, y_train, y_test = train_test_split(
        data_features, dummyData['churned'].values, random_state=0)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

forest.fit(X_train_scaled, y_train)
print("Accuracy on test set: {:.3f}".format(forest.score(X_test_scaled, y_test)))
Accuracy on test set: 0.760