The purpose of this notebook is to document, reproducibly and defensibly, the development of a model to predict customer churn at Two Providores. Exploratory data analysis (EDA) and the production implementation are not documented here.
Two Providores is a business-to-business food distributor. Because the company records all transactions against customer accounts, the start of the customer relationship is clearly defined. For completeness and relevance, this model includes only those customers who started purchasing from the business on or after 1 January 2008.
The end of the customer relationship, by contrast, is not always so clearly defined. In this model, customer 'churn' is therefore defined probabilistically: a customer is considered to have churned when the interval between their last invoice and the present date exceeds the 99th percentile of an exponential distribution whose mean is that customer's average interval between invoices.
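As a worked illustration of this threshold (a minimal sketch using a hypothetical average interval, not a figure drawn from the Two Providores data): for an exponential distribution with mean mu, the 99th percentile is -mu * ln(0.01), roughly 4.6 times the customer's average gap between invoices.
import numpy as np
mu = 7.0                             # hypothetical average days between invoices
churnThreshold = -mu * np.log(0.01)  # 99th percentile of an exponential with mean mu: ~32.2 days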
It is worth noting here that instances of 'involuntary' customer churn—for example where a business closed, defaulted on payment, or became insolvent—have been removed from the dataset on which this model is trained.
The possible feature set for model training was significantly restricted due to longstanding organisational practices at Two Providores of deleting data associated with old suppliers, products, customers, and customer contacts. One attribute that remains—customer pricing group—is strongly correlated with the time at which the customer relationship started, and has been excluded from the model to prevent overfitting on the training dataset.
Thus, for this model, the feature set consisted of: the customer's lifetime (days between first and last sale, and the number of days on which the customer was invoiced), ordering frequency (days between invoices), basket size (distinct SKUs per invoice), basket value (gross invoice value), and business category.
For the temporal and quantitative attributes described above, features were measured over a range of intervals for each customer (at the start of the relationship, on average over the customer's lifetime, and most recently), and the most relevant intervals were selected inductively from the model results (more below).
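To illustrate the 'change' features this produces (a minimal sketch with hypothetical figures, not drawn from the sales data): a customer whose gap between invoices has stretched from a lifetime average of 10 days to a recent average of 14 days yields a change ratio of 1.4.
import numpy as np
# hypothetical inter-invoice intervals (days) for one customer, oldest to newest
intervals = np.array([7, 6, 8, 7, 12, 14, 16])
lifetimeAvg = intervals.mean()               # lifetime average interval: 10.0 days
last3Avg = intervals[-3:].mean()             # average of the three most recent intervals: 14.0 days
intervalChangeL3 = last3Avg / lifetimeAvg    # 1.4; ratios above 1 indicate slowing ordering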
The model is trained on a rectangular dataset, built in several steps to limit computational expense:
import warnings
warnings.filterwarnings('ignore')
# SQL pulls
import pyodbc
import pandas as pd
connectionString = """DRIVER={SQL Server};
SERVER=****;
DATABASE=****;
UID=****;
PWD=****"""
allCustomerDataQuery = """SELECT RTRIM(Customers.UniqueID) AS CustomerID
, RTRIM(Customers.CustomerTitle) AS CustomerTitle
, RTRIM(Customers.ZCCategory) AS BusinessCategory
FROM Customers;"""
conn = pyodbc.connect(connectionString)
allCustomerData = pd.read_sql_query(allCustomerDataQuery, conn)
conn.close()
salesDataQuery = """SET NOCOUNT ON;
DECLARE @Customers table(
CustomerID char(17)
);
INSERT INTO @Customers
SELECT AccountID
FROM (
SELECT DISTINCT RTRIM(TransHeaders.AccountID) AS AccountID
, RTRIM(Customers.CustomerTitle) AS [Customer title]
, MIN(TransHeaders.TradingDate) OVER (PARTITION BY TransHeaders.AccountID) AS [First sale]
, MAX(TransHeaders.TradingDate) OVER (PARTITION BY TransHeaders.AccountID) AS [Last sale]
FROM TransHeaders
INNER JOIN Customers ON Customers.UniqueID = TransHeaders.AccountID
WHERE TransHeaders.TransactionType IN ('CI', 'CX')
AND Customers.ZCCategory != 'Staff'
AND (Customers.ZReasonInactive IS NULL
OR Customers.ZReasonInactive = 'Customer stopped buying')
) AS DistinctCustomers
WHERE [First sale] >= CONVERT(date, '2008-01-01')
ORDER BY AccountID;
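-- pull line-level sales (customer, product, gross value, transaction, trading date)
-- for the eligible customers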
SELECT RTRIM(TransHeaders.AccountID) AS CustomerID
, RTRIM(TransDetails.ItemAcID) AS ProductID
, TransDetails.LocalGross * -1 AS LocalGross
, RTRIM(TransHeaders.TransactionID) AS TransactionID
, TransHeaders.TradingDate
FROM TransDetails
INNER JOIN TransHeaders ON TransHeaders.TransactionID = TransDetails.TransactionID
AND TransHeaders.CurrentVersionNo = TransDetails.CurrentVersionNo
WHERE TransHeaders.TransactionType IN ('CI', 'CX')
AND TransHeaders.AccountID IN (
SELECT CustomerID FROM @Customers
);"""
conn = pyodbc.connect(connectionString)
salesData = pd.read_sql_query(salesDataQuery, conn)
conn.close()
# transaction data aggregated and cached in invoices dataframe
import numpy as np
indexLevels = (
salesData[['CustomerID', 'TradingDate']]
.drop_duplicates()
.sort_values(['CustomerID', 'TradingDate'])
)
invoices = (
pd.DataFrame(index=[indexLevels['CustomerID'], indexLevels['TradingDate']])
)
invoices['LocalGross'] = (
salesData.sort_values('TradingDate')
.groupby(['CustomerID', 'TradingDate'])['LocalGross']
.aggregate(np.sum)
)
invoices['noSKUs'] = (
salesData.sort_values('TradingDate')
.groupby(['CustomerID', 'TradingDate'])['ProductID']
.nunique()
)
invoices = invoices.reset_index(level=1)
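# days since the customer's previous invoice (NaN for each customer's first invoice)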
invoices['DaysBetweenInvoices'] = (
(invoices['TradingDate'] - invoices.groupby(invoices.index)['TradingDate'].shift(+1))
.dt.days
)
# customers dataframe built from allCustomerData and invoices dataframes
customers = (
pd.DataFrame(index=
salesData['CustomerID']
.drop_duplicates()
.sort_values()
.values)
.merge(allCustomerData, how='inner', right_on='CustomerID', left_index=True)
.set_index('CustomerID')
)
customers['firstSale'] = (
invoices.groupby(invoices.index)['TradingDate']
.aggregate(np.min)
)
customers['lastSale'] = (
invoices.groupby(invoices.index)['TradingDate']
.aggregate(np.max)
)
customers['daysLifetime'] = (
((customers['lastSale'] - customers['firstSale']).dt.days) + 1
)
customers['noDaysInvoiced'] = (
invoices.groupby(invoices.index)['TradingDate']
.nunique()
)
As described above, this model was developed iteratively: an excess of candidate features was generated from the raw data, several models were trained on the full feature set, and the most informative features were then selected from those models' results for retraining. To facilitate the production of this excess of features, the following two functions are used extensively to produce sequences of summary statistics.
def averageLastN(n, column):
return (invoices
.groupby(invoices.index)
.apply(pd.DataFrame.nlargest, n, 'TradingDate')[column]
.groupby(level=0)
.aggregate(np.nanmean))
def averageFirstN(n, column):
return (invoices
.groupby(invoices.index)
.apply(pd.DataFrame.nsmallest, n, 'TradingDate')[column]
.groupby(level=0)
.aggregate(np.nanmean))
# customer ordering frequency and temporal changes in frequency
customers['avgDaysBetweenInvoices'] = (
invoices.groupby(invoices.index)['DaysBetweenInvoices']
.aggregate(np.mean)
)
customers['avgDaysBetweenInvoices'] = (
customers.groupby('BusinessCategory')['avgDaysBetweenInvoices']
.transform(lambda x: x.fillna(x.mean()))
)
customers['SDDaysBetweenInvoices'] = (
invoices.groupby(invoices.index)['DaysBetweenInvoices']
.aggregate(np.std)
)
customers['intervalLast1'] = averageLastN(1, 'DaysBetweenInvoices')
customers['intervalLast2'] = averageLastN(2, 'DaysBetweenInvoices')
customers['intervalLast3'] = averageLastN(3, 'DaysBetweenInvoices')
customers['intervalLast4'] = averageLastN(4, 'DaysBetweenInvoices')
customers['intervalLast5'] = averageLastN(5, 'DaysBetweenInvoices')
customers['intervalLast6'] = averageLastN(6, 'DaysBetweenInvoices')
customers['intervalChangeL1'] = customers['intervalLast1'] / customers['avgDaysBetweenInvoices']
customers['intervalChangeL2'] = customers['intervalLast2'] / customers['avgDaysBetweenInvoices']
customers['intervalChangeL3'] = customers['intervalLast3'] / customers['avgDaysBetweenInvoices']
customers['intervalChangeL4'] = customers['intervalLast4'] / customers['avgDaysBetweenInvoices']
customers['intervalChangeL5'] = customers['intervalLast5'] / customers['avgDaysBetweenInvoices']
customers['intervalChangeL6'] = customers['intervalLast6'] / customers['avgDaysBetweenInvoices']
customers['intervalFirst2'] = averageFirstN(2, 'DaysBetweenInvoices')
customers['intervalFirst3'] = averageFirstN(3, 'DaysBetweenInvoices')
customers['intervalFirst4'] = averageFirstN(4, 'DaysBetweenInvoices')
customers['intervalFirst5'] = averageFirstN(5, 'DaysBetweenInvoices')
customers['intervalFirst6'] = averageFirstN(6, 'DaysBetweenInvoices')
customers['intervalChangeF2'] = customers['intervalFirst2'] / customers['avgDaysBetweenInvoices']
customers['intervalChangeF3'] = customers['intervalFirst3'] / customers['avgDaysBetweenInvoices']
customers['intervalChangeF4'] = customers['intervalFirst4'] / customers['avgDaysBetweenInvoices']
customers['intervalChangeF5'] = customers['intervalFirst5'] / customers['avgDaysBetweenInvoices']
customers['intervalChangeF6'] = customers['intervalFirst6'] / customers['avgDaysBetweenInvoices']
# customer basket size and temporal changes in size
customers['avgNoSkusPerDay'] = (
invoices.groupby(invoices.index)['noSKUs']
.aggregate(np.average)
)
customers['SDNoSkus'] = (
invoices.groupby(invoices.index)['noSKUs']
.aggregate(np.std)
)
customers['skusLast1'] = averageLastN(1, 'noSKUs')
customers['skusLast2'] = averageLastN(2, 'noSKUs')
customers['skusLast3'] = averageLastN(3, 'noSKUs')
customers['skusLast4'] = averageLastN(4, 'noSKUs')
customers['skusLast5'] = averageLastN(5, 'noSKUs')
customers['skusLast6'] = averageLastN(6, 'noSKUs')
customers['skusChangeL1'] = customers['skusLast1'] / customers['avgNoSkusPerDay']
customers['skusChangeL2'] = customers['skusLast2'] / customers['avgNoSkusPerDay']
customers['skusChangeL3'] = customers['skusLast3'] / customers['avgNoSkusPerDay']
customers['skusChangeL4'] = customers['skusLast4'] / customers['avgNoSkusPerDay']
customers['skusChangeL5'] = customers['skusLast5'] / customers['avgNoSkusPerDay']
customers['skusChangeL6'] = customers['skusLast6'] / customers['avgNoSkusPerDay']
customers['skusFirst1'] = averageFirstN(1, 'noSKUs')
customers['skusFirst2'] = averageFirstN(2, 'noSKUs')
customers['skusFirst3'] = averageFirstN(3, 'noSKUs')
customers['skusFirst4'] = averageFirstN(4, 'noSKUs')
customers['skusFirst5'] = averageFirstN(5, 'noSKUs')
customers['skusFirst6'] = averageFirstN(6, 'noSKUs')
customers['skusChangeF1'] = customers['skusFirst1'] / customers['avgNoSkusPerDay']
customers['skusChangeF2'] = customers['skusFirst2'] / customers['avgNoSkusPerDay']
customers['skusChangeF3'] = customers['skusFirst3'] / customers['avgNoSkusPerDay']
customers['skusChangeF4'] = customers['skusFirst4'] / customers['avgNoSkusPerDay']
customers['skusChangeF5'] = customers['skusFirst5'] / customers['avgNoSkusPerDay']
customers['skusChangeF6'] = customers['skusFirst6'] / customers['avgNoSkusPerDay']
# customer basket value and temporal changes in value
customers['avgValuePerDay'] = (
invoices.groupby(invoices.index)['LocalGross']
.aggregate(np.average)
)
customers['SDValue'] = (
invoices.groupby(invoices.index)['LocalGross']
.aggregate(np.std)
)
customers['valueLast1'] = averageLastN(1, 'LocalGross')
customers['valueLast2'] = averageLastN(2, 'LocalGross')
customers['valueLast3'] = averageLastN(3, 'LocalGross')
customers['valueLast4'] = averageLastN(4, 'LocalGross')
customers['valueLast5'] = averageLastN(5, 'LocalGross')
customers['valueLast6'] = averageLastN(6, 'LocalGross')
customers['valueChangeL1'] = customers['valueLast1'] / customers['avgValuePerDay']
customers['valueChangeL2'] = customers['valueLast2'] / customers['avgValuePerDay']
customers['valueChangeL3'] = customers['valueLast3'] / customers['avgValuePerDay']
customers['valueChangeL4'] = customers['valueLast4'] / customers['avgValuePerDay']
customers['valueChangeL5'] = customers['valueLast5'] / customers['avgValuePerDay']
customers['valueChangeL6'] = customers['valueLast6'] / customers['avgValuePerDay']
customers['valueFirst1'] = averageFirstN(1, 'LocalGross')
customers['valueFirst2'] = averageFirstN(2, 'LocalGross')
customers['valueFirst3'] = averageFirstN(3, 'LocalGross')
customers['valueFirst4'] = averageFirstN(4, 'LocalGross')
customers['valueFirst5'] = averageFirstN(5, 'LocalGross')
customers['valueFirst6'] = averageFirstN(6, 'LocalGross')
customers['valueChangeF1'] = customers['valueFirst1'] / customers['avgValuePerDay']
customers['valueChangeF2'] = customers['valueFirst2'] / customers['avgValuePerDay']
customers['valueChangeF3'] = customers['valueFirst3'] / customers['avgValuePerDay']
customers['valueChangeF4'] = customers['valueFirst4'] / customers['avgValuePerDay']
customers['valueChangeF5'] = customers['valueFirst5'] / customers['avgValuePerDay']
customers['valueChangeF6'] = customers['valueFirst6'] / customers['avgValuePerDay']
The target feature, 'churned', follows the definition given above: a customer is flagged as churned when the interval between their last invoice and today exceeds the 99th percentile of an exponential distribution whose mean is the customer's average interval between invoices.
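# flag a customer as churned when the days since their last sale exceed
# -avgDaysBetweenInvoices * ln(0.01), the 99th percentile of an exponential
# distribution with that mean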
customers['churned'] = pd.Timestamp.today() > (
customers['lastSale'] + pd.to_timedelta(((-1 * customers['avgDaysBetweenInvoices']) * np.log(0.01)), unit='d')
)
Given the business's operating context, the model is restricted to predicting churn amongst customers who have been invoiced on more than two days.
modelData = customers[customers['noDaysInvoiced'] > 2][[
'churned',
'daysLifetime', 'noDaysInvoiced',
'avgDaysBetweenInvoices',
'intervalLast1', 'intervalLast2', 'intervalLast3',
'intervalLast4', 'intervalLast5', 'intervalLast6',
'intervalFirst2', 'intervalFirst3',
'intervalFirst4', 'intervalFirst5', 'intervalFirst6',
'intervalChangeL1', 'intervalChangeL2', 'intervalChangeL3',
'intervalChangeL4', 'intervalChangeL5', 'intervalChangeL6',
'intervalChangeF2', 'intervalChangeF3',
'intervalChangeF4', 'intervalChangeF5','intervalChangeF6',
'avgNoSkusPerDay',
'skusLast1', 'skusLast2', 'skusLast3',
'skusLast4', 'skusLast5', 'skusLast6',
'skusFirst1', 'skusFirst2', 'skusFirst3',
'skusFirst4', 'skusFirst5', 'skusFirst6',
'skusChangeL1', 'skusChangeL2', 'skusChangeL3',
'skusChangeL4', 'skusChangeL5', 'skusChangeL6',
'skusChangeF1', 'skusChangeF2', 'skusChangeF3',
'skusChangeF4', 'skusChangeF5', 'skusChangeF6',
'avgValuePerDay',
'valueLast1', 'valueLast2', 'valueLast3',
'valueLast4', 'valueLast5', 'valueLast6',
'valueFirst1', 'valueFirst2', 'valueFirst3',
'valueFirst4', 'valueFirst5', 'valueFirst6',
'valueChangeL1', 'valueChangeL2', 'valueChangeL3',
'valueChangeL4', 'valueChangeL5', 'valueChangeL6',
'valueChangeF1', 'valueChangeF2', 'valueChangeF3',
'valueChangeF4', 'valueChangeF5', 'valueChangeF6',
'BusinessCategory']]
modelData = modelData.fillna(1) # remaining NaNs default to 1 (average / average, i.e. no change)
dummyData = pd.get_dummies(modelData)
data_features = dummyData.drop('churned', axis=1)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
data_features, dummyData['churned'].values, random_state=0)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
param_grid = {'C' : [2, 3, 4],
'gamma' : [0.001, 0.01, 0.1]}
grid_search = GridSearchCV(SVC(kernel='rbf', random_state=0), param_grid, cv=5,
return_train_score=True)
grid_search.fit(X_train_scaled, y_train)
print("Test set score: {:.2f}".format(grid_search.score(X_test_scaled, y_test)))
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))
svm = (
SVC(kernel='rbf', C=3, gamma=0.01, random_state=0)
.fit(X_train_scaled, y_train)
)
# random forest
from sklearn.ensemble import RandomForestClassifier
param_grid = {'n_estimators' : [4000, 5000, 6000]}
grid_search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5,
return_train_score=True)
grid_search.fit(X_train_scaled, y_train)
print("Test set score: {:.2f}".format(grid_search.score(X_test_scaled, y_test)))
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))
forest = (
RandomForestClassifier(n_estimators=5000, random_state=0)
.fit(X_train_scaled, y_train)
)
# gradient boosted regression tree
from sklearn.ensemble import GradientBoostingClassifier
param_grid = {'n_estimators' : [400, 600, 800],
'max_depth' : [2, 3, 4],
'learning_rate' : [0.100, 0.125, 0.150]}
grid_search = GridSearchCV(GradientBoostingClassifier(random_state=0), param_grid, cv=5,
return_train_score=True)
grid_search.fit(X_train_scaled, y_train)
print("Test set score: {:.2f}".format(grid_search.score(X_test_scaled, y_test)))
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))
gbrt = (
GradientBoostingClassifier(n_estimators=400, max_depth=2, learning_rate=0.1, random_state=0)
.fit(X_train_scaled, y_train)
)
# compare area under receiver operating characteristic curves
from sklearn.metrics import roc_auc_score
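# SVC (with the default probability=False) exposes decision_function rather than
# predict_proba; roc_auc_score accepts either decision scores or positive-class probabilities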
svm_auc = roc_auc_score(y_test, svm.decision_function(X_test_scaled))
forest_auc = roc_auc_score(y_test, forest.predict_proba(X_test_scaled)[:, 1])
gbrt_auc = roc_auc_score(y_test, gbrt.predict_proba(X_test_scaled)[:, 1])
print("AUC for Support Vector Machine: {:.3f}".format(svm_auc))
print("AUC for Random Forest: {:.3f}".format(forest_auc))
print("AUC for Gradient Bossted Regression Tree: {:.3f}".format(gbrt_auc))
Three models were used to inductively select features from the excess of features tested: the lasso regression fitted below, which retains non-zero coefficients only for informative features, and the random forest and gradient boosted classifiers fitted above, which expose feature importances. Using the results of these models, a smaller feature set could be selected for retraining.
from sklearn.linear_model import Lasso
lasso = Lasso(alpha=0.6, max_iter=100000, random_state=0).fit(X_train, y_train)
FeatureImportancesLasso = (
pd.DataFrame(data={'Absolute coefficient' : abs(lasso.coef_[lasso.coef_ != 0])},
index=data_features.columns.values[lasso.coef_ != 0])
).sort_values('Absolute coefficient', ascending=False)
FeatureImportancesLasso
FeatureImportancesForest = pd.DataFrame(data={'Importance' : forest.feature_importances_},
index=data_features.columns.values).sort_values('Importance', ascending=False)
FeatureImportancesForest.head(15)
FeatureImportancesGBRT = pd.DataFrame(data={'Importance' : gbrt.feature_importances_},
index=data_features.columns.values).sort_values('Importance', ascending=False)
FeatureImportancesGBRT.head(15)
Using the results above, the random forest model was re-trained on a significantly smaller feature set—46 vs. 102 features, including one-hot encoded features—without a loss in accuracy.
modelData = customers[customers['noDaysInvoiced'] > 2][[
'churned',
'daysLifetime', 'noDaysInvoiced',
'avgDaysBetweenInvoices',
'intervalLast1',
'intervalFirst3', 'intervalFirst6',
'intervalChangeL1', 'intervalChangeL6',
'avgNoSkusPerDay',
'skusChangeF1', 'skusChangeL1',
'avgValuePerDay',
'valueLast1', 'valueLast3', 'valueLast6',
'valueFirst1', 'valueFirst6',
'valueChangeL1', 'valueChangeL3',
'BusinessCategory']]
modelData = modelData.fillna(1)
dummyData = pd.get_dummies(modelData)
data_features = dummyData.drop('churned', axis=1)
X_train, X_test, y_train, y_test = train_test_split(
data_features, dummyData['churned'].values, random_state=0)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
forest.fit(X_train_scaled, y_train)
print("Accuracy on test set: {:.3f}".format(forest.score(X_test_scaled, y_test)))