Introduction

The data come from direct marketing campaigns of a Portuguese banking institution, conducted via phone calls. Often, more than one contact with the same client was required to determine whether the client would subscribe to the bank’s term deposit.

The classification goal is to predict whether the client will subscribe (yes/no) to a term deposit (variable y).

Data: Bank Marketing Data Set

Attributes: GitHub

Import data and libraries

setwd("D:/et4_e")

#Import the bank marketing data
bank = read.table("bank.csv",sep=";",header=TRUE)

library(ggplot2)
library(dplyr)
library(gridExtra)

#Convert from character to factor
bank[sapply(bank, is.character)] <- lapply(bank[sapply(bank, is.character)], 
                                           as.factor)
dim(bank)
## [1] 4521   17

There are 4,521 total observations and 17 variables in the bank data.

In the following sections, the data will be analyzed and interpreted using different models: Decision Tree, Logistic Regression, and Random Forest. The models’ predictive performances will then be compared.

# Remove unknown values
bank1 = bank
bank1[bank1=="unknown"] <- NA
sum(is.na(bank1))
## [1] 5254
bank1 <- na.omit(bank1)

Before starting the analysis, observations with “unknown” values were removed from the data, since they would otherwise affect the output and bias it.
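To see which columns contribute these missing values, a quick check like the following could be run on the original data before dropping any rows (a sketch; output not shown):

# Count "unknown" values in each column of the original data
colSums(bank == "unknown")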

Certain patterns emerge within the remaining 764 observations.

Since variables such as duration obviously have a strong effect on the subscription choice, the focus will be on finding patterns among the data’s less obvious variables.

Variables such as age, job, marital status, education level, and the number of contacts with the client during the campaign are explored to see their effects on the term deposit subscription choice.


Analysis of data with specific variables

# Checking if there are patterns between variables that affects Yes & No

ggplot(bank1, aes(x= age, fill = y)) + geom_boxplot(alpha = .3) + ggtitle("Age affecting subscription of term deposit")

Age appears to be only slightly significant in affecting the predicted value.


ggplot(bank1, aes(x=job, fill=y)) + geom_bar() + ggtitle("Jobs affecting subscription of term deposit")

Jobs show a similar, even pattern across y, so job is not a significant indicator.


ggplot(bank1, aes(x= marital, fill = y)) + geom_bar() + ggtitle("Marital status affecting subscription of term deposit")

There is no significant difference in pattern between marital status and y, so it is not a significant indicator either.


ggplot(bank1, aes(x=education, fill = y)) + geom_bar() + ggtitle("Education level affecting subscription of term deposit")

Education shows a similarly even pattern and is not a significant indicator.


ggplot(bank1, aes(x= duration, fill = y)) + geom_density(alpha = 0.2) + ggtitle("Duration affecting subscription of term deposit")

Benchmarking purpose: the longer the call, the higher the chance of the client saying yes.
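To back up the density plot with a number, call durations could be compared across the two outcomes (a quick sketch using dplyr; output not shown):

# Median call duration for subscribers vs non-subscribers
bank1 %>%
  group_by(y) %>%
  summarise(median_duration = median(duration))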


ggplot(bank1, aes(x= campaign, fill = y)) + geom_bar() + xlim(0,20) + ggtitle("Subscribed term deposits by number of campaigns")

A significant result: if a person says yes, it will probably happen within the first four contacts of the campaign.


Observations

Taking job as an example, variables such as age, marital status, education level, and the number of contacts with the client during the campaign all show nearly the same pattern across subscription choices. Hence, those variables are not significant enough to serve as indicators.

However, the campaign variable shows one notable pattern: if an individual subscribes to the term deposit, it is likely to happen within the first three contacts.
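This pattern could be quantified with a quick dplyr summary (a sketch; the column names are illustrative and output is not shown):

# Subscription rate by number of contacts during the campaign
bank1 %>%
  group_by(campaign) %>%
  summarise(clients = n(), subscription_rate = mean(y == "yes"))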

Next, we will observe whether different classification algorithms yield any interesting interpretations.


Hold-out Validation method

# Hold-out Validation method
set.seed(123)
library(caTools)

sample = sample.split(bank1$y, SplitRatio = 0.7)
train = subset(bank1, sample == TRUE)
test = subset(bank1, sample == FALSE)

print(dim(train)); print(dim(test))
## [1] 535  17
## [1] 229  17

The training set contains 70% of the data (535 observations, 17 variables), and the test set contains the remaining 30% (229 observations, 17 variables).

prop.table(table(train$y))
prop.table(table(test$y))

Both sets preserve the original class proportions: roughly 77% of clients did not subscribe. A model that always predicts “no” would therefore be correct about 77% of the time, which is the baseline accuracy to beat.
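A minimal sketch of that baseline computation:

# Baseline accuracy: always predict the majority class ("no")
max(prop.table(table(train$y)))  # roughly 0.77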


Logistic Regression

As a classification algorithm, logistic regression models the relationship between the predictor variables and the probability of a particular outcome (Agrawal, 2017).
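Concretely, the model passes a linear combination of the predictors through the logistic function to obtain a probability between 0 and 1. A minimal sketch of that function:

# Logistic function: maps the linear predictor eta = b0 + b1*x1 + ...
# to a probability in (0, 1)
logistic <- function(eta) 1 / (1 + exp(-eta))
logistic(0)  # 0.5: a zero linear predictor means a 50/50 chance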

# Logistic Regression
bank_glm = glm(y ~ . , family="binomial", data = train)
summary(bank_glm)
## 
## Call:
## glm(formula = y ~ ., family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7109  -0.5040  -0.3183  -0.1662   2.8710  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -5.259e+00  1.405e+00  -3.742 0.000182 ***
## age                 1.629e-03  1.680e-02   0.097 0.922777    
## jobblue-collar      2.352e-01  5.664e-01   0.415 0.677986    
## jobentrepreneur     1.489e-01  9.781e-01   0.152 0.878982    
## jobhousemaid       -1.373e+00  1.345e+00  -1.021 0.307394    
## jobmanagement       2.313e-01  5.869e-01   0.394 0.693492    
## jobretired          2.663e-01  7.855e-01   0.339 0.734594    
## jobself-employed   -2.703e-01  8.477e-01  -0.319 0.749856    
## jobservices         1.045e+00  6.205e-01   1.684 0.092218 .  
## jobstudent          1.742e+00  7.948e-01   2.192 0.028402 *  
## jobtechnician       2.576e-01  5.160e-01   0.499 0.617593    
## jobunemployed      -5.495e-01  1.035e+00  -0.531 0.595393    
## maritalmarried     -4.462e-01  4.399e-01  -1.014 0.310415    
## maritalsingle      -4.582e-01  5.043e-01  -0.909 0.363500    
## educationsecondary  9.819e-01  5.717e-01   1.717 0.085890 .  
## educationtertiary   1.355e+00  6.287e-01   2.155 0.031162 *  
## defaultyes          9.119e-01  1.381e+00   0.660 0.509189    
## balance             5.704e-05  4.684e-05   1.218 0.223354    
## housingyes         -3.473e-01  3.346e-01  -1.038 0.299317    
## loanyes             2.222e-01  4.472e-01   0.497 0.619198    
## contacttelephone   -3.103e-01  5.369e-01  -0.578 0.563349    
## day                 2.952e-02  2.058e-02   1.435 0.151410    
## monthaug            2.442e+00  6.685e-01   3.652 0.000260 ***
## monthdec            1.059e+00  9.884e-01   1.071 0.284173    
## monthfeb            1.359e+00  6.577e-01   2.067 0.038770 *  
## monthjan            6.005e-01  6.485e-01   0.926 0.354493    
## monthjul            1.026e+00  8.345e-01   1.230 0.218781    
## monthjun            2.094e+00  7.505e-01   2.791 0.005261 ** 
## monthmar            1.905e+00  1.125e+00   1.694 0.090334 .  
## monthmay            4.048e-01  5.132e-01   0.789 0.430220    
## monthnov            8.334e-01  5.900e-01   1.413 0.157774    
## monthoct            1.836e+00  7.144e-01   2.570 0.010156 *  
## monthsep            1.430e+00  7.892e-01   1.812 0.070049 .  
## duration            3.321e-03  4.999e-04   6.642 3.09e-11 ***
## campaign           -6.889e-02  1.076e-01  -0.640 0.522006    
## pdays               9.220e-04  1.259e-03   0.732 0.464067    
## previous            2.134e-02  4.413e-02   0.484 0.628677    
## poutcomeother       2.749e-01  3.267e-01   0.841 0.400159    
## poutcomesuccess     2.567e+00  3.761e-01   6.825 8.78e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 569.56  on 534  degrees of freedom
## Residual deviance: 380.75  on 496  degrees of freedom
## AIC: 458.75
## 
## Number of Fisher Scoring iterations: 5
## duration, poutcomesuccess, and several month levels (monthaug, monthjun, monthoct) are the most significant variables.

We can see the few significant variables that affect our model: the outcome of the previous marketing campaign, the last contact duration and the last contact month of the year.
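Rather than scanning the full table, the significant rows could also be filtered out programmatically (a convenience sketch; the 0.05 threshold is a conventional choice):

# Keep only the coefficients with p-values below 0.05
coefs <- summary(bank_glm)$coefficients
coefs[coefs[, "Pr(>|z|)"] < 0.05, ]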

# Predicting with trainset
PredTrain = predict(bank_glm, newdata = train, type = "response")
# Confusion matrix
table(train$y, PredTrain >= 0.5)
##      
##       FALSE TRUE
##   no    390   25
##   yes    58   62
(390+62)/nrow(train)  #Accuracy - 84%
## [1] 0.8448598
# Predicting using testset
PredTest = predict(bank_glm, newdata = test, type = "response")
# Confusion matrix
table(test$y, PredTest >= 0.5)
##      
##       FALSE TRUE
##   no    163   15
##   yes    21   30
(163+30)/nrow(test)  #Accuracy - 84%
## [1] 0.8427948
## Baseline accuracy is 77%; both training and test sets reach about 84%.
## Overall, the logistic regression model beats the baseline and is a good fit.

The training set’s confusion matrix shows that the model correctly predicted 390 non-subscribers and 62 subscribers, but misclassified 25 non-subscribers as subscribers and 58 subscribers as non-subscribers.

Using both the train and test sets, we can see that the logistic regression model achieves an approximate accuracy of 84%, against a baseline of 77%.

Overall, the logistic regression model is more accurate than the baseline on both the train and test datasets. Hence the model is a good fit.
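Note that the 0.5 probability cutoff used in the confusion matrices above is itself a modeling choice. If catching more subscribers matters more than overall accuracy, a lower cutoff could be checked the same way (a sketch; output not shown):

# A lower cutoff flags more clients as likely subscribers
table(test$y, PredTest >= 0.3)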



Decision Tree (CART)

CART (Classification and Regression Trees) is a decision tree algorithm used for classification or regression predictive modeling (Brownlee, 2016).
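For classification, rpart chooses splits that reduce node impurity, by default measured with the Gini index. A minimal sketch of that measure:

# Gini impurity of a node given its class proportions p
gini <- function(p) 1 - sum(p^2)
gini(c(0.78, 0.22))  # impurity of the root node's roughly 78/22 class mix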

# Decision Tree - CART
set.seed(123)
library(rpart)
library(rpart.plot)
tree1 <- rpart(y~., data = train, method = 'class')
rpart.plot(tree1, extra = "auto")

From the top: the overall probability of subscribing to a term deposit is 22%. The root node asks whether the outcome of the previous marketing campaign was a failure.

If yes, we move left to the second node: 86% of clients had a failed previous campaign outcome, and their probability of subscribing is 15%. At that node, we then check whether the last call duration was less than 635 seconds.

If it was 635 seconds or more, we move right to the third node: 7% of clients had such a long call, and their probability of subscribing is 63%.

We then check whether the client’s job falls among the categories listed at that node; if it does not, the probability of subscribing is 33%.

We can keep following this pattern down the tree to understand which variables impact the likelihood of subscribing to a term deposit.
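Instead of tracing the plot node by node, the fitted tree’s decision paths can also be printed as plain rules with rpart.plot’s rule printer (output not shown):

# Print the tree's decision paths as readable rules
rpart.rules(tree1)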


pred_new <- predict(tree1, test, type="class")
table_1 <- table(test$y,pred_new)
table_1
##      pred_new
##        no yes
##   no  155  23
##   yes  25  26
accuracytest <- sum(diag(table_1))/sum(table_1)
print(paste('Test Accuracy', accuracytest))
## [1] "Test Accuracy 0.790393013100437"

The model correctly predicted 155 non-subscribers and 26 subscribers, but misclassified 25 subscribers as non-subscribers and 23 non-subscribers as subscribers.

However, the test accuracy is still relatively high, at around 79%.


accuracytuned <- function(tree1) {
  pred_new <- predict(tree1, test, type = 'class')
  table_1 <- table(test$y, pred_new)
  accuracytest <- sum(diag(table_1)) / sum(table_1)
  accuracytest
}

control <- rpart.control(minsplit = 4,
                         minbucket = round(4/ 3),
                         maxdepth = 9,
                         cp = 0)
tuned_mod <- rpart(y~., data = train, method = 'class', control = control)
accuracytuned(tuned_mod)
## [1] 0.7991266
# Tuned parameters can lead to higher accuracy

The tree can be tuned further: adjusting parameters such as minsplit, minbucket, maxdepth, and cp can yield higher accuracy.
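One common tuning approach is to grow a deep tree and then prune it back at the complexity parameter (cp) value with the lowest cross-validated error. A sketch of that idea, reusing the tuned model above (the chosen cp depends on the run):

# Inspect cross-validated error per cp value, then prune at the best cp
printcp(tuned_mod)
best_cp <- tuned_mod$cptable[which.min(tuned_mod$cptable[, "xerror"]), "CP"]
pruned_mod <- prune(tuned_mod, cp = best_cp)
accuracytuned(pruned_mod)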


Random Forest

# Random Forest
library(randomForest)
set.seed(123)

# making the models with 100 trees
subscribed_rf <- randomForest(y~., data=train, importance=T, ntree=100)

importance(subscribed_rf)
##                   no        yes MeanDecreaseAccuracy MeanDecreaseGini
## age        1.6143450 -0.4551771            1.2397005       13.3604446
## job        0.7143156 -0.3070630            0.5237337       15.2407836
## marital    2.3532654 -0.4536149            1.8026621        4.6464370
## education  3.5568035 -1.8061264            2.0763904        3.7386563
## default    0.9790454 -1.0050378            0.4053407        0.3960559
## balance    0.7375421  0.4979873            1.0017616       14.2649913
## housing    2.3924199  2.1850122            3.1050201        2.8156016
## loan      -1.5801904  0.7407660           -0.5585783        0.8646363
## contact   -1.1543406  0.3268998           -0.6457181        0.6403557
## day        2.9265880 -0.4016363            2.5490644       12.0936509
## month      6.4772399  4.5769447            7.6933475       24.1711343
## duration   8.2638476 14.4117347           14.8151026       42.2763029
## campaign  -1.7458787 -0.1415128           -1.2743314        3.9368369
## pdays      5.6235433  6.1413998            7.2116391       18.9983892
## previous   0.1319018  1.0675435            0.7706671        7.8994696
## poutcome   9.4229030 13.5160211           14.8236041       21.6955763
varImpPlot(subscribed_rf)

# Duration and month are important predictors of whether customers subscribe.

Running the code above, the Random Forest model shows the significance of each predictor for the subscription choice. The last contact duration is the most important predictor, followed by the last contact month of the year and the previous marketing campaign’s outcome.
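To see how a single important predictor influences the prediction, randomForest’s partial dependence plot could be used (a sketch; output not shown):

# Partial dependence of the "yes" class on call duration
partialPlot(subscribed_rf, pred.data = train, x.var = "duration",
            which.class = "yes")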

newtest <- predict(subscribed_rf,test)
error <- mean(newtest != test$y)
print(paste('Accuracy',1-error))
## [1] "Accuracy 0.842794759825327"
# The model is almost 84% accurate.

The Random Forest method also reaches an accuracy of about 84%.
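For a closer look than the single accuracy figure, the same confusion-matrix check used for the earlier models applies here as well (output not shown):

# Confusion matrix for the random forest on the test set
table(test$y, newtest)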


Conclusion

Comparing all the models, Logistic Regression has an accuracy of 84%, the Decision Tree (CART) 79%, and Random Forest 84%.

Therefore, Logistic Regression and Random Forest would both be ideal predictive models for this classification task.

Both models indicate the same few significant variables impacting the prediction of a term deposit subscription: the last contact duration, the outcome of the previous marketing campaign, and the last contact month of the year.
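For reference, the reported test-set accuracies can be gathered into one small table (values taken from the sections above):

# Summary of test accuracies across the three models
data.frame(model = c("Logistic Regression", "Decision Tree (CART)", "Random Forest"),
           accuracy = c(0.84, 0.79, 0.84))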



© Copyright. Evangeline Tan 2021.