Advanced Strategies in Tackling Customer Churn: A Banking Case Study
Hello, Data Explorers! Today, on "Through My Data Lens," I delve into a fascinating case study from the banking sector. The focus? Customer churn: a term that might sound disheartening but offers a goldmine of insights when viewed through the right data lens.
In an era where data is king, banks worldwide are turning to advanced analytics to stay ahead in the fiercely competitive market. One crucial area of focus is understanding and mitigating customer churn. In this latest project, "Banking on Loyalty," I dive deep into a bank's dataset to unravel the mysteries of customer behavior and churn. Let us unpack how advanced data analytics can not only predict but also preempt customer churn.
Unearthing Insights
This project takes a comprehensive journey through the data, from exploration to predictive modeling and, finally, to actionable strategies. It is not just numbers and charts; it is a narrative of how people interact with their financial institutions. Hidden within is the story of why some customers stay loyal while others take their business elsewhere.
Data Exploration
Before diving into the analysis, I loaded the customer dataset and conducted an initial exploration. This preliminary step gave me a glimpse into various aspects of the customer base, such as demographics, account details, and churn rates.
Initial exploration revealed key points such as a younger customer base, a significant number of customers with zero balance, and a distribution across various regions, genders, and product usage.
import pandas as pd
# Load the dataset
file_path = 'bank.csv'
data = pd.read_csv(file_path)
# Initial data exploration
data.head()
data.info()
data.describe()
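As a quick sketch of how these observations can be verified (the column names Exited, Balance, Age, Geography, and Gender are taken from the cleaning step that follows):
# Quick checks behind the observations above
print(f"Churn rate: {data['Exited'].mean():.1%}")
print(f"Customers with zero balance: {(data['Balance'] == 0).mean():.1%}")
print(f"Median age: {data['Age'].median()}")
print(data['Geography'].value_counts())
print(data['Gender'].value_counts())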
Data Cleaning and Preprocessing
I focused on removing irrelevant identifiers and standardizing the numerical variables for consistency. Categorical variables like geography and gender were encoded into numerical formats suitable for modeling.
from sklearn.preprocessing import StandardScaler, LabelEncoder
# Dropping irrelevant columns
data_cleaned = data.drop(['RowNumber', 'CustomerId', 'Surname'], axis=1)
# Encoding categorical variables
label_encoders = {column: LabelEncoder() for column in ['Geography', 'Gender']}
for column in label_encoders:
    data_cleaned[column] = label_encoders[column].fit_transform(data_cleaned[column])
# Standardizing numerical variables (zero mean, unit variance)
scaler = StandardScaler()
numerical_columns = ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'EstimatedSalary']
data_cleaned[numerical_columns] = scaler.fit_transform(data_cleaned[numerical_columns])
# Separating features and target
X = data_cleaned.drop('Exited', axis=1)
y = data_cleaned['Exited']
Addressing Class Imbalance
The data revealed a significant class imbalance in the target variable, customer churn. To address this, I employed the Synthetic Minority Over-sampling Technique (SMOTE), ensuring my predictive model was not biased towards the majority class.
from imblearn.over_sampling import SMOTE
smote = SMOTE()
X_smote, y_smote = smote.fit_resample(X, y)
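As a small sanity check (a sketch, not part of the original pipeline), the class counts before and after resampling can be compared:
# Compare class balance before and after SMOTE
print("Before SMOTE:", y.value_counts().to_dict())
print("After SMOTE:", pd.Series(y_smote).value_counts().to_dict())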
Predictive Modeling & Hyperparameter Tuning
My predictive modeling contenders were Logistic Regression and Random Forest. The latter emerged as the superior model with higher accuracy (86.3%), precision (76.71%), and a better balance in recall and F1 score. Its ROC-AUC score of 84.70% indicated a strong ability to distinguish between customers who would churn and those who would not.
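The comparison itself is not shown above; here is a minimal sketch of how it might be run, holding out a test split of the resampled data (the split parameters and model settings are illustrative assumptions):
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score
# Hold out a test split for a like-for-like comparison of the two contenders
X_train, X_test, y_train, y_test = train_test_split(
    X_smote, y_smote, test_size=0.2, random_state=42, stratify=y_smote)
for name, model in [('Logistic Regression', LogisticRegression(max_iter=1000)),
                    ('Random Forest', RandomForestClassifier(random_state=42))]:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1]
    print(name)
    print(classification_report(y_test, y_pred, digits=4))
    print(f"ROC-AUC: {roc_auc_score(y_test, y_proba):.4f}")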
The Random Forest model revealed several influential factors (a sketch of how the importances are extracted follows this list):
- Age was a leading indicator, with older customers more likely to churn.
- Estimated Salary and Credit Score also had significant roles, pointing to financial factors as key influencers.
- Account Balance and Engagement Metrics like the number of products used, tenure, and activity status were crucial in predicting churn.
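A minimal sketch of how these importances can be read off a fitted Random Forest (refitting a default forest on the resampled data purely for illustration):
from sklearn.ensemble import RandomForestClassifier
# Rank features by their importance in a forest fit on the resampled data
rf = RandomForestClassifier(random_state=42)
rf.fit(X_smote, y_smote)
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))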
Having settled on Random Forest, I ventured into hyperparameter tuning to enhance my model's accuracy. Utilizing GridSearchCV, I meticulously fine-tuned my model parameters, ensuring optimal performance. Cross-validation further confirmed my model's robustness, safeguarding against overfitting.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score
# Hyperparameter tuning for Random Forest
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 5, 10]
}
grid_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_smote, y_smote)
best_rf_model = grid_search.best_estimator_
# Cross-validation
cv_scores = cross_val_score(best_rf_model, X_smote, y_smote, cv=5)
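For completeness, a short sketch of how the tuning results can be inspected:
# Inspect the winning parameters and the cross-validated accuracy
print("Best parameters:", grid_search.best_params_)
print(f"Mean CV accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")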
Feature Engineering
For feature engineering, I employed Polynomial Features, adding pairwise interaction terms that capture more intricate relationships in the data and enrich my model's input.
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_smote_poly = poly.fit_transform(X_smote)
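The expanded matrix is not used in the snippets above; as a minimal sketch (assuming the grid_search object from the tuning step), one way to check whether the interaction terms pay off is to re-run cross-validation on it:
# Cross-validate a forest with the tuned parameters on the interaction features
rf_poly = RandomForestClassifier(**grid_search.best_params_)
poly_scores = cross_val_score(rf_poly, X_smote_poly, y_smote, cv=5)
print(f"CV accuracy with interaction features: {poly_scores.mean():.4f}")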
Model Evaluation
I put my refined Random Forest model to the test. With accuracy, precision, recall, and F1 score as my metrics, I evaluated the model's performance, ready to translate these numbers into real-world strategies.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# Evaluating the tuned model on the full, original (pre-SMOTE) dataset
y_pred = best_rf_model.predict(X)
metrics = {
    'Accuracy': accuracy_score(y, y_pred),
    'Precision': precision_score(y, y_pred),
    'Recall': recall_score(y, y_pred),
    'F1 Score': f1_score(y, y_pred)
}
print(metrics)
{'Accuracy': 0.9806,
 'Precision': 0.9221255153458543,
 'Recall': 0.9882179675994109,
 'F1 Score': 0.9540284360189574}
Because the model has, in effect, already seen these rows during resampling and training, the figures run higher than the comparison scores reported earlier and are best read as a sanity check rather than an unbiased estimate of generalization.
Model Explainability
Using SHAP (SHapley Additive exPlanations), an explainability tool, I shed light on how each feature influenced my model's predictions, demystifying the often-opaque nature of complex models.
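The exact calls are not shown above; a minimal sketch using SHAP's tree explainer might look like this (handling of shap_values depends on the SHAP version, as noted in the comments):
import shap
# Explain the tuned Random Forest's predictions on the original features
explainer = shap.TreeExplainer(best_rf_model)
shap_values = explainer.shap_values(X)
# Older SHAP versions return one array per class; newer ones a 3D array.
# Either way, keep the slice for the positive (churn) class.
if isinstance(shap_values, list):
    churn_shap = shap_values[1]
elif getattr(shap_values, 'ndim', 2) == 3:
    churn_shap = shap_values[:, :, 1]
else:
    churn_shap = shap_values
shap.summary_plot(churn_shap, X)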
Model Saving
To ensure the hard-earned model is not fleeting, I saved it using joblib, making it readily available for future predictions and analyses.
import joblib
# Saving the model
joblib.dump(best_rf_model, 'random_forest_model.pkl')
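Loading it back for later use is just as simple; a quick sketch:
# Reload the saved model and reuse it for predictions
loaded_model = joblib.load('random_forest_model.pkl')
print(loaded_model.predict(X.head()))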
Banking on Loyalty Strategies
Based on my findings, I recommend the following strategies for the bank:
- Age and engagement: Older customers show a higher tendency to churn. Tailoring services and communication for the older demographic could help in retention.
- Financial product optimization: The model indicated certain salary and balance thresholds where churn spiked. This insight could guide the bank in adjusting its offerings for customers in specific salary ranges or with particular account balances.
- Rewarding loyalty: Long-tenured customers are less likely to churn. Implementing or enhancing loyalty programs that reward these customers, and those holding multiple products, could further strengthen these bonds.
- Leveraging active feedback: Regular, targeted feedback could preempt customer dissatisfaction, nipping potential churn in the bud.
Conclusion
"Banking on Loyalty" stands as a testament to the power of advanced data techniques in navigating the complex seas of customer behavior. As financial services continue to harness these tools, the potential to not just understand but also proactively shape customer journeys seems limitless. In this age of data, banks that adeptly use these insights will not only reduce churn but also carve a path to enduring customer loyalty and trust.
Datafully Yours, The_AIProdigy