Customer churn is a critical challenge faced by banks and financial institutions, leading to significant revenue loss and reduced business growth. This project aims to address this issue by utilizing Apache Spark, a powerful distributed computing framework, to analyze bank customer data. By building a churn prediction model, we seek to identify customers at risk of churning and provide insights that can guide effective customer retention strategies, ultimately improving the institution's overall performance and customer satisfaction.
- 1. Data Exploration
- 1.1 Data Loading
- 1.2 Exploring Dataset
- 1.3 Data Visualization
- 2. Data Preprocessing
- 3. Modelling
- 4. Model Tuning
1. Install Java Development Kit (JDK)
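Spark runs on the Java Virtual Machine, so a JDK must be installed on your system. Download and install a JDK version supported by your Spark release from the Oracle website or another official source. The Spark documentation lists the supported Java versions, and the newest JDK is not always among them.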
2. Download Spark
Visit the Apache Spark website and download the latest pre-built release of Spark. Extract the downloaded archive to a preferred location on your computer.
3. Set Environment Variables
Set the following environment variables in your system:
JAVA_HOME: Point it to the directory where Java is installed.
SPARK_HOME: Point it to the Spark installation directory.
PATH: Add %SPARK_HOME%\bin and %SPARK_HOME%\sbin to the PATH variable. This is Windows syntax; on Linux or macOS, add $SPARK_HOME/bin and $SPARK_HOME/sbin instead.
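Before moving on, it can help to confirm that the variables are actually visible to Python. The snippet below is a minimal sketch, not part of the original project: the helper name `missing_spark_env` is hypothetical, and it only checks that JAVA_HOME and SPARK_HOME are set and point to existing directories.

```python
import os

# Variables that step 3 asks you to define.
REQUIRED_VARS = ("JAVA_HOME", "SPARK_HOME")


def missing_spark_env(env=None):
    """Return the required variables that are unset or do not point to a directory.

    `env` defaults to the current process environment; a plain dict can be
    passed instead, which is convenient for testing.
    """
    env = os.environ if env is None else env
    missing = []
    for name in REQUIRED_VARS:
        path = env.get(name)
        if not path or not os.path.isdir(path):
            missing.append(name)
    return missing


problems = missing_spark_env()
if problems:
    print("Missing or invalid:", ", ".join(problems))
else:
    print("JAVA_HOME and SPARK_HOME look OK")
```

If a variable is reported as missing right after you set it, restart the terminal (and Jupyter) so the new values are picked up.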
4. Install FindSpark
FindSpark is a Python library that allows Jupyter Notebook to locate Spark installed on your system. Install it using pip:
pip install findspark
5. Configure Jupyter Notebook
Make sure you have Jupyter Notebook installed. If not, install it using:
pip install jupyter
6. Start Jupyter Notebook
Launch Jupyter Notebook by running the following command in your terminal:
jupyter notebook
A web browser will open with the Jupyter interface.
7. Connect Jupyter Notebook to Spark
In your Jupyter Notebook, create a new notebook and add the following code to connect it to Spark:
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("ChurnAnalysis").getOrCreate()
You are now ready to use Spark within Jupyter Notebook for Churn Analysis!
The dataset contains various features related to bank customers, such as age, balance, gender, tenure, and other relevant attributes. The target variable is the "churn" column, which indicates whether a customer has churned (1) or not (0). The dataset is available here.
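As a rough illustration of how the target column can be inspected, the sketch below loads a CSV with Spark and computes the share of churned customers. It is an assumption-laden example rather than the project's actual code: the helper names `churn_counts` and `churn_rate` are hypothetical, and it assumes the data is a CSV file with a header row and an integer "churn" column.

```python
def churn_counts(spark, csv_path, target_col="churn"):
    """Load the CSV with Spark and return {label: row_count} for the target column.

    `spark` is an existing SparkSession; `csv_path` is wherever you saved the data.
    """
    # header=True uses the first row as column names;
    # inferSchema=True lets Spark read the 0/1 labels as integers.
    df = spark.read.csv(csv_path, header=True, inferSchema=True)
    rows = df.groupBy(target_col).count().collect()
    # Use row["count"] rather than row.count, which is a tuple method on Row.
    return {row[target_col]: row["count"] for row in rows}


def churn_rate(counts):
    """Fraction of rows labelled 1 (churned); 0.0 if there are no rows."""
    total = sum(counts.values())
    return counts.get(1, 0) / total if total else 0.0
```

In a notebook with the `spark` session created earlier, calling `churn_rate(churn_counts(spark, "bank_customers.csv"))` would return the churn fraction, where `bank_customers.csv` is a placeholder file name. A rate far from 0.5 indicates class imbalance, which is worth knowing before the modelling step.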
If you want to learn more about big data technologies, you can read my Medium article here.
This project is licensed under the Apache License 2.0. Feel free to use and modify the code under the license terms.
Contributions to this repository are welcome! If you have any ideas for improvements, feel free to create a pull request.✨💪