- Business Understanding
  - 1.1 Business Problem
  - 1.2 Dataset
  - 1.3 Proposed Analytics Solution
- Data Exploration and Preprocessing
  - 2.1 Data Quality Report
  - 2.2 Missing Values and Outliers
  - 2.3 Normalization
  - 2.4 Transformations
  - 2.5 Feature Selection
- Model Selection
  - 3.1 Logistic Regression
  - 3.2 Random Forest Classifier
  - 3.3 KNN Classifier
  - 3.4 Naive Bayes Classifier
  - 3.5 AdaBoost Classifier
  - 3.6 Gradient Boosting Classifier
  - 3.7 XGBoost Classifier
- Evaluation
  - 4.1 Accuracy
  - 4.2 Sensitivity
  - 4.3 Specificity
  - 4.4 Precision Score
  - 4.5 False Negative Rate
  - 4.6 Youden’s Index
  - 4.7 Discriminant Power
  - 4.8 Balanced Classification Rate
  - 4.9 Geometric Mean
- Results
Global warming is affecting ecosystems worldwide, and Australia is particularly vulnerable to the impacts of climate change, including rising temperatures, sea level rise, coral bleaching, and extreme weather events such as bushfires. One critical issue arising from these changes is food security, as agriculture relies heavily on rainfall. This project aims to predict whether it will rain in Australia the next day, with a focus on building budget-friendly rainfall forecast applications.
The dataset used for this project was obtained from Kaggle and contains 23 columns and 145,461 rows. The target variable is "RainTomorrow," which indicates whether it will rain the next day. The columns include:
- Date
- Location (weather station name)
- Minimum and Maximum Temperature
- Rainfall
- Evaporation
- Sunshine hours
- Wind direction and speed
- Humidity
- Atmospheric pressure
- Cloud cover
- Temperature at different times of the day
- Rain today (binary)
- Rain tomorrow (target variable)
The analytics solution proposed for this project involves the following steps:
- Gathering Data: Data was collected from various sources, and a Kaggle dataset with relevant features for rainfall prediction was selected.
- Data Analysis: The dataset was analyzed to gain a better understanding of its content and identify important features and trends that can aid in model building.
- Data Preprocessing: Data quality issues were addressed, including handling missing values through imputation, and outliers were identified and managed.
- Feature Selection: Relevant features were selected for model building using techniques such as Chi-square test, PCA, and Recursive Feature Elimination (RFE).
The data quality report includes metrics for both categorical and continuous variables, such as counts, missing values, cardinality, and key statistics.
Missing values were identified in several features and were handled through imputation. Outliers were detected using box plots and the Interquartile Range (IQR) method.
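The imputation and IQR steps can be sketched as follows. This is a minimal illustration using only the Python standard library, not the project's actual code: the source does not say which imputation strategy was used or how outliers were "managed", so median imputation and clipping to the IQR fences are assumptions here, and the sample `rainfall` values are made up.

```python
from statistics import median, quantiles


def impute_median(values):
    """Replace None entries with the median of the observed values.

    Median imputation is an assumption; the report only says 'imputation'.
    """
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def clip_outliers(values, k=1.5):
    """Cap values outside the IQR fences (one possible way to handle outliers)."""
    lower, upper = iqr_bounds(values, k)
    return [min(max(v, lower), upper) for v in values]


# Hypothetical sample of a continuous feature with two missing readings.
rainfall = [0.0, 0.2, None, 1.0, 0.0, 45.0, 0.6, None, 2.4]
filled = impute_median(rainfall)
capped = clip_outliers(filled)
```

Points above the upper fence, such as the 45.0 reading, are the same points a box plot would draw beyond its whiskers.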
Continuous features were normalized using Min-Max normalization to bring them within the range [0, 1].
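Min-Max normalization maps each value x to (x - min) / (max - min). A minimal sketch in plain Python, with made-up `humidity` readings (the function name is illustrative, not taken from the project):

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; map it to 0 to avoid 0/0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical humidity readings in percent.
humidity = [35.0, 50.0, 80.0, 95.0]
scaled = min_max_scale(humidity)
```

In a train/test setup, the minimum and maximum would normally be taken from the training split only and then reused for the test split, which is how scikit-learn's `MinMaxScaler` behaves when fitted on training data.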
Categorical data were transformed into numerical data using one-hot encoding.
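One-hot encoding replaces a categorical column with one 0/1 indicator column per category. The sketch below shows the idea in plain Python; the `WindDir` prefix and the sample values are hypothetical, and in practice a helper such as `pandas.get_dummies` would typically do this.

```python
def one_hot(values, prefix):
    """Return (column_names, rows) with one 0/1 indicator per distinct category."""
    categories = sorted(set(values))
    columns = [f"{prefix}_{c}" for c in categories]
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return columns, rows


# Hypothetical wind-direction readings.
wind_dir = ["N", "SW", "N", "E"]
columns, rows = one_hot(wind_dir, "WindDir")
```

Each row contains exactly one 1, so a feature with many categories (for example the weather station name) adds as many columns as it has distinct values.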
Feature selection techniques such as Chi-square test, PCA, and RFE were used to identify and select the most relevant features for model building.
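As an illustration of the Chi-square test only (PCA and RFE are not shown), the sketch below computes the Pearson chi-square statistic between one categorical feature and the target from their contingency table. The function name and the sample data are assumptions for illustration; a larger statistic suggests a stronger association with the target.

```python
from collections import Counter


def chi_square(feature, target):
    """Pearson chi-square statistic for two categorical sequences of equal length."""
    n = len(feature)
    observed = Counter(zip(feature, target))   # joint counts per (feature, target) cell
    feature_totals = Counter(feature)          # row totals
    target_totals = Counter(target)            # column totals
    stat = 0.0
    for f, f_count in feature_totals.items():
        for t, t_count in target_totals.items():
            expected = f_count * t_count / n   # count expected under independence
            stat += (observed[(f, t)] - expected) ** 2 / expected
    return stat


# Hypothetical sample: rain today vs. rain tomorrow for ten days.
rain_today = ["Yes"] * 4 + ["No"] * 6
rain_tomorrow = ["Yes", "Yes", "Yes", "No", "Yes", "No", "No", "No", "No", "No"]
stat = chi_square(rain_today, rain_tomorrow)
```

Ranking features by this statistic and keeping the top few is one common way to use the test for selection; scikit-learn offers `sklearn.feature_selection.chi2`, `sklearn.feature_selection.RFE`, and `sklearn.decomposition.PCA` for the three techniques named above.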
Various classification models were evaluated for their effectiveness in predicting rainfall. The following models were considered:
Logistic Regression models the probability of rain the next day as a function of the input variables. It achieved an accuracy of 85.03%.
Random Forest, an ensemble of decision trees, achieved an accuracy of 78.11%.
The K-Nearest Neighbors (KNN) classifier achieved an accuracy of 79.23%.
The Naive Bayes classifier, which assumes that the features follow a normal distribution, achieved an accuracy of 78.11%.
AdaBoost, a boosting ensemble technique, achieved an accuracy of 84.47%.
The Gradient Boosting classifier achieved an accuracy of 84.62%.
The XGBoost classifier, a gradient-boosting ensemble method, achieved the highest accuracy of the seven models at 85.62%.
Various evaluation metrics were used to assess the performance of the models, including accuracy, sensitivity, specificity, precision score, false negative rate, Youden’s Index, discriminant power, balanced classification rate, and geometric mean.
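All of these metrics can be derived from the four cells of a binary confusion matrix. The sketch below uses the common textbook definitions; the source does not state its exact formulas or averaging scheme, so reported figures may have been computed differently. The function name and the example counts are hypothetical.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute evaluation metrics from confusion-matrix counts.

    Assumes sensitivity and specificity are strictly between 0 and 1;
    otherwise discriminant power is undefined (log of 0 or division by 0).
    """
    sens = tp / (tp + fn)   # true positive rate
    spec = tn / (tn + fp)   # true negative rate
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": sens,
        "specificity": spec,
        "precision": tp / (tp + fp),
        "false_negative_rate": fn / (fn + tp),   # equals 1 - sensitivity
        "youden_index": sens + spec - 1,
        "discriminant_power": (math.sqrt(3) / math.pi)
        * (math.log(sens / (1 - sens)) + math.log(spec / (1 - spec))),
        "balanced_classification_rate": (sens + spec) / 2,
        "geometric_mean": math.sqrt(sens * spec),
    }


# Hypothetical counts: 100 rainy days and 200 dry days in the test set.
m = binary_metrics(tp=60, fp=20, tn=180, fn=40)
```

With these counts, accuracy is 0.80 while sensitivity is only 0.60, which shows why accuracy alone can be misleading when dry days outnumber rainy ones.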
The results of the model evaluation are summarized in the table below:
Model | Accuracy | Sensitivity | Precision Score | False Negative Rate | Youden’s Index | Discriminant Power | Balanced Classification Rate | Geometric Mean |
---|---|---|---|---|---|---|---|---|
Logistic Regression | 0.8503 | 0.72 | 0.79 | 0.13 | 0.59 | 1.55 | 0.79 | 0.79 |
Random Forest | 0.7811 | 0.66 | 0.82 | 0.16 | 0.64 | 1.67 | 0.82 | 0.82 |
KNN Classifier | 0.7923 | 0.64 | 0.74 | 0.16 | 0.47 | 1.2 | 0.74 | 0.73 |