This project aims to predict passenger survival on the Titanic using machine learning models and feature engineering techniques. The dataset is sourced from the Kaggle Titanic Competition. Through exploratory data analysis (EDA), feature engineering, and model optimization, the project achieved a Kaggle leaderboard score of 0.78947.
The Titanic dataset includes passenger information like age, class, fare, family relationships, and survival status. The task involves building a predictive model to classify passengers into survivors (`1`) and non-survivors (`0`).
- Dataset: Kaggle Titanic Dataset
- Target Variable: `Survived` (1 = survived, 0 = did not survive)
- Final Kaggle Score: 0.78947
This project follows a structured machine learning pipeline:
The Titanic dataset includes passenger details such as demographic, socio-economic, and ticketing information. The target variable, `Survived`, indicates whether a passenger survived (`1`) or not (`0`). Key insights from the dataset are as follows:
- **Survival Distribution:**
  - Approximately 38.4% of passengers survived, while 61.6% did not.
  - This highlights an imbalanced dataset where survival is the minority class.
- **Pclass (Passenger Class):**
  - First-class passengers had the highest survival rate (~63%), followed by second-class (~47%) and third-class (~24%).
- **Sex:**
  - Females had a significantly higher survival rate (~74%) compared to males (~19%), showing a strong correlation with survival.
- **Age:**
  - Younger passengers (children) had higher survival rates compared to adults and seniors.
  - Missing values in `Age` were imputed to avoid data loss.
- **Embarked (Port of Embarkation):**
  - Passengers who embarked at Cherbourg (`C`) had the highest survival rate (~55%), followed by Queenstown (`Q`, ~39%) and Southampton (`S`, ~34%).
- **Fare:**
  - Higher fares were associated with higher survival rates, likely due to their correlation with passenger class.
- **Family Features (SibSp and Parch):**
  - Passengers traveling with small family groups (2–4 members) had better survival rates than solo travelers or those with large families.
- **Cabin and Deck:**
  - Passengers with known cabin information (e.g., Decks B, C, D) had higher survival rates than those without cabin information (`Z`).
- **Title:**
  - Titles extracted from names (e.g., `Mr.`, `Mrs.`, `Miss.`) revealed distinct survival trends:
    - Females with titles like `Miss.` and `Mrs.` had significantly higher survival rates.
    - Males with titles like `Mr.` had the lowest survival rates.
    - Rare titles (e.g., `Countess`, `Rev`) varied but often correlated with class or social status.
These observations were used to guide feature engineering and model development.
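The survival-rate breakdowns above can be reproduced with a few pandas one-liners. The sketch below assumes the standard Kaggle `train.csv` with its usual column names; the file path is illustrative.

```python
import pandas as pd

# Load the Kaggle training data (path is an assumption; adjust to your layout).
train = pd.read_csv("train.csv")

# Overall survival distribution (~38.4% survived vs. ~61.6% did not).
print(train["Survived"].value_counts(normalize=True))

# Survival rate by passenger class, sex, and port of embarkation.
for col in ["Pclass", "Sex", "Embarked"]:
    print(train.groupby(col)["Survived"].mean().sort_values(ascending=False))
```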
- **Handling Missing Values:**
  - Used `IterativeImputer` to fill missing `Age` values based on correlated features (`Pclass`, `SibSp`, `Parch`).
  - Replaced missing `Embarked` values with the mode.
  - Filled missing `Fare` with the median and replaced missing `Cabin` with a placeholder (`Z`).
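A minimal sketch of this imputation step, assuming the Kaggle column names; the exact estimator settings used in the notebook may differ. Note that `IterativeImputer` is still experimental in scikit-learn and needs an explicit enabling import.

```python
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 — enables IterativeImputer
from sklearn.impute import IterativeImputer

train = pd.read_csv("train.csv")

# Impute Age from the features it correlates with (Pclass, SibSp, Parch).
impute_cols = ["Age", "Pclass", "SibSp", "Parch"]
train[impute_cols] = IterativeImputer(random_state=42).fit_transform(train[impute_cols])

# Simple fills for the remaining gaps.
train["Embarked"] = train["Embarked"].fillna(train["Embarked"].mode()[0])
train["Fare"] = train["Fare"].fillna(train["Fare"].median())
train["Cabin"] = train["Cabin"].fillna("Z")
```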
- **Feature Engineering:**
  - Family Features:
    - Created `FamilySize = SibSp + Parch + 1` to represent total family members onboard.
    - Categorized passengers into family groups: `Solo` (alone travelers), `Small` (families of 2–4), `Large` (families of 5 or more).
  - Deck Extraction:
    - Extracted deck information from the `Cabin` feature.
    - Decks `D`, `E`, and `B` had higher survival rates, while passengers without cabin info (`Z`) had the lowest.
  - Titles from Names:
    - Extracted titles (`Mr.`, `Mrs.`, `Miss.`, etc.) from passenger names.
    - Grouped rare titles (`Lady`, `Countess`, `Jonkheer`) into a `Rare` category.
  - Binned Features:
    - Created categories for `Age` and `Fare`:
      - AgeBands: `Child`, `Teenager`, `Adult`, `Senior`
      - FareBands: `Low`, `Medium`, `High`
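A sketch of these derived features, continuing from the imputation snippet above; the bin edges and the list of common titles are illustrative assumptions rather than the exact choices made in the notebook.

```python
# Family size and family-group categories (Solo / Small / Large).
train["FamilySize"] = train["SibSp"] + train["Parch"] + 1
train["FamilyGroup"] = pd.cut(train["FamilySize"], bins=[0, 1, 4, 20],
                              labels=["Solo", "Small", "Large"])

# Deck letter taken from the Cabin feature ("Z" marks unknown cabins).
train["Deck"] = train["Cabin"].str[0]

# Title extracted from the Name column, with rare titles grouped together.
train["Title"] = train["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
common_titles = ["Mr", "Mrs", "Miss", "Master"]
train["Title"] = train["Title"].where(train["Title"].isin(common_titles), "Rare")

# Binned Age and Fare.
train["AgeBand"] = pd.cut(train["Age"], bins=[0, 12, 18, 60, 120],
                          labels=["Child", "Teenager", "Adult", "Senior"])
train["FareBand"] = pd.qcut(train["Fare"], q=3, labels=["Low", "Medium", "High"])
```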
- **One-Hot Encoding:**
  - Encoded categorical features (`Pclass`, `Sex`, `Embarked`, etc.) into numerical values.
  - Ensured no multicollinearity by dropping one category from each encoded feature.
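Roughly, the encoding step looks like the sketch below; the exact column list depends on which engineered features are kept.

```python
# One-hot encode the categorical features, dropping the first level of each
# encoded column to avoid the dummy-variable trap (multicollinearity).
categorical = ["Pclass", "Sex", "Embarked", "FamilyGroup", "Deck",
               "Title", "AgeBand", "FareBand"]
train = pd.get_dummies(train, columns=categorical, drop_first=True)
```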
- **Scaling:**
  - Applied `StandardScaler` to scale numerical features (`Age`, `Fare`) for consistency.
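A minimal scaling sketch; in a fully leak-free pipeline the scaler would be fit on the training split only and then applied to the validation and test sets.

```python
from sklearn.preprocessing import StandardScaler

# Standardize the continuous features so they share a common scale.
scaler = StandardScaler()
train[["Age", "Fare"]] = scaler.fit_transform(train[["Age", "Fare"]])
```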
- **Split Dataset:**
  - Split data into training and validation sets (80%-20%) using `train_test_split` with stratification on the target variable.
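The split itself, sketched under the assumption that identifier and free-text columns are dropped from the feature matrix:

```python
from sklearn.model_selection import train_test_split

# Feature matrix and target; the dropped columns are an illustrative choice.
X = train.drop(columns=["Survived", "PassengerId", "Name", "Ticket", "Cabin"])
y = train["Survived"]

# 80/20 split, stratified on Survived to preserve the class ratio.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
```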
- **Selected Models:**
  - Trained the following models with hyperparameter tuning:
    - Random Forest
    - Gradient Boosting
    - K-Nearest Neighbors (KNN)
    - XGBoost
    - CatBoost
- **Hyperparameter Optimization:**
  - Used `GridSearchCV` with 5-fold cross-validation to tune parameters like `n_estimators`, `max_depth`, and `learning_rate`.
  - Ensured models were robust to overfitting by evaluating cross-validation scores.
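As an example, tuning the random forest with `GridSearchCV` might look like the sketch below; the grid shown is illustrative rather than the exact search space used in the notebook (`learning_rate` applies to the boosting models instead).

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative search space for the random forest.
param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [4, 6, 8, None],
}
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid,
                    cv=5, scoring="accuracy", n_jobs=-1)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```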
- **Results:**
  - Best models: Random Forest and XGBoost, achieving cross-validation scores of 83.85%.
- **Validation Accuracy:**
  - Random Forest and XGBoost outperformed the other models with a validation accuracy of 81.01%.
- **Feature Importance:**
  - `Pclass`, `Sex`, and `Fare` were among the most important predictors.
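With a fitted tree ensemble, the importances can be read off directly; a short sketch assuming the tuned random forest from the grid search above:

```python
import pandas as pd

# Rank features by the impurity-based importances of the tuned forest.
best_rf = grid.best_estimator_
importances = pd.Series(best_rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))
```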
- **Submission:**
  - Generated predictions using all models and prepared `.csv` files for Kaggle submission.
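A small helper for writing predictions in the two-column format Kaggle expects; the file name and the `test_features` variable are assumptions for illustration.

```python
import pandas as pd

def write_submission(passenger_ids, predictions, path):
    """Save predictions in the PassengerId/Survived format Kaggle expects."""
    pd.DataFrame({"PassengerId": passenger_ids, "Survived": predictions}).to_csv(
        path, index=False)

# Example usage, assuming test_features is the Kaggle test set after the same
# preprocessing pipeline as the training data:
# test = pd.read_csv("test.csv")
# write_submission(test["PassengerId"],
#                  grid.best_estimator_.predict(test_features),
#                  "submissions/random_forest_submission.csv")
```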
- **Survival Factors:**
  - Gender: Females were prioritized during evacuation.
  - Class: Higher-class passengers had better access to lifeboats.
  - Family: Passengers with small families had higher survival chances compared to those traveling alone or in large families.
- **Feature Engineering Impact:**
  - Adding `FamilySize`, `AgeBand`, and `Title` significantly improved model performance.
- **Model Performance:**
  - Ensemble models like Random Forest and XGBoost performed better than simpler models like KNN.
| Model | Training Accuracy | Validation Accuracy | Cross-Validation Accuracy |
|---|---|---|---|
| Random Forest | 91.01% | 81.01% | 83.85% |
| XGBoost | 91.85% | 81.01% | 83.85% |
| Gradient Boosting | 87.08% | 78.21% | 83.85% |
| K-Nearest Neighbors | 85.81% | 78.77% | 81.18% |
| CatBoost | 83.85% | 80.45% | 83.85% |
Follow these steps to run the project on your local machine:
Clone the repository to your local system using the following command:
git clone https://github.com/Igoras6534/Titanic-ML-Project.git
cd Titanic-ML-Project
Create a folder called `submissions` in the project root (the generated Kaggle submission files are written there).
Ensure you have Python 3.8+ installed. To install the required libraries, run:
pip install -r requirements.txt
Launch Jupyter Notebook or Jupyter Lab and open the notebook:
jupyter notebook Titanic-Project-EDA-ML.ipynb
Now you're ready to run the notebook.