This project is a learning experience and emerged as a solution to the COVID-19 Death Prediction Kaggle challenge. It is not an ideal solution: no significant amount of data analysis went into this project. For insights into data analysis, please see other entries to the challenge via the link above.
We are looking for a Machine Learning Algorithm that is able to predict the number of people who are likely to die this week, based on given disease information from last week.
This information (a set of samples) consists of:
- The given samples' Location (string/object)
- Total Weekly Cases (number/float)
- The year that the sample was created (number/int)
- Weekly Cases per 1,000,000 people (number/float)
- Weekly Deaths for this week (number/float)
- Weekly Deaths per 1,000,000 people (number/float)
- Total number of administered Vaccinations (number/float)
- Total number of people who received at least one Vaccine dose (number/float)
- Total number of people who received all doses prescribed by the initial vaccination protocol (number/float)
- Next week's COVID deaths (number/float)
We are looking for:
- Next week's COVID deaths (number/float)
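As a rough orientation, here is a minimal sketch of how such a sample set might be loaded and split into features and target with pandas. The file name (`train.csv`) and the target column name (`next_week_deaths`) are assumptions for illustration, not the challenge's actual identifiers:

```python
import pandas as pd

# File and column names below are assumptions for illustration.
df = pd.read_csv("train.csv")

target_column = "next_week_deaths"  # hypothetical name for next week's COVID deaths
X = df.drop(columns=[target_column])
y = df[target_column]

print(X.dtypes)  # shows which columns are still strings and need conversion
```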
The notebook consists of three parts:
- one that does some basic data preparation (null eviction, string-to-number conversion via a dictionary, ...) and prepares all the dataframes,
- one that uses the dataframes to train a HistGradientBoostingRegressor from scikit-learn (or any other model) and displays the model's test score and training time (a sketch follows this list),
- one that imports a pickled (exported) model or dumps the last trained model into a pickle file, uses the model to predict the challenge's test.csv, and stores both in the project directory.
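A condensed sketch of what these three parts might look like, assuming hypothetical column names like `location` and `next_week_deaths` (the actual notebook may differ):

```python
import pickle
import time

import pandas as pd
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import train_test_split

# --- Part 1: basic data preparation (column names are assumptions) ---
df = pd.read_csv("train.csv")
df = df.dropna()  # null eviction: drop rows with missing values
location_map = {loc: i for i, loc in enumerate(df["location"].unique())}
df["location"] = df["location"].map(location_map)  # string -> number via dictionary

X = df.drop(columns=["next_week_deaths"])
y = df["next_week_deaths"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# --- Part 2: training, with test score and training time ---
model = HistGradientBoostingRegressor()
start = time.time()
model.fit(X_train, y_train)
print(f"training time: {time.time() - start:.2f}s")
print(f"test score (R^2): {model.score(X_test, y_test):.3f}")

# --- Part 3: export the trained model as a pickle file ---
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)
```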
To produce your own model with the given dataset, it should suffice to clone the repo, install the necessary modules using pip (see the imports in the notebook), and then execute the first and second cells in that order. This is an older, unmaintained notebook; correctness and functionality are neither guaranteed nor looked after.
The missing further data analysis most likely hindered the model's performance, assuming the data set was not already checked for relevant correlation patterns (which it usually is not). Even a simple analysis of the data correlations might have helped to weed out data points that show no significance to the problem at hand, thereby improving the overall performance of the resulting model. Finding hidden data links or discovering new correlating data points from within the data set (e.g. a name field like "Mr. ..." revealing a classifiable gender) is unlikely due to the bare-bones nature of all of the data points.
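For illustration, a minimal sketch of such a correlation check with pandas, again assuming the hypothetical `next_week_deaths` target column and an arbitrary significance cutoff:

```python
import pandas as pd

df = pd.read_csv("train.csv")

# Pearson correlation of every numeric column with the (assumed) target;
# features with near-zero correlation are candidates for removal.
correlations = df.corr(numeric_only=True)["next_week_deaths"].sort_values()
print(correlations)

weak = correlations[correlations.abs() < 0.1].index  # 0.1 is an arbitrary cutoff
print("candidates to drop:", list(weak))
```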
The choice of the HistGradientBoostingRegressor was mostly a result of its quick training times and promising results, and less of an educated decision. With more time and effort, one might be able to construct an ideal Machine Learning Model with either less or more complexity, or find a better-fitting model within the scikit-learn module.
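One lightweight way to look for a better-fitting model would be a cross-validated comparison of a few scikit-learn regressors; a sketch, assuming `X` and `y` are the prepared feature matrix and target from the data-preparation step:

```python
from sklearn.ensemble import HistGradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Compare candidate regressors by cross-validated R^2; X and y are
# assumed to be the prepared feature matrix and target from above.
candidates = {
    "hist_gradient_boosting": HistGradientBoostingRegressor(),
    "random_forest": RandomForestRegressor(),
    "ridge": Ridge(),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name}: mean R^2 = {scores.mean():.3f}")
```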