IH: W2 Project - Data cleaning & wrangling
We were provided with a raw dataset from Kaggle on recorded shark attacks: the time and date of each attack, characteristics of the victim, the lethality of the attacks, and details about how the events were investigated. (https://www.kaggle.com/teajay/global-shark-attacks#attacks.csv)
Our objective is to clean the dataset until we have a database we can work with, and then test the following three hypotheses:
- Most of the provoked shark attacks are provoked by Americans.
- Distribution of shark attacks is correlated with the age of the victims.
- Shark attacks occur mainly in the afternoon.
-
Looking for NaN values: The raw dataset contains many rows and columns that are filled with NaN values. To improve the quality of the dataset, we remove them with the drop and dropna methods of the pandas DataFrame, once we are sure no important information will be lost.
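A minimal sketch of this step, run on a small made-up frame rather than the real file. The column names and values are illustrative assumptions, not taken from our notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the raw file; names and values are made up.
raw = pd.DataFrame({
    "Date": ["2018-06-25", np.nan, "2018-06-08", np.nan],
    "Country": ["USA", np.nan, "AUSTRALIA", np.nan],
    "Unnamed: 22": [np.nan, np.nan, np.nan, np.nan],
})

# Drop rows, then columns, that contain nothing but NaN.
cleaned = raw.dropna(how="all", axis=0).dropna(how="all", axis=1)

# Share of missing values per remaining column, to decide what else to drop.
missing_share = cleaned.isna().mean()
print(cleaned.shape)
print(missing_share)
```

Using `how="all"` only removes rows and columns that are entirely empty, so partially filled records survive until they are checked by hand.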
-
Looking for duplicated values: Some columns hold duplicated or very similar information. We drop the redundant ones once we determine which column is more appropriate. We also drop columns that have no useful data for the hypotheses we have posed.
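One possible way to spot and drop redundant columns is sketched below. The frame and its column names are hypothetical examples, and `href` stands in for any column we judge irrelevant to the hypotheses.

```python
import pandas as pd

# Toy frame; column names and values are illustrative assumptions.
df = pd.DataFrame({
    "Case Number": ["2018.06.25", "2018.06.08", "2018.06.04"],
    "Case Number.1": ["2018.06.25", "2018.06.08", "2018.06.04"],
    "Year": [2018, 2018, 2018],
    "href": ["a", "b", "c"],
})

# Fraction of rows where two candidate columns agree.
agreement = (df["Case Number"] == df["Case Number.1"]).mean()

# Columns whose contents exactly duplicate an earlier column.
duplicated_cols = df.columns[df.T.duplicated().to_numpy()].tolist()

# Drop the redundant copies plus a column not needed for the hypotheses.
slim = df.drop(columns=duplicated_cols + ["href"])
print(agreement, duplicated_cols, list(slim.columns))
```

Exact duplicates can be dropped automatically. For columns that are only similar, the agreement rate helps decide which one to keep.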
-
Consistency of the data (Here comes the big prize!): We have to check whether, after getting rid of the "no data" entries, the data that remains is useful for an analysis. We proceed column by column, checking the quality of the data, fixing values when possible and dropping them when not.
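As an illustration of the column-by-column pass, the sketch below normalises a text column and coerces a numeric one. The column names, the allowed categories and the sample values are assumptions made for the example.

```python
import pandas as pd

# Toy frame with the kind of noise a hand-entered dataset tends to have.
df = pd.DataFrame({
    "Sex": ["M", "F ", "m", "lli", "F"],
    "Age": ["19", "30s", "25", " 42", "7"],
})

# Normalise text: strip whitespace, upper-case, keep only known categories.
sex = df["Sex"].str.strip().str.upper()
df["Sex"] = sex.where(sex.isin(["M", "F"]))

# Coerce ages to numbers; anything unparseable becomes NaN.
df["Age"] = pd.to_numeric(df["Age"].str.strip(), errors="coerce")

# Drop the rows that could not be fixed.
clean = df.dropna(subset=["Sex", "Age"])
print(clean)
```

Values such as `"30s"` could instead be mapped to an approximate number. Here they are simply treated as unfixable and dropped.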
-
Exporting data.
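A minimal export sketch, assuming the cleaned table is written to CSV. The output format and the file name `attacks_clean.csv` are assumptions.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the cleaned table; values are made up.
clean = pd.DataFrame({"Country": ["USA", "AUSTRALIA"], "Age": [19, 42]})

# Hypothetical output path; a temporary directory keeps the sketch side-effect free.
out_path = os.path.join(tempfile.mkdtemp(), "attacks_clean.csv")
clean.to_csv(out_path, index=False)

# Read the file back to confirm the round trip preserves the table.
reloaded = pd.read_csv(out_path)
print(reloaded)
```

Passing `index=False` avoids writing the row index as an extra unnamed column, which would otherwise reappear when the file is loaded again.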
-
Most of the provoked shark attacks are provoked by Americans: We compare the number of provoked shark attacks with the total number of attacks per country. The English turn out to be the nationality with the highest proportion of provoked shark attacks.
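The comparison can be sketched as a grouped proportion. The counts below are toy values chosen only to show the mechanics, not our results, and the column names `Country` and `Type` are assumptions.

```python
import pandas as pd

# Toy data; the counts are invented for illustration.
df = pd.DataFrame({
    "Country": ["USA", "USA", "USA", "USA", "ENGLAND", "ENGLAND", "AUSTRALIA"],
    "Type": ["Provoked", "Unprovoked", "Unprovoked", "Unprovoked",
             "Provoked", "Unprovoked", "Unprovoked"],
})

# Per country: total attacks, provoked attacks, and the provoked share.
summary = (
    df.assign(provoked=df["Type"].eq("Provoked"))
      .groupby("Country")["provoked"]
      .agg(total="count", provoked="sum", share="mean")
      .sort_values("share", ascending=False)
)
print(summary)
```

Ranking by `share` rather than by the raw `provoked` count keeps countries with many recorded attacks from dominating the result simply because of their size.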
-
Distribution of shark attacks correlates with the age of the victims: We find that the distribution of shark attacks by age has a well-defined maximum at around 19 years.
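One way to locate such a maximum is to count attacks per age and per age bin. The ages below are made up for the sketch, and the 10-year bin width is an arbitrary choice.

```python
import pandas as pd

# Toy ages; invented values, not the cleaned dataset.
ages = pd.Series([12, 17, 19, 19, 19, 21, 24, 35, 48, 60], name="Age")

# Count attacks per exact age and locate the most frequent one.
counts = ages.value_counts().sort_index()
peak_age = counts.idxmax()

# Coarser view: attacks per 10-year bin, closed on the left ([10, 20), ...).
bins = pd.cut(ages, bins=range(0, 71, 10), right=False)
per_decade = bins.value_counts().sort_index()
print(peak_age)
print(per_decade)
```

The binned view is less sensitive to noise in individual ages, so it is a useful cross-check on the single-age peak.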
-
Shark attacks occur mainly in the afternoon: We classify shark attacks by the time of day at which they occur and find that most attacks occur in the afternoon (which makes sense, since people tend to have their leisure time in the afternoon), closely followed by attacks occurring in the morning.
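A sketch of one possible classification is shown below. The helper `part_of_day`, the `18h00`-style time format and the hour cut-offs (6, 12, 18, 22) are all assumptions for the example, not the exact rules we applied.

```python
import re

import pandas as pd


def part_of_day(value):
    """Map a raw time entry to a coarse part of the day, or None if unknown."""
    if not isinstance(value, str):
        return None
    text = value.strip().lower()
    # Entries that already name a part of the day, e.g. "Late afternoon".
    for label in ("morning", "afternoon", "evening", "night"):
        if label in text:
            return label
    # Entries such as "18h00" or "09h30": read the hour.
    match = re.match(r"(\d{1,2})h\d{2}", text)
    if match is None:
        return None
    hour = int(match.group(1))
    if hour > 23:
        return None
    if 6 <= hour < 12:
        return "morning"
    if 12 <= hour < 18:
        return "afternoon"
    if 18 <= hour < 22:
        return "evening"
    return "night"


# Toy entries; invented values for illustration.
times = pd.Series(["18h00", "Late afternoon", "09h30", "14h00", None, "??"])
counts = times.map(part_of_day).value_counts()
print(counts)
```

Entries that cannot be interpreted map to `None` and are left out of the counts, so the totals only reflect attacks with a usable time.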