generated from jtr13/EDAVtemplate
-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy path04-missing.Rmd
131 lines (90 loc) · 3.12 KB
/
04-missing.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
# Missing values
```{r setup, include=FALSE}
# keep this chunk in your .Rmd file
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
```
```{r}
library(tidyverse)
library(ggplot2)
library(mi)
library(naniar)
```
## Human Trafficking
```{r}
#df_fbi_ht <- readr::read_csv("datasets/HT_2013_2019.csv")
fbiHT_df <- as.data.frame(df_fbi_ht)
x <- missing_data.frame(fbiHT_df)
image(x)
```
Alternatively, we can also view missing data as we can see below:
```{r}
colSums(is.na(df_fbi_ht))
```
PUB_AGENCY_UNIT and COUNTY_NAME contain most missing values, and since there is no way to impute this information, we will ignore it for our analysis.
It is also useful to identify if this dataset contains information about all fifty states:
```{r echo=TRUE}
length(unique(df_fbi_ht$STATE_NAME))
```
As seen above this dataset contains information about only 45 states. We handle this problem when cleaning the dataset.
## NYPD Arrests Data
```{r}
gg_miss_var(df_arrest_ytd)
```
The main feature of concern here is LAW_CAT_CD which depicts level of offense (felony, misdemeanor, violation), but since there is no way to impute this information of a categorical variable, we will ignore this feature for our analysis.
There are some missing values for feature OFNS_DESC. Since this is an important feature for our arrests analysis, and since the percentage of missing values is very low, we filter out these missing values.
## Shootings in NYC
```{r}
gg_miss_var(df_shooting_total)
```
- dropping the missing perpetrator columns, as they cannot be imputed in anyway. We intend to focus on the victim trend in this analysis. Hence, it is fine to not consider the above columns in the current analysis.
## FBI Drug Arrests Data
*For FBI Drug Arrests data, we analyze missing values as following:*
```{r}
df_fbi_drug_arrests <- readr::read_csv("datasets/arrests_national_drug.csv")
drug_df <- as.data.frame(df_fbi_drug_arrests)
y <- missing_data.frame(drug_df)
image(y)
```
Alternatively, we can also view missing data as we can see below:
```{r}
colSums(is.na(df_fbi_drug_arrests))
```
```{r}
gg_miss_var(df_fbi_drug_arrests)
```
State_abbr has been found to be completely missing.
## Hate Crimes
*For Hate Crimes data, we analyze missing values as following:*
```{r}
df_hate <- read.csv("datasets/NYPD_Hate_Crimes.csv")
df_hate$Full.Complaint.ID <- as.character(df_hate$Full.Complaint.ID)
```
We have converted complaint ID from int to character string now.
```{r}
summary(df_hate)
```
We now explore if there are missing values in this dataset.
```{r}
colSums(is.na(df_hate))
```
Alternate Method:
```{r}
sapply(df_hate, function(x) sum(is.na(x)))
mdhate <- missing_data.frame(df_hate)
image(mdhate)
```
```{r}
gg_miss_var(mdhate)
```
Arrest.Id and Arrest.Date have the same missing pattern. Other.Motive.Description has been found to have more than 700 missing values.
## Park Crime
*For Park Crime data, we analyze missing values as following:*
```{r}
df_parks <- read.csv('datasets/parks/finalpark.csv')
colSums(is.na(df_parks))
```
```{r}
mdparks <- missing_data.frame(df_parks)
gg_miss_var(mdparks)
```
None of the variables in this dataset are missing.