-
Notifications
You must be signed in to change notification settings - Fork 4
/
02-data_wrangling.Rmd
215 lines (154 loc) · 8.42 KB
/
02-data_wrangling.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
# Data Wrangling {#datawrangling}
## Intro
Data Wrangling is the process of cleaning and manipulating your data.frame. I have never obtained a dataset that did not need to be cleaned or organized in some fashion (if you have I am quite jealous).
The `DataExplorer` [package](https://github.com/boxuancui/DataExplorer) allows you to get a preliminary look at your data. It will check for missing data
## Exploratory Data Analysis
Phase I of my "statistics" is usually termed "Data Exploration" or "Exploratory Data Analysis"
The goal of this step is to gain valuable insights through the data so that one can know what is going on with the data, which part needs to be cleaned, what new features can be built, build hypotheses to be tested during the model creation/validation phase, or even just knowing some fun facts about the data ([src](https://towardsdatascience.com/powerful-packages-to-boost-your-exploratory-data-analysis-performance-bb0db4744530)).
My 2 favorite packages to get a glimpse of the data are
1. SmartEDA
2. [DataExplorer](https://github.com/boxuancui/DataExplorer)
3. summarytools
4. dataMaid
5. [janitor](https://sfirke.github.io/janitor/articles/janitor.html) and [here](https://github.com/sfirke/janitor)
## Creating Report with DataExplorer
```{r, echo=TRUE, eval=FALSE}
create_report(
df.fa, # the name of your dataframe
#y = 'heart_disease',
output_dir = 'output', # where do you want it to be saved relative to your project directory
output_file = 'data_explorer_fa_report.html', # the filename for the report
report_title = 'DTI (FA) Data Description' # the Title of your report
)
```
```{r, echo=TRUE, eval=FALSE}
SmartEDA::ExpNumStat(tbl.desc, round = 1)
ExpNumStat(
tbl.desc,
by = "GA",
gp = "Group",
Qnt = c(.1, .9),
Outlier = TRUE,
round = 1
)
ExpNumViz(tbl.desc, target = 'Group')
summarytools::dfSummary(
tbl.desc,
varnumbers = FALSE,
round.digits = 2,
plain.ascii = FALSE,
style = "grid",
graph.magnif = .33,
valid.col = FALSE,
tmp.img.dir = "img"
)
```
### Reading Material : Exploratory Data Analysis
1. https://m-clark.github.io/exploratory-data-analysis-tools/
2. https://blog.revolutionanalytics.com/2018/02/dataexplorer.html
3. https://www.avery-robbins.com/2020/06/01/first-post-ever/
## Tasks
### Inspecting the imported data
Among the issues you may encounter with your data is inconsistencies in columnnames. This can be rather annoying. In order to alleviate this, I like to pick one format and stick with it. When I receive a dataset from another researcher I will often use [snakecase](https://github.com/Tazinho/snakecase) to modify their dataset accordingly.
### Choosing good column names
1. Avoid using spaces. This makes it more tedious to refer or call a column later on. In the example below our task is to rename the column "Group Level". It should also be noted that things can get tricky when you are using certain packages and will over time, be a large burden to you. Hence, my advice is to remove spaces. Consider going into snakecase.
```{r, echo=TRUE, eval = FALSE}
df$`Group Level` # Column name with spaces
df$GroupLevel # Column name without spaces
```
### replace strings using pipes
In the code below I want to replace text in the "Test" column. Notice how I call the column once and then use the pipe operator to list the rest
```{r, echo=TRUE, eval=FALSE}
df %>%
mutate(Test = str_replace_all(Test, ":", "×") %>%
str_replace_all("group2.L", "Group") %>%
str_replace_all("mriloc2", "") %>%
str_replace_all("hemisphereLeft", "LeftHemi") %>%
str_replace_all("hemisphereRight", "RightHemi")
)
```
### Useful Links
Below is a list of pages I have collected that have come in very useful for beginners
#### Selecting using the tidyverse
https://suzan.rbind.io/2018/02/dplyr-tutorial-3/#filtering-based-on-a-exact-character-variable-matches
#### Working with dates
We won't cover working with dates much in this guide, because its not typical. However, should you need them the link below provides a general walkthrough.
https://www.r-bloggers.com/a-comprehensive-introduction-to-handling-date-time-in-r-2/
````{r, echo=TRUE, eval = FALSE}
# Adding Names to Columns ==================
# This can be required when using certain plotting functions like "likert" --> see ALPH likert4.R
names(items) <- c(
symptoms_decLOC="Did the patient have loss of consciousness?",
symptoms_headache="Did the patient have headaches?",
symptoms_nausea_vomitting="Did the patient have nausea/vomiting?",
symptoms_cranialNerve_paresis="Did the patient have cranial nerve paresis?",
symptoms_gait_disturbance="Did the patient have any gait disturbances?")
# Data Wrangling ----------------
df1$Group <- rep("Body", length(df1$`File name`))
df2$Group <- rep("Genu", length(df2$`File name`))
df3$Group <- rep("Splenium", length(df3$`File name`))
idx <- unique(df3l$`File name`) == df1$`File name`
dfx <- df1 %>%
select_if(df1$`File name` == unique(df3$`File name`))
combinedDF <- rbind(df1,df2,df3)
df1[!(df1$`File name` %in% df3$`File name`)]
idx <- setdiff(df1$`File name`, df2$`File name`)
df1 <- df1[,idx]
# Convert several columns to numeric ----------
cols <- grep(pattern = "CC_F|Left|Right", x = names(df), value = TRUE)
df[, cols] <- lapply(df[ , cols], function(x) suppressWarnings(as.numeric(x))) #supressWarnings is so you don't get "NAs introduced by coercion" in your console output
# Count the number of NA's in each column of a dataframe ----------
sapply(df, function(x) sum(is.na(x))) #this is different than "summary(df)" which gives you information on more than NA's
# Add column in specific spot ========
df <- add_column(df, "DMT_line2" = df$DMT_line, .after = "DMT_line")
# Add a blank column in a specific spot =====
items <- add_column(items, "Avg" = replicate(length(items$value.D1), NA) , .after = "value.D7")
# Remove blank rows in a specific column =======
df <- df[-which(df$start_pc == ""), ]
# Removing a column based on a string match =========
df <- filter(df, C != "Foo")
# Renaming a variable in a column : str_replace (this won't take exact strings)=============
df1$ID <- df$ID %>%
str_replace("pi6437934_2", "S2")
# Replace an exact string -------------
data$OpenBCI_FileName <- gsub("\\<y\\>","", data$OpenBCI_FileName) # replaces cols with "y" but won't touch something like "My Drive"
# if you have multiple strings to replace you can use the pipe operator to get everything done in one shot
df$ID <- df$ID %>%
str_replace("pi_14344894_9", "S9") %>% #oldstring, newstring
str_replace("pi_3478_03o4_15", "S15")
# Partial String match in a column----------
# __Method 1: str_detect ========
mtcars %>%
filter(str_detect(rowname, "Merc")) # this will filter everything that has the pattern "Merc" in the column
# __Method 2 : grepl ==========
dplyr::filter(mtcars, !grepl('Toyota|Mazda', type)) # Using grepl
cols <- grep(pattern = "symptoms|Previous_shunt", x = names(df), value = TRUE) #Select columns that start with symptoms
df[cols] <- as.data.frame(lapply(df[cols],function(x) {factor(x,levels=mylevels, exclude="")}))
#__Method 3 : grep (similar to grepl) =====
cols <- grep(pattern = "symptoms", x = names(df2), value = TRUE) #Select columns that start with symptoms
# Renaming a column ================
df <- df %>%
df <- df %>%
dplyr::rename(
#‘New Column Name’= ‘Old Name’
"month"= "Month"
)
# Removing NA's-----------
# __Method 1: is.na() ====
df <- df %>%
filter(!is.na(columnName)) # give me a new dataframe with only the international services.
# __Method 2: na.omit() ====
df1 <- na.omit(df1)
# Filtering our data that fits a condition: filter=========
new_df <- filter(df, service=="International") # give me a new dataframe with only the international services.
```
To rename parts of a column using `%>%` you need to use `mutate`
```{r, echo=TRUE, eval = FALSE}
# normal way to replace strings in a column
df$columnName = str_replace_all(df$columnName, "Contribution_", "")
# Using pipe
df %>%
mutate(columnName = str_replace_all(columnName, "Contribution_", ""))
```
## For Item-based data
The [`likert`](https://cran.r-project.org/web/packages/sjPlot/vignettes/sjtitemanalysis.html) package seems great. I have used it to great success in past projects.