-
Notifications
You must be signed in to change notification settings - Fork 0
/
01_tidyng_data.Rmd
193 lines (124 loc) · 7.03 KB
/
01_tidyng_data.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
---
title: "Tidying Data with dplyr and tidyr"
author: "Axel R"
date: "2024-01-20"
output:
rmdformats::robobook:
self_contained: true
thumbnails: true
lightbox: true
gallery: false
highlight: tango
---
```{r rmdformats, echo=FALSE, message=FALSE, warning=FALSE}
library(rmdformats)
```
**Remark**: The material is based on an activity assigned to us by PhD. Renan Escalante.
**Objective:** In this exercise, you will learn how to tidy a dataset using the dplyr and tidyr packages in R. Tidying data is an essential step in data preparation for analysis. We will use a gene expression dataset as an example.
**Prerequisites:** You should have R installed and be familiar with basic data manipulation in R.
**Dataset:** We will use the gene expression dataset provided in the blog post.
# 1. Load the Data
Load the gene expression dataset using the `readr` package and view its dimensions
```{r Load the data, message=FALSE, warning=FALSE}
library(readr)
data <- read_csv("microarray_expression_data.csv", col_names = TRUE)
```
1. ***What is the purpose of loading the dataset in this step?***
The `readr` package is a part of the tidyverse in R and is designed for **efficient reading of flat-file data**, such as CSV or TSV files, into R data frames.
2. ***What does the `dim()` function do? How about `str()`?***
`dim` function will tell us the dimensions of the dataset we're using. On the other hand, `str()` is used to display the structure of the dataset, providing an overview of the internal structure of the object.
```{r Structure of Data}
dim(data)
str(data)
```
In this case, we have a dataset (`data`) with 5,537 rows and 40 columns. Also, we have several variables in form of character and number.
# 2. Understanding the Untidy Data
Before we start tidying the data, let's understand what makes this dataset untidy. Refer to the blog post to answer the following questions:
1. ***What are the main reasons that make this dataset untidy?***
The column called `NAME` has several empty spaces and it has information that doesn't correspond just to the name of the gene, but instead other characteristics that should be encapsulated in other columns.
2. ***How are the column headers (e.g., `G0.05`, `N0.3`) structured in this dataset?***
They are structured from lowest to highest concentration (e.g. G0.05, G0.1, G0.15, G0.2 G0.3).
# 3. Separate Multiple Variables in One Column
The `NAME` column contains multiple pieces of information separated by '\|\|'. We need to split this column into separate columns. Perform this operation using the tidyr package's `separate` function.
```{r separate function, eval=FALSE, message=FALSE, warning=FALSE, include=FALSE}
help("separate")
```
```{r separating data, message=FALSE, warning=FALSE}
library(tidyr)
data_clean <- separate(data = data, col = NAME, into = c("Name", "Function", "Details", "Mutation", "Position"), sep = "\\|\\|", remove = T)
```
1. ***What are the five new columns created after using the `separate` function?***
I created the columns `Name`, `Function`, `Details`, `Mutation` and `Position`
2. ***What does the `sep` argument specify in the `separate` function?***
How the `NAME` column is gonna be split.
# 4. Remove Whitespace
Some columns have leading and trailing whitespace. Clean these columns
```{r help mutate, eval=FALSE, include=FALSE}
help(mutate)
```
```{r removing whitespaces, message=FALSE, warning=FALSE}
library(dplyr)
library(stringr)
data_clean <- data_clean %>%
mutate(Name = str_squish(Name))
```
1. ***Why is it important to remove leading and trailing whitespace from columns?***
Because empty spaces can cause errors when running other functions, as well as making the graphs look less clean.
2. ***How is the `mutate` function used in this step?***
We are using the **`mutate()`** function to modify a column in the dataframe **`data_clean`**.
- **`Name`**: This is the name of the column that will be modified.
- **`str_squish(Name)`**: The **`str_squish()`** function from the stringr package is applied to the **`Name`** column. It removes leading, trailing, and excess internal whitespace from character strings.
# 5. Remove Unnecessary Columns
Some columns are not needed for our analysis. Remove columns '`number`', '`GID`', '`YORF`', and '`GWEIGHT`'
```{r Removing columns}
data_clean <- data_clean %>%
select(c(-Position, -GID, -YORF, -GWEIGHT))
# str(data_clean)
```
1. ***Why is it important to remove unnecessary columns in data cleaning?***
It is a crucial step in the data cleaning process as it improves the overall quality, usability, and efficiency of the dataset, making it more suitable for analysis and modeling.
2. ***How do tidyselect functions work in removing columns?***
`select()` function is used to select or drop columns from a dataframe. **`c(-Position, -GID, -YORF, -GWEIGHT)`** is a vector specifying the columns to be dropped. The **`-`** sign before each column name indicates that those columns should be removed.
# 6. Reshape Data
The column headers like `G0.05`, `N0.2`, etc., represent multiple variables.
```{r pivot_longer info, eval=FALSE, message=FALSE, warning=FALSE, include=FALSE}
help("pivot_longer")
vignette("pivot")
relig_data <- relig_income
relig_data_clean <- relig_income %>%
pivot_longer(
cols = !religion,
names_to = "income",
values_to = "count"
)
```
```{r}
data_clean <- data_clean %>%
pivot_longer(
cols = starts_with(c("G0", "N0", "P0", "S0", "L0", "U0")),
names_to = "Samples",
values_to = "Rate"
)
```
1. ***What does the `pivot_longer` function do in this step?***
`pivot_longer()` makes datasets **longer** by increasing the number of rows and decreasing the number of columns.
2. ***What are the new columns created after using the `pivot_longer` function?***
In this case, we take all the information that came from the columns `G0.05`, `N0.2`, etc. and merge it in a new column called "`Info`" [`Info` will contain the name of the header (`G0.05`, `N0.1`, etc.), and `Count` will have the value that correspond to each]
```{r data_pivoted, echo=TRUE}
View(data_clean)
```
# 7. Separate Variables in One Column
The `sample` column contains two variables: nutrient and rate.
1. ***What does the `separate` function do in this step?***
2. ***Why do we use `convert = TRUE` in the `separate` function?***
# 8. Visualize the Tidied Data
Visualize the tidied data using `ggplot2` for a gene of interest (e.g., "LEU1") and a biological process of interest (e.g., "leucine biosynthesis").
1. ***What is the purpose of visualizing data in this step?***
2. ***How does `ggplot2` help in creating visualizations?***
# 9. Explore Other Biological Processes
Explore another biological process (e.g., "sulfur metabolism") using ggplot2 and visualize the gene expression patterns.
1. ***Why is it important to explore different biological processes in the dataset?***
2. ***How can you use `facet_wrap` function do in this step?***
# 10. Proffesor's Script
```{r eval=FALSE, message=FALSE, warning=FALSE, include=FALSE}
```