```{r setup, include=FALSE}
knitr::opts_chunk$set(echo=TRUE, message=FALSE, warning=FALSE)
```
# Importing microbiome data
This section demonstrates how to import taxonomic profiling data into R.
## Data access
*ADHD-associated changes in gut microbiota and brain in a mouse model*
Tengeler AC _et
al._ (2020) [**Gut microbiota from persons with
attention-deficit/hyperactivity disorder affects the brain in
mice**](https://doi.org/10.1186/s40168-020-00816-x). Microbiome
8:44.
In this study, mice were colonized with microbiota from participants
with ADHD (attention deficit hyperactivity disorder) and from healthy
participants. The aim of the study was to assess whether the mice
display ADHD behaviors after being inoculated with ADHD microbiota,
which would suggest a role of the microbiome in ADHD pathology.

Download the data from the
[data](https://github.com/microbiome/course_2022_radboud/tree/main/data)
subfolder.
## Importing microbiome data in R
**Import example data** by modifying the examples in the online book
section on [data exploration and
manipulation](https://microbiome.github.io/OMA/data-introduction.html#loading-experimental-microbiome-data).
The data files in our example are in the _biom_ format, which is a
standard file format for microbiome data. Other file formats exist as
well, and import details vary by platform. Here, we import _biom_ data
files into a specific data container (structure) in R,
_TreeSummarizedExperiment_ (TreeSE) [Huang et
al. (2020)](https://f1000research.com/articles/9-1246).
The data container provides the basis for downstream data analysis in
the _miaverse_ data science framework.
<img src="https://raw.githubusercontent.com/FelixErnst/TreeSummarizedExperiment/2293440c6e70ae4d6e978b6fdf2c42fdea7fb36a/vignettes/tse2.png" width="100%"/>
**Figure sources:**
**Original article**
- Huang R _et al_. (2021) [TreeSummarizedExperiment: a S4 class
for data with hierarchical structure](https://doi.org/10.12688/f1000research.26669.2). F1000Research 9:1246.

**Reference Sequence slot extension**
- Lahti L _et al_. (2020) [Upgrading the R/Bioconductor ecosystem for microbiome
research](https://doi.org/10.7490/f1000research.1118447.1). F1000Research 9:1464 (slides).
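
To make the container structure concrete, the following minimal sketch builds a toy TreeSE object by hand. It assumes that the _TreeSummarizedExperiment_ package (and its Bioconductor dependencies) is installed; all taxon names, sample names, and counts are made up for illustration and are unrelated to the example data.

```r
library(TreeSummarizedExperiment)

# Toy count matrix: 3 taxa (rows) x 2 samples (columns); values are invented
counts <- matrix(c(10, 0, 5, 3, 8, 1), nrow = 3,
                 dimnames = list(c("taxon1", "taxon2", "taxon3"),
                                 c("sampleA", "sampleB")))

# Feature annotations (rowData) and sample annotations (colData)
taxa <- S4Vectors::DataFrame(Genus = c("g1", "g2", "g3"),
                             row.names = rownames(counts))
samples <- S4Vectors::DataFrame(group = c("case", "control"),
                                row.names = colnames(counts))

# Combine the pieces into a single container
toy_tse <- TreeSummarizedExperiment(assays = list(counts = counts),
                                    rowData = taxa,
                                    colData = samples)

# Inspect dimensions (features x samples) and the available assays
print(dim(toy_tse))
print(SummarizedExperiment::assayNames(toy_tse))
```

The abundance table, the feature annotations, and the sample annotations travel together in one object, so subsetting the object keeps all three in sync.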
### Example solution
Let us first import the biom file into R as a TreeSE container.
```{r import, message=FALSE, warning=FALSE}
# Load the mia R package
library(mia)
# Defining file paths
## Biom file (taxonomic profiles)
biom_file_path <- "data/Tengeler2020/Aggregated_humanization2.biom"
# Import the data into SummarizedExperiment container
se <- loadFromBiom(biom_file_path)
# Convert this data to TreeSE container (no direct importer exists)
tse <- as(se, "TreeSummarizedExperiment")
```
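
Because the import relies on a relative path, it can help to confirm that the file is where the code expects it before calling the importer. The following is a minimal base R sketch; `require_file()` is a hypothetical helper introduced here for illustration and is not part of _mia_.

```r
# Hypothetical helper (not part of mia): stop early with a readable message
# if a required input file is missing, otherwise return the path unchanged.
require_file <- function(path) {
  if (!file.exists(path)) {
    stop("File not found: ", path,
         " (working directory: ", getwd(), ")", call. = FALSE)
  }
  path
}

# Demonstration with a temporary file that certainly exists
tmp <- tempfile(fileext = ".biom")
file.create(tmp)
checked <- require_file(tmp)

# A missing file triggers an error, which we capture here instead of stopping
result <- tryCatch(require_file(file.path(tempdir(), "no_such_file.biom")),
                   error = function(e) conditionMessage(e))
print(result)
```

In the example solution, the helper could wrap the path, as in `loadFromBiom(require_file(biom_file_path))`.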
Check and clean up rowData (information on the taxonomic features).
```{r rowdata, message=FALSE, warning=FALSE}
# Investigate the rowData of this data object
print(head(rowData(tse)))
# We notice that the rowData fields do not have descriptive names.
# Hence, let us rename the columns in rowData
names(rowData(tse)) <- c("Kingdom", "Phylum", "Class", "Order",
                         "Family", "Genus")
# We also notice that the taxa names are of the form "c__Bacteroidia" etc.
# Go through every column of the DataFrame and remove the pattern
# '.*[kpcofg]__', where [kpcofg] matches any one of the listed characters
# and .* matches any sequence of characters.
rowdata_modified <- BiocParallel::bplapply(rowData(tse),
                                           FUN = stringr::str_remove,
                                           pattern = '.*[kpcofg]__')
# The Genus level has an additional '\"', so let us remove that as well
rowdata_modified <- BiocParallel::bplapply(rowdata_modified,
                                           FUN = stringr::str_remove,
                                           pattern = '\"')
# rowdata_modified is a list, so convert it back to DataFrame format
# and assign the cleaned data back to the TSE rowData
rowData(tse) <- DataFrame(rowdata_modified)
# Recheck rowData after the modifications
print(head(rowData(tse)))
```
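
The effect of the two patterns can be tried on a few strings without loading any data. This is a minimal base R sketch with made-up taxonomy strings in the same style as the rowData above; it uses `sub()` and `gsub()` as stand-ins for the _stringr_ calls.

```r
# Made-up taxonomy strings in the same style as the imported rowData
taxa <- c("k__Bacteria", "c__Bacteroidia", "g__\"Prevotella\"")

# Remove everything up to and including the rank prefix (k__, p__, c__, ...)
no_prefix <- sub(".*[kpcofg]__", "", taxa)

# Remove all double quotes; gsub() replaces every match, whereas
# sub() and stringr::str_remove() remove only the first one
cleaned <- gsub("\"", "", no_prefix, fixed = TRUE)
print(cleaned)
```

If a value could contain more than one quote character, `stringr::str_remove_all()` would be the closer equivalent of the `gsub()` call shown here.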
Next, let us add sample information (colData) to our TreeSE object.
```{r coldata, message=FALSE, warning=FALSE}
## Sample phenodata
sample_meta_file_path <- "data/Tengeler2020/Mapping_file_ADHD_aggregated.csv"
# Check what type of data it is
# read.table(sample_meta_file_path)
# It seems like a comma separated file and it does not include headers
# Let us read the file; note that sample names are in the first column
sample_meta <- read.table(sample_meta_file_path, sep=",", header=FALSE, row.names=1)
# Check the data
print(head(sample_meta))
# Add headers for the columns (as they seem to be missing)
colnames(sample_meta) <- c("patient_status", "cohort",
"patient_status_vs_cohort", "sample_name")
# Add this sample data to colData of the taxonomic data object
# Note that the data must be given in a DataFrame format (required for our purposes)
colData(tse) <- DataFrame(sample_meta)
# Check the colData after modifications
print(head(colData(tse)))
```
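
Before assigning sample metadata, it is worth confirming that the metadata rows correspond to the samples in the data object and appear in the same order. The following base R sketch illustrates such a check; the sample names and status values are invented, and a plain character vector stands in for `colnames(tse)`.

```r
# Stand-in for colnames(tse); the sample names are invented
sample_ids <- c("S1", "S2", "S3")

# Toy metadata whose rows are in a different order than the samples
meta <- data.frame(patient_status = c("Control", "ADHD", "ADHD"),
                   row.names = c("S3", "S1", "S2"))

# Every sample must have a metadata row
stopifnot(all(sample_ids %in% rownames(meta)))

# Reorder the metadata rows to follow the sample order
meta_ordered <- meta[sample_ids, , drop = FALSE]
print(rownames(meta_ordered))
```

With the real data, the same idea would apply `colnames(tse)` to index `sample_meta` before the `colData` assignment.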
Add the phylogenetic tree (rowTree).
```{r rowtree, message=FALSE, warning=FALSE}
## Phylogenetic tree
tree_file_path <- "data/Tengeler2020/Data_humanization_phylo_aggregation.tre"
# Read the tree file
tree <- ape::read.tree(tree_file_path)
# Add tree to rowTree
rowTree(tse) <- tree
```
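
A tree is only useful if its tip labels correspond to the features in the data. The sketch below, which assumes the _ape_ package is installed, shows one way to compare the two; the toy Newick tree and the feature names are invented, and a character vector stands in for `rownames(tse)`.

```r
library(ape)

# Toy tree in Newick format with three tips; the labels are invented
toy_tree <- read.tree(text = "((taxon1,taxon2),taxon3);")

# Stand-in for rownames(tse)
feature_ids <- c("taxon1", "taxon2", "taxon4")

# Features without a matching tip, and tips without a matching feature
missing_in_tree <- setdiff(feature_ids, toy_tree$tip.label)
unused_tips <- setdiff(toy_tree$tip.label, feature_ids)
print(missing_in_tree)
print(unused_tips)
```

Two empty results would indicate that features and tips match one to one.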
We can save the data object as follows.
```{r save, message=FALSE, warning=FALSE}
saveRDS(tse, file="tse.rds")
```
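
`saveRDS()` stores a single R object, and `readRDS()` restores it under whatever name you choose. A minimal base R sketch of the round trip, using a toy list in place of the TreeSE object and a temporary file in place of `tse.rds`:

```r
# Toy object standing in for the TreeSE container
toy <- list(counts = matrix(1:4, nrow = 2), note = "example")

# Write the object to a temporary file and read it back
path <- tempfile(fileext = ".rds")
saveRDS(toy, file = path)
restored <- readRDS(path)

# The restored object should be identical to the original
print(identical(toy, restored))
```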
### Loading readily processed data
Alternatively, you can load the already processed data set.
This lets us skip the data import step, assuming that the data has
been appropriately preprocessed and is available in the TreeSE
container.
```{r, message=FALSE, warning=FALSE, eval=FALSE}
tse <- readRDS("data/Tengeler2020/tse.rds")
```
In addition to our example data, further demonstration data sets are
available in the TreeSE data container: see [OMA demo data
sets](https://microbiome.github.io/OMA/containers.html#example-data).