---
title: "Histologies File QC"
date: 2021-01-27
output:
  pdf_document:
    latex_engine: xelatex
author: "Jo Lynne Rokita, Krutika Gaonkar (D3B)"
params:
  latest_histology:
    label: "latest pull of base histology"
    value: 20210126-data/ADAPT-base/pbta_histologies.csv
    input: file
  prev_histology:
    label: "previous release of base histology"
    value: 20201215-data/pbta-histologies.tsv
    input: file
  output:
    label: "release folder name"
    value: "20210126-data"
    input: string
  add_ids:
    label: "add Kids_First_Biospecimen_IDs from base not in previous release"
    value: ""
    input: string
  remove_ids:
    label: "remove Kids_First_Biospecimen_IDs from previous release in output"
    value: ""
    input: string
toc: TRUE
toc_float: TRUE
editor_options:
  chunk_output_type: inline
---
This notebook uses the v18 base histology file to create the base histology file for the v19 release. The "base histology" file is the basic clinical information manifest that the subtyping modules require in order to add OpenPBTA subtyping information.

The v18 base histology file was generated by this [script](https://github.com/d3b-center/D3b-codes/blob/master/OpenPBTA_v18_release_QC/QC_histology_v18.Rmd).

A bug in v18 mis-assigned CNS_region values; this is fixed and QC-ed here as well (see [#14](https://github.com/d3b-center/D3b-codes/issues/14); the original OpenPBTA issue is [838](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/838)).

```{r global_options, include=F}
knitr::opts_chunk$set(error = F, echo = T, warning = T, message = T)
```
# Load packages
```{r}
suppressMessages(library(emo))
suppressMessages(library(tidyverse))
```
# Directories and Files
## Directories
```{r}
# Input directory
input_dir <- file.path("input")
# soft linked previous release histology
prev_hist_file <- params$prev_histology
# adapt histology
latest_hist_file <- params$latest_histology
##--- KEEP LINK to G-DRIVE --- ##
# pathology diagnosis is needed to match tumor samples
# to broad/short histology
#path_dx <- read_sheet('https://docs.google.com/spreadsheets/d/1fDXt_YODcSAWDvyI5ISBVhUCu4b5-TFCVWMOwiPeMwk/edit#gid=0',range="pathology_diagnosis_for_subtyping") %>%
# dplyr::select(pathology_diagnosis,broad_histology, short_histology) %>%
# write_tsv(file.path(input_dir,"pathology_diagnosis_for_subtyping.tsv"))
# pathology free text diagnosis is needed to match to
# samples marked as "Other" in pathology_diagnosis
#path_free_text <- read_sheet('https://docs.google.com/spreadsheets/d/1fDXt_YODcSAWDvyI5ISBVhUCu4b5-TFCVWMOwiPeMwk/edit#gid=0',range="pathology_free_text_diagnosis_for_subtyping") %>%
# dplyr::select(pathology_free_text_diagnosis,broad_histology, short_histology)%>%
# write_tsv(file.path(input_dir,"pathology_free_text_diagnosis_for_subtyping.tsv"))
## ----------- ##
```
### Read in old base histology
```{r}
prev_hist <- read_tsv(prev_hist_file,
  # NAs are being read as logical so specifying as character here
  col_types = readr::cols(
    molecular_subtype = readr::col_character(),
    short_histology = readr::col_character(),
    integrated_diagnosis = readr::col_character(),
    broad_histology = readr::col_character(),
    Notes = readr::col_character()))

path_dx <- read_tsv(file.path(input_dir, "pathology_diagnosis_for_subtyping.tsv")) %>%
  dplyr::select(pathology_diagnosis, broad_histology, short_histology)

path_free_text <- read_tsv(file.path(input_dir, "pathology_free_text_diagnosis_for_subtyping.tsv")) %>%
  dplyr::select(pathology_free_text_diagnosis, broad_histology, short_histology)
```
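
The comment in the chunk above refers to type guessing: when every value in a column is `NA`, the reader has nothing to infer a type from and falls back to logical. The sketch below shows the same effect with base R's `read.delim()` (used here only so the example runs without readr; the notebook itself uses `read_tsv()` with `col_types`). The toy IDs and the in-memory TSV are assumptions made for illustration.

```r
# Toy TSV in which molecular_subtype is entirely NA (hypothetical data)
tsv_text <- paste0(
  "Kids_First_Biospecimen_ID\tmolecular_subtype\n",
  "BS_TOY0001\tNA\n",
  "BS_TOY0002\tNA\n"
)

# Default type guessing: an all-NA column is converted to logical
guessed <- read.delim(text = tsv_text, stringsAsFactors = FALSE)
guessed_class <- class(guessed$molecular_subtype)

# Declaring the column type up front keeps it character
declared <- read.delim(text = tsv_text, stringsAsFactors = FALSE,
                       colClasses = c(molecular_subtype = "character"))
declared_class <- class(declared$molecular_subtype)

print(c(guessed = guessed_class, declared = declared_class))
```

Declaring the type matters later in the notebook, because joining or binding a logical column with a character column of the same name can fail or coerce unexpectedly.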
### Subset new file to only the required sample IDs
We keep the sample IDs present in v18, but BS_JXF8A2A6 will be removed for v19 [#862](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/862).
```{r}
latest_hist <- read_tsv(latest_hist_file,
  # NAs are being read as logical so specifying as character here
  col_types = readr::cols(
    molecular_subtype = readr::col_character(),
    short_histology = readr::col_character(),
    integrated_diagnosis = readr::col_character(),
    broad_histology = readr::col_character(),
    Notes = readr::col_character()))

# get ids to subset
id_to_subset <- prev_hist %>%
  pull(Kids_First_Biospecimen_ID)
```
## Add ids to previous release?
```{r}
if (params$add_ids != "") {
  add_ids <- unlist(str_split(params$add_ids, ","))
  # add new ids to those from the previous release
  id_to_subset <- c(id_to_subset, add_ids)
  print(paste(toString(add_ids), "added"))
}
```
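
Because `add_ids` and `remove_ids` arrive as free-text, comma-separated strings, stray spaces or empty entries would silently fail to match any `Kids_First_Biospecimen_ID`. The following is a minimal sketch of a more forgiving parser, written in base R. `parse_id_list()` is a hypothetical helper, not part of this repository, and the IDs are toy values.

```r
# Hypothetical helper: split a comma-separated ID string,
# trimming whitespace and dropping empty or duplicated entries
parse_id_list <- function(id_string) {
  if (is.null(id_string) || !nzchar(trimws(id_string))) {
    return(character(0))
  }
  ids <- trimws(strsplit(id_string, ",", fixed = TRUE)[[1]])
  unique(ids[nzchar(ids)])
}

parsed <- parse_id_list("BS_TOY0001, BS_TOY0002,,BS_TOY0001")
print(parsed)
```

An empty parameter then yields `character(0)`, so the result can be passed to `c()` or `%in%` without a separate `!= ""` guard.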
### Subset to previous ids (and new ids if provided)
```{r}
# subset final histology
latest_hist <- latest_hist %>%
filter(Kids_First_Biospecimen_ID %in% id_to_subset)
```
## Check 1: Assess whether the rows and columns of the new file match the old
### Check 1a: assess ids overlap in new and old `r emo::ji("x")`
```{r}
source("code/util/check_rows_cols.R")
check_rows(new_hist = latest_hist,old_hist = prev_hist,
column_name = "Kids_First_Biospecimen_ID")
```
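
`check_rows()` is defined in `code/util/check_rows_cols.R`, which is not shown in this notebook, so its exact behaviour is not reproduced here. As a rough illustration of the kind of ID-overlap comparison such a check might perform, here is a sketch in base R. `compare_ids()` and the toy data frames are hypothetical.

```r
# Hypothetical sketch of an ID-overlap check between two releases
compare_ids <- function(new_hist, old_hist, column_name) {
  new_ids <- unique(new_hist[[column_name]])
  old_ids <- unique(old_hist[[column_name]])
  list(
    only_in_new = setdiff(new_ids, old_ids),  # IDs gained in the new file
    only_in_old = setdiff(old_ids, new_ids),  # IDs lost from the old file
    n_shared    = length(intersect(new_ids, old_ids))
  )
}

# Toy data standing in for prev_hist and latest_hist
old_toy <- data.frame(
  Kids_First_Biospecimen_ID = c("BS_TOY0001", "BS_TOY0002", "BS_TOY0003"),
  stringsAsFactors = FALSE
)
new_toy <- data.frame(
  Kids_First_Biospecimen_ID = c("BS_TOY0002", "BS_TOY0003", "BS_TOY0004"),
  stringsAsFactors = FALSE
)

id_report <- compare_ids(new_toy, old_toy, "Kids_First_Biospecimen_ID")
str(id_report)
```

Reporting both directions separately makes it clear whether a mismatch comes from samples missing in the new pull or from samples that were never in the previous release.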
### Check 1b: assess columns overlap in new and old `r emo::ji("white_check_mark")`
```{r}
check_cols(new_hist = latest_hist,old_hist = prev_hist)
```
## Check 2: Assess levels of histology columns `r emo::ji("x")`
### Check 2a: path_dx and path_free_text_dx are used for matching later, so they should have the same values in the new histology
```{r}
check_values(new_hist = latest_hist,old_hist = prev_hist,
column_name = "pathology_diagnosis",output_dir = params$output)
```
```{r}
check_values(new_hist = latest_hist,old_hist = prev_hist,
column_name = "pathology_free_text_diagnosis",output_dir = params$output)
```
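
`check_values()` is also sourced from the utility script and its implementation is not shown here. One plausible way to compare the levels of a column between releases is to tabulate each value in both files and keep the rows where the counts differ. The sketch below does that in base R. `compare_value_counts()`, `label_missing()` and the toy composition values are hypothetical.

```r
# Hypothetical helper: make missing values visible as their own level
label_missing <- function(x) {
  x <- as.character(x)
  x[is.na(x)] <- "<NA>"
  x
}

# Hypothetical sketch: per-value counts in old vs new, keeping only changes
compare_value_counts <- function(new_hist, old_hist, column_name) {
  new_vals <- label_missing(new_hist[[column_name]])
  old_vals <- label_missing(old_hist[[column_name]])
  values <- sort(union(old_vals, new_vals))
  count_in <- function(vals) {
    vapply(values, function(v) sum(vals == v), integer(1), USE.NAMES = FALSE)
  }
  report <- data.frame(value = values,
                       old_n = count_in(old_vals),
                       new_n = count_in(new_vals),
                       stringsAsFactors = FALSE)
  report[report$old_n != report$new_n, , drop = FALSE]
}

# Toy data: one term was renamed between releases
old_toy <- data.frame(composition = c("Term A", "Term A", "Term B", NA),
                      stringsAsFactors = FALSE)
new_toy <- data.frame(composition = c("Term A", "Term A", "Term B renamed", NA),
                      stringsAsFactors = FALSE)

changed <- compare_value_counts(new_toy, old_toy, "composition")
print(changed)
```

A renamed term shows up as a pair of rows, one value dropping to zero and another appearing, which is usually easier to read than two separate frequency tables.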
### Check 2b: Normals should not have path_dx, int_dx, molecular_subtype, or broad/short_hist `r emo::ji("x")`
```{r}
latest_hist_normals <- latest_hist %>%
filter(sample_type=="Normal")
prev_hist_normals <- prev_hist %>%
filter(sample_type=="Normal")
key_column_name = c("pathology_free_text_diagnosis","pathology_diagnosis","primary_site")
distinct(prev_hist_normals[,key_column_name])
distinct(latest_hist_normals[,key_column_name])
```
## Check 3: Tables of per-column changes
### Check 3a Experimental strategy `r emo::ji("x")`
```{r}
check_values(new_hist = latest_hist,old_hist = prev_hist,
column_name = "experimental_strategy",output_dir = params$output)
```
### Check 3b Sample Type `r emo::ji("x")`
```{r}
check_values(new_hist = latest_hist,old_hist = prev_hist,
column_name = "sample_type",output_dir = params$output)
```
### Check 3c Tumor Descriptor `r emo::ji("x")`
```{r}
check_values(new_hist = latest_hist,old_hist = prev_hist,
column_name = "tumor_descriptor",output_dir = params$output)
```
### Check 3d Composition `r emo::ji("x")`
```{r}
# compare composition terms so that new terms can be matched to the previous composition terms
check_values(new_hist = latest_hist,old_hist = prev_hist,
column_name = "composition",output_dir = params$output)
```
### Check 3f RNA library `r emo::ji("x")`
```{r}
check_values(new_hist = latest_hist,old_hist = prev_hist,
column_name = "RNA_library",output_dir = params$output)
```
### Check 3g: Cohort `r emo::ji("white_check_mark")`
```{r}
check_values(new_hist = latest_hist,old_hist = prev_hist,
column_name = "cohort",output_dir = params$output)
```
### Check 3h: Sample and aliquot IDs - any changes? `r emo::ji("x")`
```{r}
check_values(new_hist = latest_hist,old_hist = prev_hist,
column_name = "sample_id",output_dir = params$output)
```
```{r}
check_values(new_hist = latest_hist,old_hist = prev_hist,
column_name = "aliquot_id", output_dir = params$output)
```
### Check 3i: Sequencing Center `r emo::ji("x")`
```{r}
check_values(new_hist = latest_hist,old_hist = prev_hist,
column_name = "seq_center",output_dir = params$output)
```
### Check 3j: primary_site `r emo::ji("x")`
```{r}
check_values(new_hist = latest_hist,old_hist = prev_hist,
column_name = "primary_site",output_dir = params$output)
```
## Update CNS_region
The JSON file was generated from the CNS_region updates ticket [838](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/838).
#### Match CNS_region to primary_site and update
```{r}
source("code/util/primary_site_matched_CNS_region.R")
latest_hist <- get_CNS_region(histology=latest_hist,
CNS_match_json=file.path(input_dir,"CNS_primary_site_match.json"))
```
Which samples had different CNS_region in v18?
```{r}
diff_cns <- latest_hist %>%
  left_join(prev_hist[, c("Kids_First_Biospecimen_ID", "CNS_region")],
            by = c("Kids_First_Biospecimen_ID"), suffix = c(".v19", ".v18")) %>%
  dplyr::select(Kids_First_Biospecimen_ID, CNS_region.v18, CNS_region.v19, primary_site) %>%
  dplyr::filter(CNS_region.v18 != CNS_region.v19)
diff_cns
```
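
One caveat about the comparison above: `!=` returns `NA` when either side is `NA`, and `dplyr::filter()` drops rows whose condition is `NA`. Samples whose CNS_region changed from or to a missing value would therefore not appear in `diff_cns`. The sketch below illustrates the difference in base R on toy data. `differs()` is a hypothetical helper, and whether such rows exist in the real files is not something this example asserts.

```r
# Toy data (hypothetical): the third row gains a region in v19
toy_regions <- data.frame(
  Kids_First_Biospecimen_ID = c("BS_TOY0001", "BS_TOY0002", "BS_TOY0003", "BS_TOY0004"),
  CNS_region.v18 = c("Hemispheric", "Hemispheric", NA, NA),
  CNS_region.v19 = c("Hemispheric", "Mixed", "Other", NA),
  stringsAsFactors = FALSE
)

# Plain `!=`: comparisons involving NA give NA and the row is dropped
plain_diff <- toy_regions[
  which(toy_regions$CNS_region.v18 != toy_regions$CNS_region.v19), ]

# NA-aware comparison: a change from/to NA counts as a difference,
# while NA on both sides counts as unchanged
differs <- function(a, b) {
  (is.na(a) != is.na(b)) | (!is.na(a) & !is.na(b) & a != b)
}
na_aware_diff <- toy_regions[
  differs(toy_regions$CNS_region.v18, toy_regions$CNS_region.v19), ]

print(plain_diff$Kids_First_Biospecimen_ID)
print(na_aware_diff$Kids_First_Biospecimen_ID)
```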
```{r}
diff_cns$CNS_region.v19 %>% table()
```
135 samples were incorrectly assigned a CNS_region in v18. Of these, 129 should be 'Mixed' and 6 should be 'Other'; this was fixed with an updated `cns_region_check()` function in this notebook.
## Update broad_histology and short_histology
Match by pathology_diagnosis, and by pathology_free_text_diagnosis for samples diagnosed as "Other".
#### By path_free_text for "Other" diagnosed
Only samples with 'Other' in pathology_diagnosis need to be matched by path_free_text.
```{r}
latest_hist<- dplyr::select(latest_hist,c(-broad_histology,-short_histology))
latest_hist_other <- latest_hist %>%
dplyr::filter(pathology_diagnosis == "Other") %>%
left_join(path_free_text,by="pathology_free_text_diagnosis")
```
#### By path_dx for all tumors other than "Other"
Remove the samples with 'Other' in pathology_diagnosis that were already matched above, match the rest by pathology_diagnosis, then bind the two sets back together.
```{r}
latest_hist <- latest_hist %>%
  dplyr::filter(pathology_diagnosis != "Other" |
                  # add Normals
                  sample_type == "Normal") %>%
  left_join(path_dx, by = "pathology_diagnosis") %>%
  bind_rows(latest_hist_other)
```
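
The two-step matching above (free text for "Other", pathology_diagnosis for everything else) can also be expressed as a single per-row lookup. The sketch below shows that equivalent idea in base R on toy data. It is not the notebook's method: unlike the filter/join/bind approach, it keeps every row in its original order, including any tumor rows with a missing pathology_diagnosis. All table contents and names prefixed with `toy_` are hypothetical.

```r
# Toy histology rows (hypothetical): one regular tumor, one "Other", one Normal
toy_hist <- data.frame(
  Kids_First_Biospecimen_ID = c("BS_TOY0001", "BS_TOY0002", "BS_TOY0003"),
  sample_type = c("Tumor", "Tumor", "Normal"),
  pathology_diagnosis = c("Toy diagnosis A", "Other", NA),
  pathology_free_text_diagnosis = c("toy text a", "toy text b", NA),
  stringsAsFactors = FALSE
)

# Toy lookup tables standing in for path_dx and path_free_text
toy_path_dx <- data.frame(pathology_diagnosis = "Toy diagnosis A",
                          broad_histology = "Toy broad A",
                          stringsAsFactors = FALSE)
toy_free_text <- data.frame(pathology_free_text_diagnosis = "toy text b",
                            broad_histology = "Toy broad B",
                            stringsAsFactors = FALSE)

# Rows diagnosed as "Other" are looked up by free text, the rest by diagnosis
is_other <- !is.na(toy_hist$pathology_diagnosis) &
  toy_hist$pathology_diagnosis == "Other"

by_free_text <- toy_free_text$broad_histology[
  match(toy_hist$pathology_free_text_diagnosis,
        toy_free_text$pathology_free_text_diagnosis)]
by_path_dx <- toy_path_dx$broad_histology[
  match(toy_hist$pathology_diagnosis, toy_path_dx$pathology_diagnosis)]

toy_hist$broad_histology <- ifelse(is_other, by_free_text, by_path_dx)
print(toy_hist[, c("Kids_First_Biospecimen_ID", "broad_histology")])
```

Normals have no diagnosis in either lookup table, so their `broad_histology` stays `NA`, which is what Check 2b expects.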
### Check broad_histology
```{r}
check_values(new_hist = latest_hist,old_hist = prev_hist,
column_name = "broad_histology",output_dir = params$output)
```
### Check short_histology
```{r}
check_values(new_hist = latest_hist,old_hist = prev_hist,
column_name = "short_histology",output_dir = params$output)
```
## Remove ids from previous release?
```{r}
if (params$remove_ids != "") {
  remove_ids <- unlist(str_split(params$remove_ids, ","))
  # remove ids carried over from the previous release
  latest_hist <- latest_hist %>%
    filter(!Kids_First_Biospecimen_ID %in% remove_ids)
  print(paste(toString(remove_ids), "removed"))
}
```
# Write new file
```{r}
write.table(latest_hist, file.path(params$output, "pbta-histologies-base.tsv"),
            sep = "\t", quote = FALSE, col.names = TRUE, row.names = FALSE)
```
```{r}
sessionInfo()
```