---
title: "Preprocessing"
author:
  - name: "Nathan Constantine-Cooke"
    corresponding: true
    url: https://scholar.google.com/citations?user=2emHWR0AAAAJ&hl=en&oi=ao
    affiliations:
      - ref: CGEM
      - ref: HGU
  - name: "Karla Monterrubio-Gómez"
    url: https://scholar.google.com/citations?user=YmyxSXAAAAAJ&hl=en
    affiliations:
      - ref: HGU
      - ref: CGEM
  - name: "Catalina A. Vallejos"
    url: https://scholar.google.com/citations?user=lkdrwm0AAAAJ&hl=en&oi=ao
    affiliations:
      - ref: HGU
#comments:
#  giscus:
#    repo: quarto-dev/quarto-docs
editor_options:
  markdown:
    wrap: 72
---
## Introduction
```{R Package load}
#| message: false
#| warning: false
#| cache: false
set.seed(123)
### Required packages
library(tidyverse) # ggplot2, dplyr, and magrittr
library(knitr) # Markdown utilities
library(pander) # Pretty markdown rendering
library(datefixR) # Standardising dates
library(lubridate) # Date handling
library(patchwork) # Arranging plots
```
This page details preprocessing steps undertaken prior to model fitting.
These steps include examining data quality, performing data cleaning,
and deriving the study cohort.
The biomarker (FC and CRP) data used in this pipeline have been
primarily obtained from TRAK, a system used by NHS Lothian for the
electronic ordering of laboratory tests. However, we also use
phenotyping data manually curated by the Lees group for previous studies
(see @Jenkinson2020).
Prior to this report, all subject CHI (community health index) numbers,
which uniquely identify a patient when they interact with NHS Scotland
services, were pseudonymised. Each CHI has been replaced with a
unique random number.
### Inclusion/exclusion criteria definitions {#inclusionexclusion-criteria-definitions}
The inclusion criteria for this study can broadly be categorised into
two classifications: baseline and longitudinal. Baseline criteria can be
applied using information known at diagnosis whilst longitudinal
criteria are based on biomarker measurements taken over time.
The longitudinal criteria for each biomarker (FC and CRP) are considered
independently. If a subject meets the below criteria for FC but not CRP,
then only the FC for the subject will be modelled and vice versa.
::: panel-tabset
#### Baseline
##### Diagnosis of Inflammatory bowel disease
As inflammatory bowel disease (IBD) is the disease of interest for this
study, subjects are required to have a confirmed diagnosis of either
Crohn's disease, ulcerative colitis, or inflammatory bowel disease
unclassified.
##### Diagnosis date
Study subjects are required to have been diagnosed with IBD between
October 2004 and December 2017. The lower bound is required as FC tests
were not introduced until 2005 and we require study subjects to have an
FC measurement available close to diagnosis (see the longitudinal tab).
The upper bound is required to ensure subjects have the opportunity to
have at least five years of follow-up. The most recent FC observation in
this dataset is from January 2023, which informed the choice of this
cut-off.
#### Longitudinal
##### Diagnostic measurement
We require study subjects to have a biomarker measurement taken within
±90 days of the reported date of diagnosis.
##### At least three non-censored measurements
To ensure stability and interpretability, study subjects are also
required to have at least three non-censored biomarker measurements
taken across follow-up. Like most biomarkers, the biomarkers used in
this study are subject to censoring. However, censored observations
make it impossible to detect changes over time and we therefore do not
wish to be reliant on them.
:::
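As a rough illustration of how the two longitudinal criteria combine, the sketch below checks them on toy data. The data frame `toy_measurements`, its columns, and the helper `meets_longitudinal()` are hypothetical and are not part of the pipeline. The sketch assumes a measurement counts as censored when it sits at a limit of detection (20 or 1250 µg/g).

```r
library(dplyr)

# Hypothetical toy data: days from diagnosis and FC value for three subjects
toy_measurements <- data.frame(
  ids = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3),
  days_from_diag = c(-10, 200, 400, 900, 5, 300, 700, 150, 400, 800),
  value = c(300, 20, 450, 120, 20, 1250, 80, 90, 110, 60)
)

# Hypothetical helper: one row per subject with a flag for each criterion
meets_longitudinal <- function(df, lower = 20, upper = 1250, window = 90) {
  df %>%
    group_by(ids) %>%
    summarise(
      # Criterion 1: a measurement within +/- `window` days of diagnosis
      has_diagnostic = any(abs(days_from_diag) <= window),
      # Criterion 2: count measurements strictly inside the detection limits
      n_noncensored = sum(value > lower & value < upper),
      .groups = "drop"
    ) %>%
    mutate(eligible = has_diagnostic & n_noncensored >= 3)
}

eligibility <- meets_longitudinal(toy_measurements)
print(eligibility)
```

In this toy example only subject 1 is eligible: subject 2 has too few non-censored values and subject 3 has no measurement close to diagnosis.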
### Datasets
```{R Read files}
if (file.exists("/.dockerenv")) { # Check if running in Docker
# Assume igmm/Vallejo-predict/libdr/ is passed to the data volume
prefix <- "data/"
} else {
# Assume running outside of a Docker container and the IGC(/IGMM) datastore is
# mounted at /Volumes
prefix <- "/Volumes/igmm/cvallejo-predicct/libdr/"
}
fcal.pheno <- read.csv(file.path(prefix, "2024-10-03/fcal-cleaned.csv"))
# Extract from TRAK which also now introduces CRP.
labs <- read.csv(file.path(prefix, "2024-10-03/markers-cleaned.csv"))
fcal <- subset(labs, TEST == "f-Calprotectin-ALP")
# Add sex and diagnosis type from fcal.pheno
fcal <- fcal.pheno %>%
distinct(ids, .keep_all = TRUE) %>%
select(ids, sex, diagnosis, diagnosis_date) %>%
merge(x = fcal, by = "ids", all.x = TRUE, all.y = FALSE)
crp <- subset(labs, TEST == "C-Reactive Prot")
updated <- read.csv(file.path(prefix, "2024-10-03/allPatientsNathanCleaned.csv"))
outcomes <- read.csv(file.path(prefix, "2024-10-03/cd-cleaned.csv"))
non.ibd <- read.csv(file.path(prefix, "2024-10-24/non-ibd.csv"))
```
Multiple datasets are used in this report with varying degrees of
previous curation.
- `markers-cleaned.csv` is a 2024 extraction from TRAKcare describing
`r nrow(labs)` test results (FC: `r nrow(fcal)`; CRP:
`r nrow(crp)`). Each row represents a test result with columns
indicating the corresponding subject ID, the test type (FC or CRP),
the time of the test, and the test result. This dataset contains the
most recent data of all the datasets. However, there has been
no manual curation of these data, and no information about the
patient's characteristics is available (other than their ID).
- `fcal-cleaned.csv` does not describe CRP data. However, these data
have been curated by the Lees group. It contains
`r nrow(fcal.pheno)` FC test results, and each row also provides the
sex, IBD type, date of diagnosis, and date of death (if applicable) for
the corresponding subject.
- `allPatientsNathanCleaned.csv` provides sex, IBD type, and diagnosis
date for most (but not all) subjects in `markers-cleaned.csv` not
described by `fcal-cleaned.csv`.
- `cd-cleaned.csv` describes CD patient data (n = `r nrow(outcomes)`).
In addition to the same basic patient characteristic variables in
other datasets, this dataset also describes surgical outcomes and
biologic treatments prescribed to the subject.
- `non-ibd.csv` describes subjects who were found to not have IBD during the
phenotyping process. This dataset is used to remove subjects who do not have
IBD from the analysis.
::: callout-note
The "cleaned" in the above file names refers to the CHI numbers having
been replaced. No further processing of the data has been undertaken for
this study prior to the steps outlined in this report.
:::
## Subject data
We will create a "dictionary" which, for each subject, lists their
subject ID, their IBD type, and their date of diagnosis. In addition to
ensuring inclusion criteria are met, date of diagnosis will be used to
retime all biomarker measurements of a subject such that $t=0$
corresponds to their diagnosis date. IBD types are used to test if
clusters are enriched by IBD type and may be used to adjust the
clustering.
There are multiple potential sources for date of diagnosis and IBD type:
the CD outcomes dataset, a demographics dataset maintained by the Lees
group, and a raw TRAK extraction. For every subject listed in at least
one of these datasets, we have tried to extract their date of diagnosis
and diagnosis type. We have assumed a hierarchy of datasets where
manually curated datasets are preferred for extracting data over a
dataset generated with minimal human involvement.
### Diagnosis and diagnosis date
```{R Create subject dictionary}
ids <- unique(c(outcomes$ids, fcal$ids, updated$ids))
diagnosis <- character()
date.of.diag <- character()
for (id in ids) {
if (id %in% outcomes$ids) {
# Outcomes data only contains CD subjects
diagnosis <- c(diagnosis, "Crohn's Disease")
date.of.diag <- c(
date.of.diag,
subset(outcomes, ids == id)$diagnosisDate
)
} else if (id %in% updated$ids) {
diagnosis <- c(diagnosis, subset(updated, ids == id)[1, "diagnosis"])
date.of.diag <- c(
date.of.diag,
subset(updated, ids == id)[1, "diagnosisDate"]
)
} else if (id %in% fcal$ids) {
diagnosis <- c(diagnosis, subset(fcal, ids == id)[1, "diagnosis"])
date.of.diag <- c(
date.of.diag,
subset(fcal, ids == id)[1, "diagnosis_date"]
)
} else {
diagnosis <- c(diagnosis, "Unknown")
date.of.diag <- c(
date.of.diag,
"Unknown"
)
}
}
dict <- data.frame(
ids = ids,
diagnosis = diagnosis,
date.of.diag = fix_date_char(date.of.diag)
)
rm(id, ids, diagnosis, date.of.diag) # clean up
dict <- fix_date_df(dict, "date.of.diag")
```
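The dataset hierarchy can also be written without an explicit loop. The sketch below is one possible formulation on toy data, not the method used in this report. All object names (`src_outcomes`, `src_updated`, `src_fcal`, `toy_dict`) are hypothetical. One behavioural difference is worth noting: `dplyr::coalesce()` falls through to the next source when a value is missing, whereas the loop takes the value from the first dataset that lists the subject, even if that value is `NA`.

```r
library(dplyr)

# Hypothetical stand-ins for the three sources, in order of preference
src_outcomes <- data.frame(ids = c(1, 2), diag_date = c("2010-03-01", "2012-07-15"))
src_updated <- data.frame(ids = c(2, 3), diag_date = c("2012-08-01", "2015-01-20"))
src_fcal <- data.frame(ids = c(3, 4), diag_date = c("2015-02-01", "2009-11-30"))

all_ids <- data.frame(
  ids = sort(unique(c(src_outcomes$ids, src_updated$ids, src_fcal$ids)))
)

toy_dict <- all_ids %>%
  left_join(rename(src_outcomes, d_outcomes = diag_date), by = "ids") %>%
  left_join(rename(src_updated, d_updated = diag_date), by = "ids") %>%
  left_join(rename(src_fcal, d_fcal = diag_date), by = "ids") %>%
  # Take the first non-missing date, following the assumed hierarchy
  mutate(date.of.diag = as.Date(coalesce(d_outcomes, d_updated, d_fcal))) %>%
  select(ids, date.of.diag)

print(toy_dict)
```

Subject 2 takes its date from the first source and subject 3 from the second, mirroring the order of the `if`/`else if` branches.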
Before filtering by inclusion/exclusion criteria, there are
`r sum(is.na(dict$date.of.diag))` individuals for whom the date of
diagnosis is missing and `r sum(is.na(dict$diagnosis))` individuals for
whom the diagnosis (IBD type) is missing.
As can be seen in the table below, two spellings of Crohn's disease (with and
without an apostrophe) and three names for IBDU are used. There are also
subjects reported to not have IBD. These will be removed, leaving
`r sum(dict$diagnosis != "Not IBD" & !(dict$ids %in% non.ibd$ids))` subjects
included in the subsequent analyses.
```{R Frequency table to diagnosis pre-process}
#| label: tbl-diag-pre
#| tbl-cap: Reported IBD types
kable(unique(dict$diagnosis),
col.names = c("Diagnosis")
)
dict <- dict %>%
subset(diagnosis != "Not IBD") %>%
subset(!(ids %in% non.ibd$ids))
```
The Crohn's disease and IBDU names have been standardised (IBDU,
Inflammatory Bowel Disease, and Inflammatory Bowel Disease - Unknown
Subtype are assumed to be equivalent). Subjects reported to not have IBD
have been removed.
```{R Contingency table of IBD diagnosis pre and post processing}
#| label: tbl-diag-post
#| tbl-cap: Contingency table showing mapping of IBD types to a standardised format
#| results: "hold"
dict$old <- dict$diagnosis
dict$diagnosis <- plyr::mapvalues(
dict$diagnosis,
from = c(
"Crohns Disease",
"Inflamatory Bowel Disease",
"Inflamatory Bowel Disease - Unknown Subtype"
),
to = c(
"Crohn's Disease",
"IBDU",
"IBDU"
)
)
kable(addmargins(table(dict$diagnosis, dict$old), margin = 2))
dict$old <- NULL
```
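For readers unfamiliar with `plyr::mapvalues()`, the same relabelling can be sketched in base R with a named lookup vector. The vectors `raw_labels` and `lookup` below are hypothetical toy inputs. The misspelt source labels are copied from the mapping above, and labels without an entry in the lookup are left unchanged.

```r
# Hypothetical raw labels, including one that needs no change
raw_labels <- c(
  "Crohns Disease", "Crohn's Disease", "IBDU",
  "Inflamatory Bowel Disease", "Ulcerative Colitis"
)

# Names are the raw spellings; values are the standardised labels
lookup <- c(
  "Crohns Disease" = "Crohn's Disease",
  "Inflamatory Bowel Disease" = "IBDU",
  "Inflamatory Bowel Disease - Unknown Subtype" = "IBDU"
)

# Replace labels found in the lookup and keep all others as they are
standardised <- ifelse(
  raw_labels %in% names(lookup),
  unname(lookup[raw_labels]),
  raw_labels
)

table(standardised)
```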
### Age and sex
```{r add sex to dict}
# Merge with fcal to add sex information
dict <- fcal.pheno[, c("ids", "sex")] %>%
distinct(ids,
.keep_all = TRUE
) %>%
merge(x = dict, by = "ids", all.x = TRUE, all.y = FALSE)
# Update NA sex if sex available from updated
dict <- merge(dict,
updated[, c("ids", "sex")],
by = "ids",
all.x = TRUE,
all.y = FALSE
)
for (i in seq_len(nrow(dict))) {
if (is.na(dict[i, "sex.x"]) && !is.na(dict[i, "sex.y"])) {
dict[i, "sex.x"] <- dict[i, "sex.y"]
}
}
dict$sex <- dict$sex.x
dict$sex.x <- dict$sex.y <- NULL
# Add age at IBD diagnosis
updated <- fix_date_df(updated, "diagnosisDate")
updated$age <- with(updated, year(diagnosisDate) - dateOfBirth)
dict <- merge(dict,
updated[, c("ids", "age")],
by = "ids",
all.x = TRUE,
all.y = FALSE
)
```
Note that there are `r sum(is.na(dict$age))` individuals for whom age at
diagnosis is missing (this is due to missing diagnostic dates). However,
there are `r sum(is.na(dict$sex))` individuals for whom sex is missing.
### Date of death
```{r add date.of.death}
dict <- merge(dict,
updated[, c("ids", "death")],
by = "ids",
all.x = TRUE,
all.y = FALSE
) %>%
rename(date.of.death = death)
dict <- fix_date_df(dict, "date.of.death")
```
Date of death information is available for
`r sum(!is.na(dict$date.of.death))` individuals. We assume that all
remaining individuals have not died by the date of data extraction.
## Exclusion criteria according to diagnosis and date of diagnosis
Date of diagnosis is an integral aspect of the inclusion criteria and
is also used to retime biomarker measurements.
Firstly, we remove `r sum(is.na(dict$date.of.diag))` individuals with a missing date of diagnosis.
```{R Remove NA date of diagnosis}
dict <- dict[!is.na(dict$date.of.diag), ]
```
This leaves `r nrow(dict)` subjects for subsequent analyses.
```{R Apply date of diagnosis exclusion}
# no. subjects over upper bound
max_fup <- max(labs$COLLECTION_DATE)
max_fup <- as.Date(max_fup)
n.upper <- nrow(subset(dict, max_fup - date.of.diag < 5 * 365.25))
# no. subjects under lower bound
n.lower <- nrow(subset(dict, (year(date.of.diag) < 2004) |
(year(date.of.diag) == 2004 & month(date.of.diag) <= 9)))
# subset to subjects meeting the criteria.
dict <- subset(dict, max_fup - date.of.diag >= 5 * 365.25)
dict <- subset(dict, (year(date.of.diag) >= 2005) |
(year(date.of.diag) == 2004 & month(date.of.diag) >= 10))
```
As described in the [inclusion/exclusion criteria
section](#inclusionexclusion-criteria-definitions), the inclusion criteria
impose lower and upper bounds on the date of diagnosis. In this case, we
only consider subjects diagnosed from October 2004 onwards and before 13 August
2019 (this ensures individuals can be followed up for at least 5 years by the
time of the last longitudinal measurement obtained for the cohort;
`r max_fup`). Subjects who do not meet these criteria are excluded.
`r n.upper` subjects were diagnosed on 13 August 2019 or later and were removed,
and `r n.lower` subjects were diagnosed prior to October 2004 and were
likewise removed. This results in a cohort size of `r nrow(dict)`
subjects.
```{R Clean up excluded date of diagnosis counts}
#| include: false
rm(n.upper, n.lower) # clean up
```
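The date window can be illustrated on toy data as follows. This is only a sketch: `toy_dates`, `last_obs`, and `lower_bound` are hypothetical, and the value of `last_obs` is an arbitrary stand-in rather than the actual date of the last measurement in the cohort.

```r
# Hypothetical diagnosis dates for five subjects
toy_dates <- data.frame(
  ids = 1:5,
  date.of.diag = as.Date(c(
    "2004-09-30", "2004-10-01", "2012-06-15", "2018-01-01", "2019-06-01"
  ))
)

last_obs <- as.Date("2023-06-30")    # assumed date of the latest measurement
lower_bound <- as.Date("2004-10-01") # first eligible diagnosis date

# Potential follow-up (in days) available to each subject
followup_days <- as.numeric(last_obs - toy_dates$date.of.diag)

# Keep subjects diagnosed on/after the lower bound with >= 5 years of follow-up
keep <- toy_dates$date.of.diag >= lower_bound & followup_days >= 5 * 365.25

toy_dates$ids[keep]
```

Subject 1 is dropped by the lower bound and subject 5 by the five-year requirement.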
## Exploration
<!--- CAV: we should move this into an EDA script --->
We now explore date of diagnosis to ensure data quality is at a suitable
standard.
##### Year of diagnosis
From @fig-diag-year, we can see the number of IBD diagnoses each year is
relatively static with the exception of 2004 and 2005. The former
describes only three months of the year, and incidence was likely still
increasing across both of these two years. @Jones2019 found IBD
incidence from 2008 onwards to be consistent over time for patients
diagnosed by NHS Lothian which is in agreement with our findings.
```{R Year of diagnosis}
#| label: fig-diag-year
#| fig-cap: "Distribution of year of diagnosis"
dict %>%
ggplot(aes(x = year(date.of.diag))) +
geom_histogram(fill = "#B8F2E6", color = "black", binwidth = 1) +
theme_minimal() +
xlab("Year of IBD diagnosis") +
ylab("Frequency")
```
##### Month of diagnosis
It appears there are subjects for whom only year of diagnosis was
available. This has resulted in the 1<sup>st</sup> of January for that
year being recorded as the date of diagnosis for that subject. As such,
far more diagnoses are reported in January than in other months
(@fig-diag-month). There are
`r sum(day(dict$date.of.diag) == 1 & month(dict$date.of.diag) == 1)`
subjects reported to have been diagnosed on New Year's Day.
If a subject's exact date of diagnosis is not known, then this is most
likely because the subject was not diagnosed by NHS Lothian. Instead,
the date of diagnosis would have needed to be recalled by the subject
which introduces the inaccuracy observed.
As subjects are required to have a diagnostic biomarker measurement
within 90 days of the reported date of diagnosis to be included in
this study, and patients diagnosed outside of NHS Lothian will not have
a biomarker test result available within this period, the effect on our
study should be minimal. However, we will revisit month of diagnosis
after filtering by existence of diagnostic biomarker measurements to
confirm our assumption.
```{R Year of diagnosis (preprocessed)}
#| label: fig-diag-year-processed
#| fig-cap: "Bar plot of year of diagnosis"
dict %>%
ggplot(aes(x = as.factor(year(date.of.diag)))) +
geom_bar(color = "black", fill = "#FEC601", linewidth = 0.3) +
theme_minimal() +
ylab("Frequency") +
xlab("Year of IBD Diagnosis")
```
```{R Month of diagnosis (preprocessed)}
#| label: fig-diag-month
#| fig-cap: "Bar plot of month of diagnosis"
dict %>%
ggplot(aes(x = as.factor(month(date.of.diag, label = TRUE)))) +
geom_bar(color = "black", fill = "#FEC601", linewidth = 0.3) +
theme_minimal() +
ylab("Frequency") +
xlab("Month of IBD Diagnosis")
```
```{r diagnostic trend}
diag.ts <- dict %>%
mutate(
date.of.diag.month =
as.Date(paste0(format(date.of.diag, "%Y-%m"), "-01"), format = "%Y-%m-%d")
) %>%
group_by(date.of.diag.month) %>%
summarise(count = n()) %>%
ungroup()
diag.ts %>%
ggplot(aes(x = as.Date(date.of.diag.month, "%Y-%m"), y = count, group = 1)) +
geom_line() +
xlab("Diagnosis date") +
ylab("Number of diagnoses") +
scale_x_date(date_labels = "%Y-%m") +
theme_minimal()
```
## Faecal calprotectin
Faecal calprotectin (FC), a marker of intestinal inflammation, has been
obtained from two datasets. The first dataset has been curated by
members of the Lees group whilst the second dataset is a direct extract
from TRAK, a patient monitoring system used by NHS Lothian.
### Incorporating later extract
Data from these datasets have been merged and duplicates (same
subject ID, measurement date, and recorded value) were removed. We also
reduced the FC dataset to only describe subjects for whom IBD type and
date of diagnosis are available.
For the TRAK extract, times for test results are given in datetime
format (where both date and time of day are provided). The times
have been dropped as this degree of granularity is not required.
Duplicate tests (same ID, date, and test value) are removed here.
```{R FCAL merge}
fcal <- fcal[, c(
"ids",
"COLLECTION_DATE",
"TEST_DATA",
"sex",
"diagnosis",
"diagnosis_date"
)]
# Subset to only include those that passed the earlier inclusion/exclusion
fcal <- subset(fcal, ids %in% dict$ids)
# Collection dates include collection times which are not required. Discarding.
fcal$COLLECTION_DATE <- readr::parse_date(
stringr::str_split_fixed(fcal$COLLECTION_DATE, " ", n = 2)[, 1],
format = "%Y-%m-%d"
)
colnames(fcal)[1:3] <- c("ids", "calpro_date", "calpro_result")
# Drop the phenotype diagnosis column; the standardised diagnosis from dict
# is merged in below
fcal <- fcal %>%
  select(-diagnosis)
fcal <- merge(fcal,
dict[, c("ids", "diagnosis", "date.of.diag")],
by = "ids",
all.x = TRUE
)
```
### Pre-processing of test results
```{R FCAL censor mapping}
#| warning: false
## Some values cannot be directly coerced as numeric
fcal$calpro_result <- plyr::mapvalues(
fcal$calpro_result,
from = c(
"<20",
"<25",
"<50",
">1250",
">2500",
">3000",
">6000"
),
to = c(
"20",
"25",
"50",
"1250",
"2500",
"3000",
"6000"
)
)
# Here, values that cannot be converted into a numeric value will be excluded
# (they are converted to NA)
fcal$calpro_result <- as.numeric(fcal$calpro_result) # Remove error codes
fcal <- fcal[!is.na(fcal[, "calpro_result"]), ]
# Apply limits of detection
fcal[fcal[, "calpro_result"] < 20, "calpro_result"] <- 20
fcal[fcal[, "calpro_result"] > 1250, "calpro_result"] <- 1250
```
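FC data can be both left and right censored. FC recorded as "\<20" were
mapped to 20, and any value below 20 was set to 20. FC recorded as
"\<25" or "\<50" were mapped to 25 and 50, respectively. FC recorded as
"\>1250", "\>2500", "\>3000", or "\>6000" were all mapped to 1250, as
was any numeric value above 1250. Any FC tests given an error code (e.g.
marked as an insufficient sample by the laboratory) have been removed.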
::: {.callout-note collapse="true"}
#### More information on FC censoring
The lower limit of detection is $20 \mu g/g$ ($20 \mu g$ of
calprotectin per $g$ of stool). By reducing the upper limit, it is
possible to run more tests in parallel. As a higher throughput has been
required over time, the upper threshold for tests has become lower.
Initially test results over $6000 \mu g/g$ were censored, then
$2500 \mu g/g$ and now $1250 \mu g /g$ is the upper limit for FC tests.
This change has resulted in minimal impact in clinics as $1250 \mu g/g$
is still considered a high result. However, this change has potential
implications for research.
:::
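Because the upper limit has changed over time, it can be useful to keep explicit censoring indicators alongside the harmonised values. The sketch below shows one way this could be done on toy input. The vector `raw` and the data frame `parsed` are hypothetical, and the flags are an illustration rather than something the pipeline stores. Here any value at or beyond the common limits (20 and 1250 µg/g) is treated as censored.

```r
# Hypothetical raw results as they might appear in an extract
raw <- c("<20", "135", ">1250", ">6000", "Insufficient sample", "18", "2000")

# Record the direction of censoring before stripping the symbols
left <- startsWith(raw, "<")
right <- startsWith(raw, ">")

# Error codes cannot be coerced and become NA (warnings suppressed on purpose)
value <- suppressWarnings(as.numeric(sub("^[<>]", "", raw)))

parsed <- data.frame(raw, value, left, right)
parsed <- parsed[!is.na(parsed$value), ] # drop error codes

# Harmonise to common limits of detection and update the flags accordingly
parsed$left <- parsed$left | parsed$value <= 20
parsed$right <- parsed$right | parsed$value >= 1250
parsed$value <- pmin(pmax(parsed$value, 20), 1250)

print(parsed)
```

With flags like these, the number of non-censored observations per subject is simply the count of rows where both `left` and `right` are `FALSE`.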
```{r fig-fcal-meas}
#| label: "fig-fcal-meas"
#| fig-cap: "Density plot of FCAL measurements by observed value"
#| warning: false
fcal %>%
ggplot(aes(x = calpro_result)) +
geom_density(
linewidth = 0.8,
alpha = 0.5,
fill = "#9FD8CB",
color = "#517664"
) +
theme_minimal() +
theme(axis.text.y = element_blank()) +
xlab("FCAL (µg/g)") +
ylab("Density") +
geom_vline(xintercept = 250, colour = "red")
```
FC test results on the original measurement scale are heavily
right-skewed, with most values concentrated around $100 \mu g/g$. These
data will be log-transformed, resulting in the multi-modal distribution
seen in @fig-logfcal-meas.
```{R}
#| label: "fig-logfcal-meas"
#| fig-cap: "Density plot of logged FCAL test results"
fcal %>%
ggplot(aes(x = log(calpro_result))) +
geom_density(
linewidth = 0.8,
alpha = 0.5,
fill = "#9FD8CB",
color = "#517664"
) +
theme_minimal() +
theme(axis.text.y = element_blank()) +
xlab("log(FCAL (µg/g))") +
ylab("Density") +
geom_vline(xintercept = log(250), colour = "red")
```
### Removal of duplicate FCAL measurements
```{R}
duplicated.ids <- c()
for (id in unique(fcal$ids)) {
sub.fcal <- subset(fcal, ids == id) # Get FC data for a subject
sub.fcal <- sub.fcal[order(sub.fcal$calpro_date), ] # Order by dates
time.diff <- diff(sub.fcal$calpro_date) # Find time between ordered dates
value.diff <- diff(sub.fcal$calpro_result) # Find difference in observed values
# If two measurements are within 10 days of each other and have the same value
if (any(time.diff <= 10 & value.diff == 0)) {
duplicated.ids <- c(duplicated.ids, id)
# Remove suspected duplicates (taking into account difference is lagged)
sub.fcal <- sub.fcal[c(TRUE, !(time.diff <= 10 & value.diff == 0)), ]
# Remove data for subject with duplicates
fcal <- subset(fcal, ids != id)
# Add non duplicated data back
fcal <- rbind(fcal, sub.fcal)
}
}
```
Given FC was retrospectively collected from observational data, it is possible
duplicate tests have been recorded in the extract. As it is NHS Lothian policy
to not test FC days apart, we can assume that any tests with the same
observation within a small time frame are likely to be duplicates. This may
occur when collection dates and testing dates have been used interchangeably.
If a subsequent FC test for a subject was dated within 10 days of a previous
test and has the same observed value, this observation was dropped.
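The duplicate rule (same value, dated within 10 days of the previous test for the same subject) can be sketched on toy data with grouped lags. This is an illustrative alternative, not the code used in this report; `toy_fc` and the helper columns `gap_days`, `same_value`, and `suspected_dup` are hypothetical.

```r
library(dplyr)

# Hypothetical FC data for two subjects
toy_fc <- data.frame(
  ids = c(1, 1, 1, 2, 2),
  calpro_date = as.Date(c(
    "2015-01-01", "2015-01-06", "2015-03-01", "2016-05-01", "2016-05-04"
  )),
  calpro_result = c(400, 400, 400, 250, 300)
)

deduplicated <- toy_fc %>%
  arrange(ids, calpro_date) %>%
  group_by(ids) %>%
  mutate(
    # Days since the subject's previous test (NA for their first test)
    gap_days = as.numeric(calpro_date - lag(calpro_date)),
    same_value = calpro_result == lag(calpro_result),
    suspected_dup = !is.na(gap_days) & gap_days <= 10 & same_value
  ) %>%
  ungroup() %>%
  filter(!suspected_dup) %>%
  select(ids, calpro_date, calpro_result)

print(deduplicated)
```

Only the second test of subject 1 is dropped. The later test with the same value is kept because it is 54 days apart, and subject 2's tests differ in value.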
```{R}
htmltools::p(paste(
"There are",
length(duplicated.ids),
"subjects who appear to have duplicated FC measurements"
))
```
### Retiming - part 1
FC measurement times were retimed and rescaled to be the number of
years since IBD diagnosis.
```{R FCAL time mapping}
#| label: fig-fcal-spag-pre
#| fig-cap: "Spaghetti plot of FC trajectories (preprocessed)"
# Dates have already been converted to Date class by fix_date_char() for dict
fcal$calpro_time <- as.numeric(fcal$calpro_date - fcal$date.of.diag) / 365.25
fcal %>%
ggplot(aes(x = calpro_time, y = log(calpro_result), color = factor(ids))) +
geom_line(alpha = 0.2) +
geom_point(alpha = 0.6) +
theme_minimal() +
scale_color_manual(values = viridis::viridis(length(unique(fcal$ids)))) +
guides(color = "none") +
xlab("Time (years)") +
ylab("Log(FCAL (µg/g))") +
ggtitle("")
```
After retiming, it is clear some FC observations are earlier than diagnosis and
in some cases substantially earlier (@fig-fcal-spag-pre). Tests taken close to
the reported date of diagnosis are likely "early" as a result of diagnostic
delay whereas tests from much earlier were likely requested due to other
conditions.
As such, FC results earlier than 90 days prior to diagnosis will be discarded.
If a subject has an FC within 90 days before diagnosis, then all of their FC
tests will be retimed such that their earliest FC within this period is equal to
0 and all later measurements are shifted accordingly to maintain the same
differences between measurement times.
```{r remove prediag fcal}
fcal <- subset(fcal, calpro_time >= -0.25)
```
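A small worked example may help to make the retiming rule concrete. The toy data frame `toy_times` and the `retimed` column are hypothetical, and times are in years relative to the recorded diagnosis date.

```r
# Hypothetical measurement times for two subjects
toy_times <- data.frame(
  ids = c(1, 1, 1, 1, 2, 2),
  calpro_time = c(-1.5, -0.1, 0.4, 2.0, 0.05, 1.0)
)

# Step 1: discard tests taken more than 0.25 years before diagnosis
toy_times <- subset(toy_times, calpro_time >= -0.25)

# Step 2: find each subject's earliest remaining measurement time
earliest <- ave(toy_times$calpro_time, toy_times$ids, FUN = min)

# Step 3: shift only subjects whose earliest test precedes diagnosis,
# so that this test sits at time 0 and all gaps are preserved
shift <- pmax(-earliest, 0)
toy_times$retimed <- toy_times$calpro_time + shift

print(toy_times)
```

Subject 1 is shifted forward by 0.1 years, so their retained tests land at 0, 0.5, and 2.1 years. Subject 2 is unchanged because their first test follows diagnosis.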
### Inclusion/exclusion: removal of subjects without a diagnostic FC test
As indicated in our inclusion/exclusion criteria, we reduce the dataset to only
subjects with a diagnostic FC. This equates to subjects with at least one FCAL
measurement within 3 months of diagnosis ($t < 0.25$ years).
```{R FCAL retiming}
diagnostic <- fcal %>%
group_by(ids) %>%
summarise(n = sum(calpro_time < 0.25)) %>%
subset(n > 0)
fcal <- subset(fcal, ids %in% diagnostic$ids)
```
After this exclusion, `r length(unique(fcal$ids))` subjects remain in the data.
```{R}
##The following code is used to save the cleaned data generated so far.
diag.time <- c()
fc.ids <- unique(fcal$ids)
for (id in fc.ids) {
temp <- subset(fcal, ids == id)
temp <- temp[order(temp$calpro_time), ]
diag.time <- c(diag.time, temp[1, "calpro_time"])
}
fc.dist <- data.frame(ids = fc.ids, diagnostic = diag.time)
if (!dir.exists(paste0(prefix, "processed"))) {
dir.create(paste0(prefix, "processed"))
}
saveRDS(fc.dist, paste0(prefix, "processed/fc-diag-dist.RDS"))
```
### Retiming - part 2
Here, we show the distribution of the diagnostic FCAL measurements with
respect to diagnosis date (in days).
```{r fcal days before diag}
p <- fcal %>%
group_by(ids) %>%
filter(calpro_time == min(calpro_time)) %>%
ggplot(aes(x = calpro_time * 365.25)) +
geom_density(fill = "#20A39E", color = "#187370") +
theme_minimal() +
labs(
y = "Density",
x = "Time from diagnosis to first faecal calprotectin (days)"
) +
geom_vline(xintercept = 0, linetype = "dashed", color = "red")
ggsave("plots/fc-diagnostic-dist.png",
p,
width = 16 * 2 / 3,
height = 6,
units = "in"
)
```
Although most diagnostic measurements were observed around the recorded
diagnosis time (near zero), this is not always the case. As such, we
decided to retime measurements/diagnosis time as described above (i.e. time = 0
matches the time of the first available FCAL measurement). This is only applied
to individuals for whom the first available FCAL measurement is prior to the
recorded diagnosis time.
```{R}
#| label: fig-fcal-spag-post
#| fig-cap: "Spaghetti plot of FC trajectories (processed)"
# Retime so that t_0 = 0.
for (id in unique(fcal$ids)) {
temp <- subset(fcal, ids == id)
if (any(temp$calpro_time < 0)) {
add <- sort(temp$calpro_time)[1]
fcal[fcal[, "ids"] == id, "calpro_time"] <-
fcal[fcal[, "ids"] == id, "calpro_time"] + abs(add)
}
}
length(unique(fcal$ids))
fcal %>%
ggplot(aes(x = calpro_time, y = log(calpro_result), color = factor(ids))) +
geom_line(alpha = 0.2) +
geom_point(alpha = 0.6) +
theme_minimal() +
scale_color_manual(values = viridis::viridis(length(unique(fcal$ids)))) +
guides(color = "none") +
xlab("Time (years)") +
ylab("Log(FCAL (µg/g))") +
ggtitle("")
```
At this stage, we also exclude FCAL measurements recorded beyond 7 years post
diagnosis.
```{r remove fcal after 7 years}
fcal <- subset(fcal, calpro_time <= 7)
length(unique(fcal$ids))
```
At this point, the data contains `r nrow(fcal)` FCAL measurements across all
`r length(unique(fcal$ids))` individuals.
### Inclusion/exclusion: a minimum number of FCAL measurements
Here, we summarise the number of observations per individual (with and without
considering censored values) as well as length of follow-up. The latter is
defined as the difference (in years) between diagnosis time and the time of the
latest FCAL measurement.
```{r}
countsDF <- fcal %>%
group_by(ids) %>%
summarise(
n.total = n(),
censored.left = sum(calpro_result == 20),
censored.right = sum(calpro_result == 1250),
n.noncensored = n.total - censored.left - censored.right,
n.negtime.nondiag = sum(calpro_time != 0 & calpro_date - date.of.diag < 0),
followup = max(calpro_time)
)
```
```{R fcal followup pre exclusions}
#| label: fig-fcal-follow-up
#| fig-cap: "(A) Histogram of the number of FCAL measurements per subject. (B) Histogram of FCAL follow-up per subject (before exclusions). (C) Hexagonal heatmap of the number of FCAL measurements vs follow-up."
#| fig-width: 12
#| column: body-outset
p1 <- countsDF %>%
ggplot(aes(x = n.total)) +
geom_histogram(color = "black", fill = "#5BBA6F", binwidth = 1) +
theme_minimal() +
ylab("Count") +
xlab("Measurements per subject")
p2 <- countsDF %>%
ggplot(aes(x = followup)) +
geom_histogram(color = "black", fill = "#5BBA6F", binwidth = 1) +
theme_minimal() +
ylab("Count") +
xlab("Follow-up (years)")
p3 <- countsDF %>%
ggplot(aes(y = n.total, x = followup)) +
geom_hex() +
xlab("Follow-up (years since diagnosis)") +
ylab("Number of FC measurements") +
theme_minimal() +
labs(fill = "Count")
p1 + p2 + p3 + plot_annotation(tag_levels = "A")
```
Summary observations:
- The number of FC measurements per subject is skewed towards low values.
However, there are also subjects with many FC observations. These subjects were
investigated and found to often have a complex disease course, such as acute
severe ulcerative colitis, and required close monitoring as a result.
- We do not directly require a minimum follow-up for a subject as this may bias
our findings. For example, UC patients who undergo a proctocolectomy
(surgical removal of the colon and rectum) are less likely to have FC
measurements after their surgery than patients who did not require
surgery. By applying criteria based on follow-up, we could indirectly
exclude subjects based on disease outcomes.
- However, the requirement of at least 3 FCAL measurements (within 7 years
from diagnosis) will indirectly remove individuals with a very short follow-up time.
Note that our inclusion criteria require at least 3 non-censored FCAL
measurements per individual. As such, censoring needs to be taken into account
before filtering out individuals.
```{r}
p1 <- countsDF %>%
ggplot(aes(x = n.noncensored)) +
geom_histogram(color = "black", fill = "#5BBA6F", binwidth = 1) +
theme_minimal() +
ylab("Count") +
xlab("Number of non-censored measurements per subject")
p2 <- countsDF %>%
ggplot(aes(y = n.noncensored, x = followup)) +
geom_point(color = "#FF4F79", size = 2) +
xlab("Follow-up (years since diagnosis)") +
ylab("Number of non-censored FCAL measurements") +
theme_minimal() +
geom_hline(yintercept = 3)
p1 + p2 + plot_annotation(tag_levels = "A")
```
```{R calculate number of non censored FCAL}
# Number of individuals with at least 3 FCAL measurements
sum(countsDF$n.total >= 3)
# Number of individuals with at least 3 non-censored FCAL measurements
sum(countsDF$n.noncensored >= 3)
# Number of individuals with at least 3 non-censored FCAL measurements after
# discarding non-diagnostic values with negative calpro_time
sum(countsDF$n.noncensored - countsDF$n.negtime.nondiag >= 3)
```
The following is used to select individuals with at least 3 non-censored FCAL
measurements.
```{R FCAL frequency}
fcal <- fcal %>%
subset(ids %in% countsDF$ids[countsDF$n.noncensored >= 3])
length(unique(fcal$ids))
```
This leaves a cohort with `r length(unique(fcal$ids))` individuals.
```{r fcal followup post exclusions}
#| label: fig-fcal-follow-up-post
#| fig-cap: "(A) Histogram of the number of FCAL measurements per subject. (B) Histogram of FCAL follow-up per subject. (C) Hexagonal heatmap of the number of FCAL measurements vs follow-up. In all cases, only individuals with at least 3 non-censored measurements are included."
#| fig-width: 12
#| column: body-outset
p1 <- countsDF %>%
subset(n.noncensored >= 3) %>%
ggplot(aes(x = n.total)) +
geom_histogram(color = "black", fill = "#5BBA6F", binwidth = 1) +
theme_minimal() +
ylab("Count") +
xlab("Measurements per subject")
p2 <- countsDF %>%
subset(n.noncensored >= 3) %>%
ggplot(aes(x = followup)) +
geom_histogram(color = "black", fill = "#5BBA6F", binwidth = 1) +
theme_minimal() +
ylab("Count") +
xlab("Follow-up (years)")
p3 <- countsDF %>%
subset(n.noncensored >= 3) %>%
ggplot(aes(y = n.total, x = followup)) +
geom_hex() +
xlab("Follow-up (years since diagnosis)") +
ylab("Number of FC measurements") +
theme_minimal() +
labs(fill = "Count")
p1 + p2 + p3 + plot_annotation(tag_levels = "A")
```
<!--- CAV: do we need this here? may remove and keep in an EDA file --->
```{R}
fcal %>%
select(ids) %>%
table() %>%
quantile(probs = c(0.25, 0.5, 0.75))
```
### Revisiting month of diagnosis
After restricting to subjects who have a diagnostic FC available, January
is no longer over-represented as a month of diagnosis, as we hypothesised
(@fig-diag-month-redux).
```{R Month of diagnosis (postprocessed)}
#| label: fig-diag-month-redux
#| fig-cap: "Bar plot of month of diagnosis"
dict.temp <- subset(dict, ids %in% unique(fcal$ids))
dict.temp %>%
ggplot(aes(x = as.factor(month(date.of.diag, label = TRUE)))) +
geom_bar(color = "black", fill = "#FEC601", linewidth = 0.3) +
theme_minimal() +
ylab("Frequency") +
xlab("Month of IBD Diagnosis")
```
## C-reactive protein
```{R crp preprocess}