---
title: 'Analysis of surveillance data : Analysing Cryptosporidium notification data
from country X, 2004-2015'
output:
worded::rdocx_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r, echo = F, warning= F, message= F}
#load packages
#Those required for creating this markdown:
#"Worded": for styling and pagebreaks
#"knitr": for styling tables
required_packages <- c("worded", "knitr")
for (i in seq(along = required_packages)) {
library(required_packages[i], character.only = TRUE)
}
```
<!---CHUNK_PAGEBREAK--->
# Copyright and License
**Source:**
This case study was first designed by Esther Kissling, EpiConcept, 2016; it was then translated into *R* by Alexander Spina in 2018. It is based on surveillance data from an anonymous country.
**Revisions:**
*If you modify this case study, please indicate below your name and changes you made*
**You are free:**
- **to Share** — to copy, distribute and transmit the work
- **to Remix** — to adapt the work
**Under the following conditions:**
- **Attribution** — You must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work). The best way to do this is to keep as it is the list of contributors: sources, authors and reviewers.
- **Share Alike** — If you alter, transform, or build upon this work, you may distribute the resulting work only under the same or similar license to this one. Your changes must be documented. Under that condition, you are allowed to add your name to the list of contributors.
- You cannot sell this work alone but you can use it as part of a teaching.
**With the understanding that:**
- **Waiver** — Any of the above conditions can be waived if you get permission from the copyright holder.
- **Public Domain** — Where the work or any of its elements is in the public domain under applicable law, that status is in no way affected by the license.
- **Other Rights** — In no way are any of the following rights affected by the license:
- Your fair dealing or fair use rights, or other applicable copyright exceptions and limitations;
- The author's moral rights;
- Rights other persons may have either in the work itself or in how the work is used, such as publicity or privacy rights.
- **Notice** — For any reuse or distribution, you must make clear to others the license terms of this work by keeping together this work and the current license.
This licence is based on http://creativecommons.org/licenses/by-sa/3.0/
<!---CHUNK_PAGEBREAK--->
# Objectives
At the end of the case study, participants should be able to analyse surveillance data using *R*, going through all steps including:
- data checking,
- data cleaning/recoding,
- data description,
- appropriate statistical testing,
- merging datasets with denominators,
- calculating incidence rates and
- calculating incidence rate ratios.
# Guide to the case study
The case study is designed for use with *R* statistical programming software.
All files necessary for completing a session are placed in the corresponding session folder. There should be no need to copy files from other session folders.
<!---CHUNK_PAGEBREAK--->
# Background
Cryptosporidium is a protozoal parasite that causes a diarrhoeal illness in humans known as cryptosporidiosis. It is transmitted by the faeco-oral route, with both animals and humans serving as potential reservoirs. Cryptosporidiosis is a notifiable disease in many countries across the world.
## Case definition
3.9 Cryptosporidiosis
**Clinical Criteria**
Any person with at least one of the following two:
- Diarrhoea;
- Abdominal pain.
**Laboratory Criteria**
At least one of the following four:
- Demonstration of *Cryptosporidium* oocysts in stool;
- Demonstration of *Cryptosporidium* in intestinal fluid or small-bowel biopsy specimens;
- Detection of *Cryptosporidium* nucleic acid in stool;
- Detection of *Cryptosporidium* antigen in stool.
**Epidemiological Criteria**
One of the following *five* epidemiological links:
- Human to human transmission
- Exposure to a common source
- Animal to human transmission
- Exposure to contaminated food/drinking water
- Environmental exposure
**Case Classification**
**A. Possible case** NA
**B. Probable case**
Any person meeting the clinical criteria with an epidemiological link
**C. Confirmed case**
Any person meeting the clinical and laboratory criteria
Note: If the national surveillance system is not capturing clinical symptoms, all laboratory-confirmed individuals should be reported as confirmed cases.
*From:* [Commission Implementing Decision on the communicable diseases and related special health issues to be covered by epidemiological surveillance – Annex 1](https://ec.europa.eu/health/sites/health/files/communicable_diseases/docs/2018_impldecision_annex1_en.pdf) (replacing Commission Decision No. 2000/96/EC).
## Datasets
The main dataset for the case study covers the years from 2004-2015. The dataset is available in Excel: "crypto.xls".
Each record designates one case of Cryptosporidium. Information on species (e.g. *C. parvum*, *C. hominis*, etc.) is not available in this dataset in country X. Due to funding restrictions, routine speciation of samples was stopped in most laboratories in country X.
Below you can find a data dictionary of the variables and values included in the dataset:
```{r, echo = F}
kable( matrix( c(
"Variable name", "Type", "Code", "Definition",
"ID", "String", "", "Identifies each patient with cryptosporidiosis",
"Notif_Date", "Date", "dd/mm/yyyy", "Date of notification",
"Week", "Integer", "", "Week of notification",
"Month", "Integer", "", "Month of notification",
"Quarter", "Integer", "", "Quarter of notification",
"Year", "Integer", "", "Year of notification",
"Region", "Integer", "1, 2, 3, 4, 5, 6, 7", "Region of notification of case",
"OnsetDate", "Date", "dd/mm/yyyy", "Date of symptom onset",
"AgeY", "Integer", "", "Age in years",
"Sex", "Integer", "0=female, 1=male", "Gender",
"PatientType", "String", "", "Type of patient, e.g. A&E patient, GP patient, Hospital inpatient, etc.",
"CountryofInfection", "String", "", "Suspected country of infection.",
"CaseClassification", "String", "", "Classification of case: confirmed, probable, not specified."
), ncol = 4, nrow = 14, byrow = T)
)
```
You will also be using denominator data for the case study. Here we are using 2011 denominator data in the Excel spreadsheet: “denominators2011.xls”.
There are three tabs on the spreadsheet: Region, AgeGroup and AgeGroup by Region, which give the following information:
- Region: provides total population numbers by region in 2011
- AgeGroup: provides total population numbers for each age in years from 0 to 100 in 2011
- AgeGroup by Region: provides total population numbers for each age in years from 0 to 100 by region in 2011
## Main task for the case study
Your main task for the case study is to analyse the surveillance data, with a focus on 2015 data. Your task is in particular to calculate incidence rates, and compare the 2015 data with 2014 data and other previous years.
When using *R*, please use *R-scripts*, ensuring there is a “Master” *R-script* that serves as a table of contents. This "Master" file should link to the other *R-scripts* using the *source* function.
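A "Master" script could look something like the sketch below. The script names are hypothetical and only illustrate the idea; use whatever names and order suit your own analysis.

```{r, eval = F}
# master.R: table of contents for the analysis
# NOTE: the file names below are illustrative assumptions, not files
# supplied with the case study
scripts <- c("01_data_checking.R",
             "02_data_recoding.R",
             "03_descriptive_analysis.R")

# run each script in order, skipping any that do not exist yet
for (s in scripts) {
  if (file.exists(s)) {
    source(s)
  } else {
    message("Script not found, skipping: ", s)
  }
}
```

Keeping the order of the scripts in one place makes it easier to re-run the whole analysis from start to finish.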
## Q1: Create a plan of analysis
<!---CHUNK_PAGEBREAK--->
## Help Q1 (example):
**Data checking**
- Checks for completeness
- Checks for legal values (range, unexpected values): cross-tabulations
- Checking consistency of dates
- Histograms of continuous variables (age and date variables)
**Recoding the data**
- Recode continuous variables if needed (e.g. age into age groups)
- Recode string variables where appropriate
- Add labels if appropriate
- Possibly:
- Recode PatientType variable into hospitalised/not hospitalised
- Create a proxy for urban/rural
- Create an imported (yes/no) variable
- Create a “count” variable indicating that each line represents one case (which facilitates further analysis)
**Descriptive analysis (with a focus on 2015 data)**
- Describe the number of cases in 2015
- Describe age (age histogram, median and interquartile range)
- Describe sex, hospitalisation status of patient, region of notification, country of infection.
**Comparative analysis 2015 to 2014**
- Age histogram 2015 to 2014
- Comparison of age 2015 to 2014:
- Comparison of means
- Comparison of medians
- Comparison of proportion of male/female, hospitalised, urban/rural
- Choose the appropriate statistical tests and the appropriate level of confidence
**Calculating annual incidence rates**
- Calculate incidence rates and their 95% CI by year (for example with the *epiR* package)
- Plot the annual incidence rates
- Calculate incidence rates and their 95% CI by year and region
- Calculate age group-specific incidence rates and their 95% CI by year
**Calculate incidence rate ratios (examples)**
- Calculate incidence rate ratios, 95% CI and p-values between years with 2015 as the reference
- Calculate incidence rate ratios, 95% CI and p-values between the average of 2012-14 and 2015
- Calculate incidence rate ratios, 95% CI and p-values between urban and rural areas for 2015.
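As a rough illustration of the last two steps of the plan, the sketch below calculates an incidence rate with an exact Poisson 95% CI and an incidence rate ratio using the *poisson.test* function from base *R*. The case counts and the population are made-up numbers, not values from the case study data, and this is only one of several possible approaches.

```{r, eval = F}
# hypothetical numbers for illustration only (not from the case study data)
cases_2015 <- 120
cases_2014 <- 90
population <- 1000000   # assumed to be the same denominator in both years

# incidence rate for 2015 per 100,000 population with exact Poisson 95% CI
rate_2015 <- poisson.test(cases_2015, T = population)
ir    <- rate_2015$estimate * 100000
ir_ci <- rate_2015$conf.int * 100000

# incidence rate ratio of 2014 compared with 2015 (2015 as the reference)
irr <- poisson.test(c(cases_2014, cases_2015),
                    T = c(population, population))
irr$estimate   # rate ratio
irr$conf.int   # 95% CI of the rate ratio
irr$p.value    # p-value for the null hypothesis of a rate ratio of 1
```

The first element passed to *poisson.test* is the numerator of the ratio, so the second element acts as the reference group.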
## Q2: Data checking
Install and Load packages required for this case study. Ensure you are in your correct working directory. Import the worksheet “Crypto” in the Excel spreadsheet “crypto.xls”. Familiarise yourself with the data. Check the data: check completeness of the variables, cross-tabulate the data to check legal values, check consistency of dates, check continuous variables. NB: You can refer to the data dictionary in the background section.
Are there any issues in the data? Do you need to carry out any data cleaning?
Use *R-scripts*.
<!---CHUNK_PAGEBREAK--->
## Help Q2:
### Installing packages and functions
*R* packages are bundles of functions which extend the capability of R. Thousands of add-on packages are available in the main online repository (known as [CRAN](https://cran.r-project.org/)) and many more packages in development can be found on [GitHub](https://github.com/). They may be installed and updated over the Internet.
We will mainly use packages which come ready installed with R (base code), but where it makes things easier we will use add-on packages. All the R packages you need for the exercises can be installed over the Internet.
You will need to install these before starting. We will be using the following packages.
- *epiR*: for creating two by two tables and calculating incidence rates
- *broom*: for cleaning up the output of poisson regressions
Packages can be installed using the *install.packages* function, where you specify the name of the package in quotation marks and whether you also want to install other packages which are required to run the package of interest (where TRUE/FALSE means YES/NO); for example:
```{r, eval = F}
#install the epiR package and packages it depends on
install.packages("epiR", dependencies = TRUE)
```
If you want to install multiple packages at once, you can save their names as a character vector and pass it to the *install.packages* function. The names are assigned to an object using the arrow, and *c* combines the strings. You can call the object whatever you like; here we called it required_packages. For example:
```{r, eval = F}
# Installing multiple required packages for the case study
required_packages <- c("epiR", "broom")
install.packages(required_packages, dependencies = TRUE)
```
Alternatively, if you are unsure whether the packages are already installed you can run the following for-loop. It is not necessary to understand the code at this point; simply appreciate that it does the same as above, while also checking whether each package is already installed.
```{r, eval = F}
# Installing required packages for this case study
for (pkg in required_packages) {
if (!pkg %in% rownames(installed.packages())) {
install.packages(pkg)
}
}
```
<!---CHUNK_PAGEBREAK--->
Once you have successfully installed all your packages they are saved on your computer. This means that you only need to install them the first time.
After that, whenever you want to use a package in your *R* session, you need to load the package using the *library* function; the console will give you several messages in red, however these are most often just information and not errors. Note that here, you do not require quotation marks, for example:
```{r, eval = F}
library(epiR)
```
Here too, you can load multiple packages at the same time within a loop; again do not try to understand this just yet, just appreciate it is possible.
```{r, results='hide', message=FALSE, warning=FALSE}
# Loading required packages for this case study
required_packages <- c("epiR", "broom")
for (i in seq(along = required_packages)) {
library(required_packages[i], character.only = TRUE)
}
```
### Setting your working directory
You can check the path for your current working directory using the *getwd* function.
```{r, eval = F}
#Check your current working directory
getwd()
```
To set your working directory you can use the *setwd* function.
```{r, eval = F}
setwd("C:/Users/Username/Desktop/EpiconceptCrypto")
```
### Reading in files
Import the dataset from a comma separated value (.csv) file using the *read.csv* function, storing it as a data frame within *R* called crypto. For a CSV file the separator is normally a comma, however depending on the language of your operating system this can also be another character, for example a semi-colon. Here we also specify that we do not want to read in strings (character or grouped variables) as factors.
```{r}
crypto <- read.csv("crypto.csv", sep = ";" ,
stringsAsFactors = FALSE )
```
### Familiarise yourself with data
You can examine the structure of your data set using the following functions. The *str* function will provide an overview of which variable types are in your dataset. The *summary* function will give minimum, maximum, first and third quartiles as well as medians and means for variables which are not strings (characters). Each of these commands can also be run for individual variables. You can refer to an individual variable of a data set by using the **$**; for example, if you wanted to obtain a summary of a numeric age variable, you would write **summary(crypto\$age)**.
```{r, eval=F}
# str provides an overview of the number of observations and variable types
str(crypto)
# summary provides mean, median and max values of your variables
summary(crypto)
```
You can also look at completeness of specific variables by combining the *table* function with the *is.na* function. It is also possible to combine multiple conditions using logical operators such as "and" (using &) as well as "or" (using |). Note that *R* is case sensitive, so there is a difference between "Not specified" and "Not Specified" in the PatientType and CountryofInfection variables. This could also be achieved using a package as described in the appendix.
```{r, eval = F}
# Examine how many are missing or unknown in the AgeY variable
table(is.na(crypto$AgeY) | crypto$AgeY == "Unknown")
# missing, unknown or not specified in the PatientType variable
table(is.na(crypto$PatientType) |
crypto$PatientType == "Unknown" |
crypto$PatientType == "Not Specified")
# missing, unknown or not specified in the CountryofInfection variable
table(is.na(crypto$CountryofInfection) |
crypto$CountryofInfection == "Unknown" |
crypto$CountryofInfection == "Not specified")
```
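If you want to check completeness for several variables at once, one option is to apply the same rule to every column with *sapply*. The sketch below uses a small made-up data frame (toy) rather than the case study data, and it assumes that "Unknown" and "Not specified" (in any letter case) should both count as missing.

```{r, eval = F}
# toy data frame standing in for crypto (values are made up)
toy <- data.frame(
  AgeY = c("34", "Unknown", NA, "2"),
  PatientType = c("GP patient", "Not Specified", "Unknown", "Hospital inpatient"),
  stringsAsFactors = FALSE
)

# values treated as missing, compared in lower case (an assumption)
missing_codes <- c("unknown", "not specified")

# count missing or unknown values in every column
n_missing <- sapply(toy, function(x) {
  sum(is.na(x) | tolower(x) %in% missing_codes)
})
n_missing
```

Converting to lower case with *tolower* before comparing avoids having to list every spelling variant separately.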
You can also check if the onset date is before or on the same day as the notification date, and then return the corresponding IDs:
```{r, eval = F}
# check number not missing with onset on or before notification date
table(!is.na(crypto$Notif_Date) &
crypto$OnsetDate <= crypto$Notif_Date)
```
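Comparisons such as the one above only order dates correctly if the variables are of class Date. If your date columns were read in as text (which is what *read.csv* does), they are compared character by character instead. A minimal sketch with made-up dates, assuming the dd/mm/yyyy format given in the data dictionary:

```{r, eval = F}
# toy data with dates stored as text in dd/mm/yyyy format (made-up values)
toy_dates <- data.frame(
  ID = c("a", "b"),
  OnsetDate = c("28/01/2015", "05/03/2015"),
  Notif_Date = c("02/02/2015", "01/03/2015"),
  stringsAsFactors = FALSE
)

# compared as text: the result does not reflect the calendar order
toy_dates$OnsetDate <= toy_dates$Notif_Date

# convert to class Date, then compare again
toy_dates$OnsetDate  <- as.Date(toy_dates$OnsetDate, format = "%d/%m/%Y")
toy_dates$Notif_Date <- as.Date(toy_dates$Notif_Date, format = "%d/%m/%Y")
toy_dates$OnsetDate <= toy_dates$Notif_Date
```

You can check the class of a variable with the *class* function before comparing or plotting dates.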
There are two ways to select rows which have onset after notification. One way is to use the *subset* function, in which you specify the dataset in the x argument, then provide a rule for selecting rows in the subset argument and finally specify which columns to keep in the select argument. The second alternative involves using square brackets to subset the data frame; in this scenario what comes before the comma specifies rows and what comes after specifies columns; for example **dataset[rows, columns]**. Both options give the same outcome.
<!---CHUNK_PAGEBREAK--->
```{r, eval = F}
# return IDs, onset and notification dates for those with onset after notification
subset(
x = crypto,
subset = !is.na(crypto$Notif_Date) &
crypto$OnsetDate > crypto$Notif_Date,
select = c("ID", "OnsetDate", "Notif_Date")
)
# return IDs, onset and notification dates for those with onset after notification
crypto[which(!is.na(crypto$Notif_Date) &
crypto$OnsetDate > crypto$Notif_Date), c("ID", "OnsetDate", "Notif_Date")]
```
You can also check if all ages are within a reasonable age range. To do this first change "Unknown" to be NA and then create a new age variable which is AgeY in numeric form.
```{r}
# replace unknown with NA
crypto$AgeY[crypto$AgeY == "Unknown"] <- NA
# create new age variable as numeric of AgeY
crypto$age <- as.numeric(crypto$AgeY)
```
You can then use *summary* to get information about the age variable.
```{r, eval = F}
# summary provides mean, median and max values of age
summary(crypto$age)
```
For numeric variables, such as age and dates, you can use histograms to check for unusual patterns or outliers. You specify your variable as well as axis labels. To save a plot, first draw it, then use *dev.copy* to choose a file type and name; *dev.off* then closes the graphics device and finishes writing the file.
For date variables (which must be of class Date) you need to specify the time unit you would like to plot, such as "days", "weeks", "months" or "years". For date histograms the density is plotted by default, so you also need to specify that you want the frequency.
<!---CHUNK_PAGEBREAK--->
```{r, eval = F}
#Plot a histogram of age
hist(crypto$age,
xlab = "Age",
ylab = "Count"
)
#save histogram of age as a png file
dev.copy(png,'age.png')
dev.off()
#plot histogram of notification date
#choose days and frequency
hist(crypto$Notif_Date,
breaks = "days",
freq = TRUE,
xlab = "Notification date",
ylab = "Count"
)
#save as a png
dev.copy(png,'notificationdate.png')
dev.off()
#plot histogram of onset date
#choose days and frequency
hist(crypto$OnsetDate,
breaks = "days",
freq = TRUE,
xlab = "Onset date",
ylab = "Count"
)
#save as a png
dev.copy(png,'onsetdate.png')
dev.off()
```
<!---CHUNK_PAGEBREAK--->
## Q3: Data recoding
Rename all variable names to lower case. Recode string variables to numeric where this is useful. Add labels to relevant variables. Create the age bands used for the annual report (0-4 5-9 10-14 15-19 20-24 25-34 35-44 45-54 55-64 65+). Create a variable called “count” signifying that each record has one disease count.
Optional: Generate a new variable indicating if a patient is hospitalised or not. Create a variable for “urban/rural”, with Region 1 as a proxy for “urban”. Create a variable indicating if this is an imported case or not.
<!---CHUNK_PAGEBREAK--->
## Help Q3:
### Reading in files
Import the dataset from a comma separated value (.csv) file using the *read.csv* function, storing it as a data frame within *R* called crypto. For a CSV file the separator is normally a comma, however depending on the language of your operating system this can also be another character, for example a semi-colon. Here we also specify that we do not want to read in strings (character or grouped variables) as factors.
```{r}
crypto <- read.csv("crypto.csv", sep = ";" ,
stringsAsFactors = FALSE )
```
### Rename all variable names to all lowercase letters:
You can check and change the variable names in your dataset using the *names* function. Then using the *tolower* function you can re-assign these in lower case letters.
```{r}
names(crypto) <- tolower( names(crypto) )
```
### Recode string to numeric variables, where useful:
You have already done this in the previous section, but now the variable names are in lower case.
```{r}
# replace unknown with NA
crypto$agey[crypto$agey == "Unknown"] <- NA
# create new age variable as numeric of AgeY
crypto$age <- as.numeric(crypto$agey)
```
### Add labels where appropriate:
In order to add labels in *R* you have to change variables into factors. This allows you to specify levels (the order in which categories appear in output) and then label these levels.
```{r}
#re-write the sex variable as a factor defining levels and labels
crypto$sex <- factor(crypto$sex,
levels = c(1, 0),
labels = c("male", "female")
)
```
```{r, eval = F}
#Check the outcome
table(crypto$sex, useNA = "always")
```
### Create annual report age groups with labels:
There are several ways to do this. The simplest version is as below, for other options see the appendix.
```{r}
#generate an empty variable called ar_age
crypto$ar_age <- NA
#where age is under 5, set ar_age to 0
crypto$ar_age[crypto$age < 5] <- 0
#set the rest of the groups
crypto$ar_age[crypto$age >= 5 &
crypto$age < 10] <- 1
crypto$ar_age[crypto$age >= 10 &
crypto$age < 15] <- 2
crypto$ar_age[crypto$age >= 15 &
crypto$age < 20] <- 3
crypto$ar_age[crypto$age >= 20 &
crypto$age < 25] <- 4
crypto$ar_age[crypto$age >= 25 &
crypto$age < 35] <- 5
crypto$ar_age[crypto$age >= 35 &
crypto$age < 45] <- 6
crypto$ar_age[crypto$age >= 45 &
crypto$age < 55] <- 7
crypto$ar_age[crypto$age >= 55 &
crypto$age < 65] <- 8
crypto$ar_age[crypto$age >= 65] <- 9
#change to a factor and define labels
crypto$ar_age <- factor(crypto$ar_age,
levels = 0:9,
labels = c("0-4",
"5-9",
"10-14",
"15-19",
"20-24",
"25-34",
"35-44",
"45-54",
"55-64",
"65+"
)
)
```
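One of the other options is the *cut* function from base *R*, which assigns the age bands and labels in a single step. The sketch below uses a short vector of made-up ages rather than the case study data, so that it is easy to check which band each age falls into.

```{r, eval = F}
# made-up ages for illustration
age <- c(0, 4, 5, 24, 25, 64, 65, 100, NA)

# intervals are closed on the left (right = FALSE), so 5 falls in "5-9"
# Inf as the last break puts everyone aged 65 or over in "65+"
ar_age <- cut(age,
              breaks = c(0, 5, 10, 15, 20, 25, 35, 45, 55, 65, Inf),
              right = FALSE,
              labels = c("0-4", "5-9", "10-14", "15-19", "20-24",
                         "25-34", "35-44", "45-54", "55-64", "65+"))

# check the result, including missing values
table(ar_age, useNA = "always")
```

The result is already a factor with the levels in the right order, so no separate call to *factor* is needed.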
### Add a count variable that signifies one count of disease:
```{r}
crypto$count <- 1
```
### Save the file:
You can save your cleaned dataset as an R datafile (.Rda) using the *save* function and re-load the same dataset using the *load* function.
```{r, eval= F}
#save your dataset
save(crypto, file = "crypto.Rda")
```
### Optional
*NB.* If doing the optional recoding, please save the file at the end.
### Create a variable for “hospitalised”:
```{r}
#If hospital inpatient then 1 else 0
crypto$hospitalised <- ifelse(crypto$patienttype == "Hospital Inpatient",
1, 0)
#Not specified and unknown set to missing
crypto$hospitalised[crypto$patienttype == "Not Specified" |
crypto$patienttype == "Unknown"] <- NA
```
### Create a proxy for urban vs. rural:
```{r}
#if region is 1 then urban else rural
crypto$urban <- ifelse(crypto$region == 1, 1, 0)
#add order and labels
crypto$urban <- factor(crypto$urban,
levels = c(1, 0),
labels = c("urban", "rural")
)
```
### Create an imported variable:
```{r}
#If country X then not imported, else imported
crypto$imported <- ifelse(crypto$countryofinfection == "Country X", 0, 1)
#Not specified and unknown (in any letter case) set to missing
crypto$imported[tolower(crypto$countryofinfection) %in%
                  c("not specified", "unknown")] <- NA
#add order and labels
crypto$imported <- factor(crypto$imported,
levels = c(1, 0),
labels = c("Imported", "Country X")
)
```
```{r, echo = F}
#save your dataset
save(crypto, file = "crypto.Rda")
```
<!---CHUNK_PAGEBREAK--->
## Q4: Descriptive analysis
Use the recoded dataset you saved in the previous step (“crypto.Rda”). Focus on the year 2015. Describe the variables in the dataset. Summarise the results.
<!---CHUNK_PAGEBREAK--->
## Help Q4:
Open your dataset using the load function.
```{r}
#load your dataset
load("crypto.Rda")
```
Restrict your data to 2015 using the subset function. In this situation you over-write your dataset with the subset.
```{r}
#assign your 2015 subset to crypto (over-write original crypto)
crypto <- subset(
x = crypto,
subset = year == 2015
)
```
How many cases were notified?
```{r}
#check number of rows in your dataset
nrow(crypto)
```
Describe age.
```{r, eval = F}
#Plot a histogram of age
#you can specify a bar for each age with "breaks"
#you can set your x axis from 0-100 using "xlim"
hist(crypto$age,
xlab = "Age",
ylab = "Count",
breaks = 100,
xlim = c(0, 100)
)
#Get a summary of age
summary(crypto$age)
```
<!---CHUNK_PAGEBREAK--->
To plot side by side histograms you need to use the "par" function.
```{r, fig.width = 6}
#specify you want one row of two histograms
par(mfrow = c(1,2))
#plot a histogram for males (use squarebrackets to subset)
#give a title using "main",
#set the y axis limits using ylim
hist(crypto$age[crypto$sex == "male"],
main = "male",
xlab = "Age",
ylab = "Count",
breaks = 100,
xlim = c(0, 100),
ylim = c(0, 50) )
#plot a histogram for females
hist(crypto$age[crypto$sex == "female"],
main = "female",
xlab = "Age",
ylab = "Count",
breaks = 100,
xlim = c(0, 100),
ylim = c(0, 40) )
```
<!---CHUNK_PAGEBREAK--->
Describe sex. To see how to bind counts and proportions together into a single contingency table, see the appendix.
```{r, eval = F}
#get counts of sex
#save table as "counts"
counts <- table(crypto$sex)
#get proportions for counts table
prop.table(counts)
#you could also multiply by 100 and round to 2 digits
round(prop.table(counts)*100, digits = 2)
```
<!---CHUNK_PAGEBREAK--->
Describing hospitalised patients
```{r, eval = F}
#get counts of hospitalisations
#save table as "counts"
counts <- table(crypto$hospitalised)
#get rounded proportions of counts
round(prop.table(counts)*100, digits = 2)
```
Do the same for age groupings among hospitalised patients
```{r, eval = F}
# get counts of hospitalisations by agegroup
# save table as "counts"
counts <- table(crypto$ar_age, crypto$hospitalised)
# get rounded proportions of counts
# specify that you want row proportions (margin = 1)
round(
prop.table(counts, margin = 1) * 100,
digits = 2
)
```
Describe urban.
```{r, eval = F}
#get counts
#save table as "counts"
counts <- table(crypto$urban)
#get rounded proportions of counts
round(prop.table(counts)*100, digits = 2)
```
Describe imported.
```{r, eval = F}
#get counts
#save table as "counts"
counts <- table(crypto$imported)
#get rounded proportions of counts
round(prop.table(counts)*100, digits = 2)
```
<!---CHUNK_PAGEBREAK--->
## Q5: Comparative analysis 2015 vs. 2014
Use the recoded dataset you saved earlier (“crypto.Rda”). Focus on the year 2015 compared to 2014 data.
Look for differences in age, gender, levels of hospitalisation and urban/rural distribution. Think of which statistical tests to use where appropriate.
<!---CHUNK_PAGEBREAK--->
## Help Q5:
Open your dataset using the load function.
```{r}
#load your dataset
load("crypto.Rda")
```
Drop data from years that are not relevant using the subset function. For this analysis we are only interested in 2015 and 2014 data. In this situation you over-write your dataset with the subset.
```{r}
#assign your 2014-2015 subset to crypto (over-write original crypto)
crypto <- subset(
x = crypto,
subset = year >= 2014
)
#check cases per year
table(crypto$year, useNA = "always")
```
Compare age in 2015 to 2014:
```{r, eval = F}
# specify you want one row of two histograms
par(mfrow = c(1,2))
# plot a histogram for 2014 (use square brackets to subset)
# give a title using "main",
# set the y axis limits using ylim
hist(crypto$age[crypto$year == 2014],
main = "2014",
xlab = "Age",
ylab = "Count",
breaks = 100,
xlim = c(0, 100),
ylim = c(0, 100) )
# plot a histogram for 2015
hist(crypto$age[crypto$year == 2015],
main = "2015",
xlab = "Age",
ylab = "Count",
breaks = 100,
xlim = c(0, 100),
ylim = c(0, 100) )
```
<!---CHUNK_PAGEBREAK--->
Look at the median and the interquartile range and test for equality of distributions:
```{r, eval = F}
# use the aggregate function to group by year
# year must be as a list
# specify the function you would like to use (summary)
aggregate(crypto$age, by = list(crypto$year), FUN = summary)
# use the boxplot function to plot
boxplot(age~year, data = crypto)
```
```{r}
wilcox.test(crypto$age~crypto$year)
```
Look at the means (and standard deviations) and compare means using the t-test:
```{r, eval = F}
# use the aggregate function to group by year
# year must be as a list
# specify the function you would like to use (here mean and standard deviation)
aggregate(crypto$age, by = list(crypto$year),
          FUN = function(x) c(mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE)))
# use t.test function to compare means
t.test(crypto$age ~ crypto$year)
```
Comparison of proportion of male/female, hospitalised, urban/rural, imported/not imported:
```{r, eval = F}
# For sex
# get counts
# save table as "counts"
counts <- table(crypto$sex, crypto$year)
# get rounded proportions of counts
# margin = 2 for column proportions
round(prop.table(counts, margin = 2) * 100, digits = 2)
# chisq.test function requires you to input a table
chisq.test(counts)
```
```{r, eval = F}
# For hospitalised
# get counts
# save table as "counts"
counts <- table(crypto$hospitalised, crypto$year)
# get rounded proportions of counts
# margin = 2 for column proportions
round(prop.table(counts, margin = 2) * 100, digits = 2)
# chisq.test function requires you to input a table
chisq.test(counts)
```
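The chi-squared test relies on an approximation that can be poor when expected counts are small. A common rule of thumb (an assumption here, not a requirement of the case study) is to use Fisher's exact test instead when any expected count is below 5. The sketch below shows the idea on a made-up two by two table.

```{r, eval = F}
# made-up 2x2 table of hospitalisation by year (illustration only)
toy_counts <- matrix(c(2, 12, 7, 13), nrow = 2,
                     dimnames = list(hospitalised = c("1", "0"),
                                     year = c("2014", "2015")))

# expected counts under the null hypothesis of no association
# (suppressWarnings hides the warning R gives for small expected counts)
expected <- suppressWarnings(chisq.test(toy_counts))$expected

# choose the test depending on the smallest expected count
if (any(expected < 5)) {
  result <- fisher.test(toy_counts)
} else {
  result <- chisq.test(toy_counts)
}
result$p.value
```

With the case study data you would pass your own counts table instead of toy_counts.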