practical.Rmd

---
title: "Introduction to Statistical Analysis"
author: "D.-L. Couturier and M. Fernandes (with contributions of M. Eldridge, M. Dunning and S. Vowler)"

output:
  pdf_document:
    toc: yes
    toc_depth: '3'
  html_document:
    code_folding: show
    css: stylesheets/styles.css
    df_print: paged
    fig_height: 6
    fig_width: 9
    highlight: tango
    number_sections: yes
    theme: lumen
    toc: yes
    toc_depth: 3
    toc_float: yes
---

<!--- rmarkdown::render("~/courses/cruk/IntroductionToStatisticalAnalysis/git_IntroductionToStats/practical.Rmd") --->
<!--- setwd("~/courses/cruk/IntroductionToStatisticalAnalysis/git_IntroductionToStats/") --->
<img src="stylesheets/logo.png" style="position:absolute;top:0px;right:0px;" width="300" />


---

```{r eval=TRUE, echo=FALSE, warning=FALSE, results="asis"}
#BiocStyle::markdown()
library("knitr")
opts_chunk$set(tidy=FALSE,dev="png",fig.show="as.is",
               fig.width=10,fig.height=4,
               message=FALSE,eval=FALSE,warning=FALSE,echo=FALSE)
```

In this practical, we will use several 'real-life' datasets to demonstrate some of the concepts you have seen in the lectures. We will guide you through how to analyse these datasets in Shiny and the kinds of questions you should be asking yourself when faced with similar data.

To answer the questions in this practical we will be using apps that we have developed using the [Shiny](http://shiny.rstudio.com/gallery/) add-on for the *R* statistical package. **R** is a freely-available open-source software that is popular within academic and commercial communities. The functionality within the software compares favourably with other statistical packages (SAS, SPSS and Stata). The downside is that **R** has a steep learning-curve and requires a basic familiarity with command-line software. To ease the transition we have chosen to present this course using a series of online tools that will allow you to perform statistical analysis without having to worry about learning R. At the same time, the R code required for the analysis will be recorded in the background. You will therefore be able to repeat the analysis at a later date, or pass on to others. As you gain familiarity with R through other courses, you will see how the code generated by Shiny can be adapted to your own needs.

The datasets you will need for this practical should be downloaded and unzipped now:- https://rawgit.com/bioinformatics-core-shared-training/IntroductionToStats/master/CourseData.zip

------

# Central limit theorem


<span style="color:rgb(235, 7, 142)">**Question (i):**</span>

The tab **Estimated coverage of Student's CI** in the shiny app **central-limit-theorem** displays the confidence intervals of 100 simulated datasets. 

1. Assuming that the simulated data are normally distributed, what is the probability of the **true** mean belonging to a confidence interval?
2. Let X denote a random variable that equals 1 if the **true mean belongs to the confidence interval** and 0 otherwise. What is the distribution of X?     
<!--- 3. What is the probability that 0 confidence intervals out of 50 contain the **true mean** if data are normally distributed? (too complex) --->

<span style="color:rgb(235, 7, 142)">**Question (ii):**</span>

Using the shiny app **central limit theorem**, answer the following questions:
[http://bioinformatics.cruk.cam.ac.uk/apps/stats/central-limit-theorem/](http://bioinformatics.cruk.cam.ac.uk/apps/stats/central-limit-theorem/)

1. Simulate **1000 samples** of **size n=10** of **Poisson** random variates, first assuming a **mean of 0.25**, and then assuming a **mean of 100**. Compare the coverage level of Student's confidence intervals for the mean of these 2 simulation sets: How do you explain that the latter is better than the first one?
2.  Now consider **zero-inflated Poisson** variates with a **mean of 30** and a **10% probability of belonging to the clump-at-zero**. Can you think of a random variable having such a distribution? How large should the sample size be for the Student's confidence intervals to have good properties?     
3.  A student lost a few points in the statistic exams as the use of Student's confidence intervals for the probability of success of a **Bernoulli** variable with **pi = 40%** and a **n=100** was not considered as suitable. Should he/she contact the University to dispute his/her mark?

------

&nbsp;

# One-Sample Tests

Use our Shiny app [http://bioinformatics.cruk.cam.ac.uk/stats/OneSampleTest](http://bioinformatics.cruk.cam.ac.uk/stats/OneSampleTest)
to perform one-sample location tests.


&nbsp;

## The effect of disease on height
A scientist knows that the mean height of females in England is **165cm** and wants to know whether her patients with a certain disease "X" have heights that differ significantly from the population mean - we will use a one-sample t-test to test this. The data are contained in the file **`diseaseX.csv`**.

To import the file `diseaseX.csv`; you will need to select the `Choose File` option from the `Data Input` tab and navigate to where the course data are located on your laptop. The right-hand panel of the `Data Input` tab should update to show the Heights of various individuals in the study.

Also, on the `Data Input` tab you will need to change the value of **Hypothesized mean** to the correct value.

<span style="color:rgb(235, 7, 142)">**Question:**</span> What are your null and alternative hypotheses?

------

```{r}
cat("Null hypothesis: The mean height of female patients with disease X = 165cm 
      (the population mean for females)\n")
cat("Alternative hypothesis: The mean height of female patients with disease X != 165cm 
      (the population mean for females)\n")
```

```{r}
myfile <- "Practical/diseaseX.csv"
sep <- ','
quote <- '"'
header <-  TRUE 
skip <-  0 
data <- read.csv(myfile, header=header, sep=sep, quote=quote,skip=skip)
```

A histogram and boxplot of the `Height` variable will be automatically generated for you. To view it, click on the ***Data Distribution***. You can toggle whether to overlay a density plot on top of the boxplot, or choose different bin sizes for the histogram.

<span style="color:rgb(235, 7, 142)">**Question:**</span>  Do the data look normally distributed? Based on the plots, is the parametric one-sample t –test appropriate?

------

```{r}
datacol <-  2 
X <- data[,datacol]
colnames(data)[datacol] <- 'X'
library(ggplot2)
ggplot(data, aes(x=X)) + geom_histogram(aes(y=..density..),binwidth=.5,colour='black', fill='white')+ stat_function(fun=dnorm,color='red',args=list(mean=mean(data$X), sd=sd(data$X)))
```

```{r}
cat("In this case the data look normally distributed. Therefore, the one-sample t-test is appropriate.")
```

We are interested in knowing whether the mean height in our sample of patients with disease X is different from that of the general population. Perform a **one-sample t-test** by clicking the ***Statistical Analysis*** tab.

<span style="color:rgb(235, 7, 142)">**Question:**</span>  What is the mean height in your sample? What is your value of t? What is the p-value? How do you interpret the p-value?

```{r}
alternative <- 'two.sided'
mu <-  0 
t.test(X,mu=mu,alternative=alternative)
```

```{r}
cat("Mean height in sample = 171.1cm (95% CI: 169.3-172.9)
t = 7.23, 
df = 19,
p  <= 0.0001.

Under the null hypothesis, the probability of observing a t-statistic as extreme as 7.23, 
is very small P(t <= 7.23 | t >= 7.23) < 0.0001). Therefore, there is strong evidence 
to reject the null hypothesis in favour of the alternative hypothesis. There is strong evidence 
to suggest that the mean height in female patients with disease X is different to the 
population mean height of females of 165cm.")

```

&nbsp;

## Blood vessel formation

In blood plasma cancer, there is an increase in blood vessel formation in the bone marrow. A stem cell transplant can be used as a treatment for blood plasma cancer. The column *Difference* of the file **`bloodplasmacancer1.csv`** reports the difference in bone marrow micro vessel density after and before treatment for 7 patients.    

We are interested in seeing whether there is a decrease in the bone marrow micro vessel density after treatment with a stem cell transplant. 

Import the file **`bloodplasmacancer1.csv`** as described previously.

------

<span style="color:rgb(235, 7, 142)">**Question:**</span> What are your null and alternative hypotheses?

```{r}
cat("Null hypothesis: the bone marrow micro vessel density after treatment is greater 
than or equal to the bone marrow micro vessel density before treatment. 

Alternative hypothesis: the bone marrow micro vessel density after treatment 
is less than the bone marrow micro vessel density before treatment.")

```

------

View the histogram and boxplot of the paired differences on the ***Differences*** tab. 

<span style="color:rgb(235, 7, 142)">**Question:**</span> Do the differences look normally distributed? Is the parametric t–test appropriate?

------

We are interested in seeing whether there is a decrease in the bone marrow micro vessel density after treatment with a stem cell transplant. 

<span style="color:rgb(235, 7, 142)">**Question:**</span> Is this a one-tailed or two-tailed test?

------

```{r}
cat("One-tailed as we are only interested in a decrease. 
Usually a two-sided test is preferred unless there is a strong argument 
for a one-sided test. In this case our treatment is only considered to be effective 
if we see a reduction in the bone marrow micro vessel density after treatment. 
Observing an increase in bone marrow density after treatment would lead to the same 
action/conclusion as if no difference had been observed – the treatment might be 
dropped from the research programme (but bear in mind here, 
the sample size is small and so only large differences may be detected).")
```

Now select the correct options in the ***Statistical Analysis*** tab in order to perform the analysis. Ensure you select the one- or two-tailed test as appropriate. 

<span style="color:rgb(235, 7, 142)">**Question:**</span> What is the mean difference? What is your value of t? What is the p-value? How do you interpret the p-value?


```{r}
alternative <- 'greater'
paired <- TRUE
var.equal <- FALSE
t.test(value~variable,data=data,alternative=alternative,paired=paired,var.equal=var.equal)

```

```{r}
cat("Under the null hypothesis, the probability of observing a t-statistic as extreme as 1.84, is 0.06, 
slightly greater than 0.05, our nominal significance level. Our result is borderline. 
There is insufficient evidence to reject the null hypothesis. 
Therefore, we might conclude that there is an association of a decrease in bone marrow micro vessel 
density after treatment with bone marrow transplant. It is important to note the small sample size here. 
Studying just 7 patients means we will only be able to detect large differences.")
```


```{r results='hide'}
myfile <- "Practical/bloodplasmacancer2.csv"
sep <- ','
quote <- '"'
header <-  TRUE 
skip <-  0 
data <- read.csv(myfile, header=header, sep=sep, quote=quote,skip=skip)
data <- reshape2::melt(data)
colnames(data) <- c('variable','value')
library(ggplot2)
p <- ggplot(data, aes(x=value)) + geom_histogram(aes(y=..density..),colour='black',fill='white') + facet_wrap(~variable) + stat_function(fun=dnorm,color='red',args=list(mean=mean(data$value), sd=sd(data$value)))
newDf <- do.call(cbind,split(data$value,data$variable))
Diff <- data.frame(Difference=newDf[,1] - newDf[,2])
p2 <- ggplot(Diff,aes(x=Difference)) + geom_histogram(aes(y=..density..),colour='black',fill='white') + stat_function(fun=dnorm, color='red', args=list(mean=mean(Diff$Difference), sd=sd(Diff$Difference)))
p <- gridExtra::grid.arrange(p,p2)
p

```

```{r}
cat("From this histogram it is difficult to tell whether the differences between 
  the densities before and after treatment are normally distributed. In situations 
  like this, we may need to draw on the experience of similar sets of measurements. ")
```

------

&nbsp;

# Two-Sample Tests

Use our Shiny app [http://bioinformatics.cruk.cam.ac.uk/stats/TwoSampleTest](http://bioinformatics.cruk.cam.ac.uk/stats/TwoSampleTest)
to perform tests of equality of means/medians. <!--- [http://bioinformatics.cruk.cam.ac.uk/stats/contingency-table](http://bioinformatics.cruk.cam.ac.uk/stats/contingency-table) to perform tests of equality of proportions.--->

&nbsp;

## Biological processes duration

In the file **`bp_times.csv`**, we have the durations of a biological process for two samples of wild-type and knock-out cells (times in seconds). We are interested in seeing whether there is a difference in the durations for the two types of cells – we shall use an **independent t-test** to compare the two cell-types.

Import the data using ***Choose File*** as before. Make sure that the ***1st column is a factor?*** checkbox is ticked.

<span style="color:rgb(235, 7, 142)">**Question:**</span>  What are your null and alternative hypotheses?

------

```{r}
cat("Null hypothesis: there is no difference in the duration of the 
biological process for the two cell types.

Alternative hypothesis: there is a difference in the duration of the 
biological process for the two cell types.")
```

Histograms and boxplots to compare the two groups will be created for you automatically. You can also see a basic numerical summary of the data distribution.


<span style="color:rgb(235, 7, 142)">**Question:**</span>  Do the data look normally distributed for each cell-type? Is the independent t-test appropriate? What statistics are appropriate to report the location (mean or median) and spread (sd or IQR) of the data?

------

```{r}
myfile <- "Practical/bp_times.csv"
sep <- ','
quote <- '"'
header <-  TRUE 
skip <-  0 
data <- read.csv(myfile, header=header, sep=sep, quote=quote,skip=skip)
colnames(data) <- c('variable','value')
p <- ggplot(data, aes(x=value)) + geom_histogram(aes(y=..density..),colour='black',fill='white') + facet_wrap(~variable) + stat_function(fun=dnorm,color='red',args=list(mean=mean(data$value), sd=sd(data$value)))
p
```

```{r}
cat("The data do appear to be approximately normally distributed as we could 
easily draw a bell shape over each of the two histograms. The independent t-test is appropriate")
```

In order to apply the correct statistical test, we need to test to see if the variances of the two groups are comparable. This is tested for us automatically in the Shiny app. Click the ***Statistical Analysis*** tab to see the result of the "F-test". However, it is often easier to eye-ball the data to assess the variances.

<span style="color:rgb(235, 7, 142)">**Question:**</span> What do you conclude from the p-value of this test. Does this agree with your impression of the variances from the boxplot and histograms? How does it influence what test to use?

------

```{r}
cat("The test yields a p-value of 0.03568, which is sufficient evidence 
    to reject the null hypothesis that the variances of the two groups are the same. 
    Therefore we should apply Welch's correction.")
```


Now use the appropriate two-sample t-test to compare the durations of the two groups.

<span style="color:rgb(235, 7, 142)">**Question:**</span> What is your value of the test statistic? What is the p-value? How do you interpret the p-value? 

```{r}
alternative <- 'two.sided'
paired <- FALSE
var.equal <-FALSE
t.test(value~variable,data=data,alternative=alternative,paired=paired,var.equal=var.equal)

```

```{r}
cat("When we run the independent (unpaired) t test, a formal comparison of the variances 
between the two groups is automatically run. The corresponding p-value is 0.00125 which indicates 
evidence to reject the null hypothesis.")
```
------

&nbsp;

## Blood vessel formation

In blood plasma cancer, there is an increase in blood vessel formation in the bone marrow. A stem cell transplant can be used as a treatment for blood plasma cancer. The bone marrow micro vessel density was measured before and after treatment for 7 patients with blood plasma cancer.

We are interested in seeing whether there is a decrease in the bone marrow micro vessel density after treatment with a stem cell transplant. We will use a paired two-sample t-test to compare the before and after bone marrow micro vessel densities. 

The data are contained in the file **`bloodplasmacancer2.csv`**. Import the data, making sure that ***`1st column is a factor`*** is *not* ticked. Now choose whether you will be performing a paired test or not by ticking the **Paired Samples?** box under ***Are your samples paired?***. 

<span style="color:rgb(235, 7, 142)">**Question:**</span> What are your null and alternative hypotheses?

------

Check the test statistic and p-value of the test of interest. 

<span style="color:rgb(235, 7, 142)">**Question:**</span> How is it that they  match the ones of ** Exercise 2.2** ? 


------

&nbsp;


## Gene Expression in Breast Cancer patients

A gene expression study was performed on patients categorised into positive and negative Estrogen Receptor (ER) groups. It is well-known that ER positive patients have more treatment options available and thus have more better prognosis.

The gene NIBP was measured as part of this study and the results are available in the file `NIBP.expression.csv`. We are interested to see if the expression level of the gene is different between ER positive and negative patients.


<span style="color:rgb(235, 7, 142)">**Question:**</span> What are your null and alternative hypotheses?

------

Now conduct an independent two-sample t-test to see if there is a difference in expression between the two groups.

<span style="color:rgb(235, 7, 142)">**Question:**</span> What is the p-value from the test? Do we achieve statistical significance at the 0.05 level?

------

Look closely at data distribution, calculated means for each group and the estimated confidence interval

<span style="color:rgb(235, 7, 142)">**Question:**</span> Is the finding likely to hold Biological significance? Would you be willing to put further resources into validating the finding?

------


&nbsp;

## Vitamin D levels

The file **`vitd.csv`** contains data on vitamin D levels for subjects with ("Y"), and without ("N") fibrosis. To import these data, you will need to select the ***1st column in a factor*** option. 

<span style="color:rgb(235, 7, 142)">**Question:**</span> State the null and alternative hypotheses

------

Examine the distribution of the data. 

<span style="color:rgb(235, 7, 142)">**Question:**</span> Why doesn't a parametric analysis seem appropriate? 

------

By un-ticking the ***Use Parametric Test?*** option in the ***Statistical Analysis*** tab you will see the results of a Mann-Whitney U (/ Wilxcoxon rank-sum test) test. 

<span style="color:rgb(235, 7, 142)">**Question:**</span> How do you interpret the value of the test? 

------


&nbsp;

## Birth-weight of twins

Dr D. R. Peterson of the Department of Epidemiology, University of Washington, collected the data found in file **`twins.csv`**.  It consists of the birth-weights of each of 20 dizygous twins.  One twin suffered Sudden Infant Death Syndrom (SIDS), and the other twin did not. The hypothesis to be tested is that the SIDS child of each pair had a lower birth-weight.  

<span style="color:rgb(235, 7, 142)">**Question:**</span> State the null and alternative hypothesises

------

Decide on the level of significance to be used and whether the test should be one-sided or two-sided.

<span style="color:rgb(235, 7, 142)">**Question:**</span> Would be appropriate to treat these as paired or independant samples?

------

Recall from the lectures that for paired data we have to consider whether the differences are symmetrical about zero. Carry out both the sign-test and Wilcoxon signed rank tests on the data.  

<span style="color:rgb(235, 7, 142)">**Question:**</span> Do both tests draw the same conclusion about the data?  Which test is the most appropriate?

------

```{r}
cat("Both tests show that there is insufficient evidence to reject the null hypothesis.  The Wilcoxon Signed Ranks test with a p-value of 0.258 and the sign test with a p-value of 0.5. Which test depends on whether this graph is symmetric about the black line or not.  If you think that it is then the assumption of the symmetry of difference scores about the true median difference for the Wilcoxon Signed ranks test holds and this test can be used. If not then the sign test should be used." )
```

&nbsp;

<!---
## Disease association

The following table gives the frequencies of wild-type and knock-out mice developing a disease thought to be associated to the absence of the knock-out gene. 

<div style='width:50%; margin:auto'>
&nbsp;     | WT | KO | Total
-----------|---:|---:|------:
Disease    |  1 |  7 |   8
No disease |  9 |  3 |  12
Total      | 10 | 10 |  20
</div>

<span style="color:rgb(235, 7, 142)">**Question:**</span> What are your null and alternative hypotheses?

------

```{r}
cat("Null hypothesis: there is no association between mouse type and disease X

Alternative hypothesis: there is an association between mouse type and disease X")

```

<span style="color:rgb(235, 7, 142)">**Question:**</span> What are your expected frequencies? 

------

```{r}
.Table <- matrix(c(1,7,9,3), 2, 2, byrow=TRUE)
rownames(.Table) <- c('Disease', 'No Disease')
colnames(.Table) <- c('WT', 'KO')
.Test <- chisq.test(.Table, correct=FALSE)
.Test$expected
```

Enter the data into the [Shiny app](http://bioinformatics.cruk.cam.ac.uk/stats/contingency-table/). Select the **Fisher's exact test** option to compare the proportion of mice in each group that developed the disease.

<span style="color:rgb(235, 7, 142)">**Question:**</span>  What is your p-value? How do you interpret the result?
---> 

------
```{r}
.Table <- matrix(c(1,7,9,3), 2, 2, byrow=TRUE)
rownames(.Table) <- c('Disease', 'No Disease')
colnames(.Table) <- c('WT', 'KO')
.Test <- chisq.test(.Table, correct=FALSE)
.Test$expected
ft<- fisher.test(.Table)
ft
```


```{r}
cat("p = 0.02. Under the null hypothesis, there is a small probability (p=0.02<0.05) 
of observing such an extreme distribution of the mice given the observed row and column totals. 
There is evidence to reject the null hypothesis in favour of the alternative hypothesis. 
There is evidence of an association between mouse type and disease X.")
```

&nbsp;

# Small-Group Exercise: Choosing a test 

In this section, we invite you to form small groups. Each group will be assigned one of the exercises. 

At the end of the time assigned for the exercise we will go through each of the problems in turn and invite a representative of each group to present the problem to the rest of the class along with the analysis (descriptive analysis, statistical tests) the group felt was most appropriate and any conclusions made.

If time allows, it would be beneficial for groups to familiarize themselves with some of the other exercises so that they can contribute to the presentations made by other groups.

You should use this [interactive document](https://public.etherpad-mozilla.org/p/2019-02-12-intro-to-stats) to record your observations.

```{r}
library(UsingR)
library(HistData)
library(ggplot2)
library(dplyr)
library(tidyr)
library(breastCancerNKI)
library(Biobase)
```

&nbsp;

## Group Exercise 1: Plant Growth `data1.csv`

Darwin (1876) studied the growth of *pairs* of zea may (aka corn) seedlings, one produced by cross-fertilization and the other produced by self-fertilization, but otherwise grown under identical conditions. His goal was to demonstrate the greater vigour of the cross-fertilized plants. The data recorded are the final height (inches, to the nearest 1/8th) of the plants in each pair.


<span style="color:rgb(235, 7, 142)">*Is there evidence to support the hypothesis of greater growth in Cross-fertilized plants?*</span>

```{r eval=FALSE,echo=FALSE}


library(HistData)
library(tidyr)
library(dplyr)
data("ZeaMays")
td <- select(ZeaMays,cross,self)
write.csv(td, file="mystery-data/data1.csv",quote=FALSE,row.names=FALSE)


```


```{r}
##In the Design of Experiments, Fisher (1935) used these data to illustrate a paired t-test 
##(well, a one-sample test on the mean difference, cross - self). 
##Later in the book (section 21), he used this data to illustrate an early example of a 
##non-parametric permutation test, treating each paired difference as having (randomly) 
##either a positive or negative sign.

```

&nbsp;

## Group Exercise 2: Florence Nightingale `data2.csv`

In the history of data visualization, Florence Nightingale is best remembered for her role as a social activist and her view that statistical data, presented in charts and diagrams, could be used as powerful arguments for medical reform.

After witnessing deplorable sanitary conditions in the Crimea, she wrote several influential texts (Nightingale, 1858, 1859), including polar-area graphs (sometimes called "Coxcombs" or rose diagrams), showing the number of deaths in the Crimean from battle compared to disease or preventable causes that could be reduced by better battlefield nursing care.

Her Diagram of the Causes of Mortality in the Army in the East showed that most of the British soldiers who died during the Crimean War died of sickness rather than of wounds or other causes. It also showed that the death rate was higher in the first year of the war, before a Sanitary Commissioners arrived in March 1855 to improve hygiene in the camps and hospitals.


<span style="color:rgb(235, 7, 142)">*Do the data support the claim that deaths due to avoidable causes decreased after a change in regime?*</span>


```{r eval=FALSE,echo=FALSE}

data("Nightingale")

require(reshape)
Night<- Nightingale[,c(1,8:10)]
melted <- melt(Night, "Date")
names(melted) <- c("Date", "Cause", "Deaths")
melted$Cause <- sub("\\.rate", "", melted$Cause)
melted$Regime <- ordered( rep(c(rep('Before', 12), rep('After', 12)), 3), 
                          levels=c('Before', 'After'))
Night <- melted

# subsets, to facilitate separate plotting
Night1 <- subset(Night, Date < as.Date("1855-04-01"))
Night2 <- subset(Night, Date >= as.Date("1855-04-01"))

# sort according to Deaths in decreasing order, so counts are not obscured [thx: Monique Graf]
Night1 <- Night1[order(Night1$Deaths, decreasing=TRUE),]
Night2 <- Night2[order(Night2$Deaths, decreasing=TRUE),]

# merge the two sorted files
Night <- rbind(Night1, Night2)


filter(Night, Cause=="Disease") %>% select(Regime,Deaths) %>% write.csv(file="mystery-data/data2.csv",quote=FALSE,row.names=FALSE)

filter(Night, Cause=="Disease") %>% select(Regime,Deaths) %>% ggplot(aes(x=Regime,y=Deaths)) + geom_boxplot()
Night.flt <- Night %>% filter(Cause=="Disease") %>% select(Regime,Deaths)


```


```{r}
##Non-parametric
##Non-paired

```

&nbsp;

## Group Exercise 3: Effect of bran on diet: `data3.csv`

The addition of bran to the diet has been reported to benefit patients with diverticulosis. Several different bran preparations are available, and a clinician wants to test the efficacy of two of them on patients, since favourable claims have been made for each. Among the consequences of administering bran that requires testing is the transit time through the alimentary canal. By random allocation the clinician selects two groups of patients aged 40-64 with diverticulosis of comparable severity. Sample 1 contains 15 patients who are given treatment A, and sample 2 contains 12 patients who are given treatment B.

<span style="color:rgb(235, 7, 142)">*Does transit time differ in the two groups of patients taking these two preparations?*</span>


```{r echo=FALSE,eval=FALSE}


data <- data.frame(Group = c(rep("A", 15),rep("B", 12)), Time = c(44,51,52,55,60,62,66,68,69,71,71,76,82,91,108, 
                                                                  52,64,68,74,79,83,84,88,95,97,101,116))

write.csv(data, file="mystery-data/data3.csv",quote=FALSE,row.names=FALSE)
t.test(Time~Group,data,var.equal=TRUE)
```


```{r}
##http://www.bmj.com/about-bmj/resources-readers/publications/statistics-square-one/7-t-tests
##"---Parametric
##-Non-paired
##-Equal variances

```

&nbsp;

## Group Exercise 4: Effect of Autism drug  `data4.csv`

Consider a clinical investigation to assess the effectiveness of a new drug designed to reduce repetitive behaviors in children affected with autism. If the drug is effective, children will exhibit fewer repetitive behaviors on treatment as compared to when they are untreated. A total of 8 children with autism enroll in the study. Each child is observed by the study psychologist for a period of 3 hours both before treatment and then again after taking the new drug for 1 week. The time that each child is engaged in repetitive behavior during each 3 hour observation period is measured. Repetitive behavior is scored on a scale of 0 to 100 and scores represent the percent of the observation time in which the child is engaged in repetitive behavior. For example, a score of 0 indicates that during the entire observation period the child did not engage in repetitive behavior while a score of 100 indicates that the child was constantly engaged in repetitive behavior. 

<span style="color:rgb(235, 7, 142)">*Is there statistically significant improvement in repetitive behavior after 1 week of treatment?*</span>


```{r echo=FALSE,eval=FALSE}
###http://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/BS704_Nonparametric/BS704_Nonparametric5.html
## Non-parametric
## Non-matched
## Sign-test
data <- data.frame(Before=c(85,70,40,65,80,75,55,20),After = c(75,50,50,40,20,65,40,25))

write.csv(data, file="mystery-data/data4.csv",quote=FALSE,row.names=FALSE)


```

&nbsp;

## Group Exercise 5: CD4 `data5.csv`

CD4 cells are carried in the blood as part of the human immune system. One of the effects of the HIV virus is that these cells die. The count of CD4 cells is used in determining the onset of full-blown AIDS in a patient. In this study of the effectiveness of a new anti-viral drug on HIV, 20 HIV-positive patients had their CD4 counts recorded and then were put on a course of treatment with this drug. After using the drug for one year, their CD4 counts were again recorded. 

<span style="color:rgb(235, 7, 142)">*Do patients taking the drug have increased CD4 counts?*</span>


```{r eval=FALSE,echo=FALSE}
data <- read.csv("http://vincentarelbundock.github.io/Rdatasets/csv/boot/cd4.csv")[,-1]
write.csv(data,file="mystery-data/data5.csv",quote=FALSE,row.names=FALSE)

boxplot(data)
t.test(data[,1],data[,2],paired=TRUE)
```

```{r}
##Parametric
##Paired
##Equal variances

```

&nbsp;

## Group Exercise 6: Drink Driving `data6.csv`

Drunk driving is one of the main causes of car accidents. Interviews with drunk drivers who were involved in accidents and survived revealed that one of the main problems is that drivers do not realize that they are impaired, thinking “I only had 1-2 drinks … I am OK to drive.”

A sample of 100 drivers was chosen, and their reaction times in an obstacle course were measured *before* and *after* drinking two beers. The purpose of this study was to check whether drivers are impaired after drinking two beers

<span style="color:rgb(235, 7, 142)">*Does drinking beer alter the reaction time of the driver?*</span>


```{r echo=FALSE,eval=FALSE}
set.seed("14102016")
## http://bolt.mph.ufl.edu/6050-6052/unit-4b/module-13/paired-t-test/
# Study of 20 people initially. Let's fake the data to add a few more observations....

data <- read.csv("http://media.news.health.ufl.edu/misc/bolt/Intro/files/unit4/Mod13Data/beers.csv")

data2 <- data.frame(Before = rnorm(100, mean=mean(data[,1]),sd = sd(data[,1])),After = rnorm(100, mean=mean(data[,2]),sd=sd(data[,2])))

write.csv(data2,file="mystery-data/data6.csv",quote=FALSE,row.names=FALSE)


```

```{r}
## http://bolt.mph.ufl.edu/6050-6052/unit-4b/module-13/paired-t-test/
## Parametric
## Paired
## non-equal variances

```

&nbsp;

## Group Exercise 7: Pollution in Trees `data7.csv`

Laureysens et al. (2004) measured metal content in the wood of 13 poplar clones growing in a polluted area, once in August and once in November. Concentrations of aluminum (in micrograms of Al per gram of wood) are shown below.

<span style="color:rgb(235, 7, 142)">*Is there any evidence for an increase in pollution between November and August?*</span>


```{r eval=FALSE,echo=FALSE}
## http://www.biostathandbook.com/wilcoxonsignedrank.html
##"The differences are somewhat skewed; the Wolterson clone, in particular, 
##has a much larger difference than any other clone. To be safe, 
##the authors analyzed the data using a Wilcoxon signed-rank test"

data <- data.frame(August = c(8.1,10,16.5,13.6,9.5,8.3,18.3,13.3,7.9,8.1,8.9,12.6,13.4), November = c(11.2,16.3,15.3,15.6,10.5,15.5,12.7,11.1,19.9,20.4,14.2,12.7,36.8))

write.csv(data, file="mystery-data/data7.csv",quote=FALSE,row.names=FALSE)
boxplot(data)
```

```{r}
## http://www.biostathandbook.com/wilcoxonsignedrank.html
## Non-parametric
### Paired
##"The differences are somewhat skewed; the Wolterson clone, in particular, has a much larger difference than any other clone. To be safe, the authors analyzed the data using a Wilcoxon signed-rank test"
```

&nbsp;

## Group Exercise 8: Salaries for Professors  `data8.csv`

The 2008-09 nine-month academic salary for Assistant Professors, Associate Professors and Professors in a college in the U.S. The data were collected as part of the on-going effort of the college's administration to monitor salary differences between male and female faculty members. (salary given as nine-month salary, in dollars.)

<span style="color:rgb(235, 7, 142)">*Is there evidence that Female professors are paid differently to their Male counterparts?*</span>


```{r eval=FALSE,echo=FALSE}
library(car)
data("Salaries")
library(dplyr)
td <- filter(Salaries,rank=="Prof") %>% select(sex, salary) %>% 
write.csv("mystery-data/data8.csv",row.names=FALSE,quote=FALSE)

```


```{r}
## Parametic
## Non-paired
### Non-equal variances
```

```{r eval=FALSE,echo=FALSE}
zip(zipfile="mystery-data.zip",files=paste("mystery-data/",list.files("mystery-data/"),sep="/"))
```