-
Notifications
You must be signed in to change notification settings - Fork 0
/
Two_sample_t.qmd
380 lines (241 loc) · 12.9 KB
/
Two_sample_t.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
---
title: "Two sample t test"
subtitle: "Explaining the Two sample t test and how to run it in R"
title-block-banner: true
date: "`r Sys.Date()`"
author:
- name: Tendai Gwanzura & Ana Bravo
affiliations:
- Florida International University
- Robert Stempel College of Public Health and Social Work
format:
html:
code-fold: true
html-math-method: katex
toc: true
editor: visual
theme: zephyr
highlight-style: pygemnts
---
# Two sample t test
This is also called the independent sample *t* test. It is used to see whether the unknown population means of two groups are equal or different. This test requires one variable which can be the exposure *x* and another variable which can be the outcome *y*. If you have more than two groups then analysis of variance *(ANOVA)* will be more suitable. If data is nonparametric then an alternative test to use would be the *Mann Whitney U test* or a *permutation test*.[@cressie1986].
There are two types of t tests, the first being the **Student's t test,** which assumes the variance of the two groups is equal, the second being the **Welch's t test** (default in R), which assumes the variance in the two groups is different.
In this article we will be discussing the **Student's t test**.
## Assumptions
- Measurements for one observation do not affect measurements for any other observation (assumes independence).
- Data in each group must be obtained via a random sample from the population.
- Data in each group are normally distributed.
- Data values are continuous.
- The variances for the two independent groups are equal in the **Student's t test**.
- There should be no significant outliers.
## Hypotheses
1. $(H_0)$: the mean of group A $(m_A)$ is equal to the mean of group B $(m_B)$- two tailed test.
2. $(H_0)$: the mean of group A $(m_A)$ is greater than or equal to mean of group B $(m_B)$- one tailed test.
3. $(H_0)$: the mean of group A $(m_A)$ is less than or equal to the mean of group B $(m_B)$- one tailed test.
The corresponding alternative hypotheses would be as follows:
<!-- -->
1. $(H_1)$: the mean of group A $(m_A)$ is different to the mean of group B $(m_B)$- two tailed test.
2. $(H_1)$: the mean of group A $(m_A)$ is less than the mean of group B $(m_B)$- one tailed test.
3. $(H_1)$: the mean of group A $(m_A)$ is greater the mean of group B $(m_B)$- one tailed test.
## Statistical hypotheses formula
For the **Student's t test** which assumes equal variance the following is how the \|t\| statistic may be calculated using groups A and B as examples:
$$t ={ {\bar{x}_{1} - \bar{x}_{2}} \over \sqrt{ s^2( {1 \over n_1 } + {1 \over n_2}) }}$$
This can be described as the sample mean difference divided by the sample standard deviation of the sample mean difference where:
$m_A$ and $m_B$ are the mean values of A and B,
$n_A$ and $n_B$ are the seize of group A and B,
$S^2$ is the estimator for the pooled variance,
with the degrees of freedom (*df)* = $n_A + n_B - 2$,
and $s^2$ is calculated as follows:
$$s^2 = { {\displaystyle\sum_{i=1}^{n_1} { (x_i-\bar{x}_1)^2} + \displaystyle\sum_{j=1}^{n_2}{ (x_j-\bar{x}_2)^2}} \over {n_{1} + n_{2} - 2 }}$$
Results for both **Students t test** and **Welch's t test** are usually similar unless the group sizes and standard deviations are different.
**What if the data is not independent?**
If the data is not independent such as paired data in the form of matched pairs which are correlated, we use the *paired t test*. This test checks whether the means of two paired groups are different from each other. It's usually used in clinical trial studies with a "before and after" or case control studies with matched pairs. For this test we only assume the difference of each pair to be normally distributed (the paired groups are the ones important for analysis) unlike the *independent t test* which assumes that data from both samples are independent and variances are equal.[@fralick]
------------------------------------------------------------------------
## Example
### Prerequisites
1. `tidyverse`: data manipulation and visualization.
2. `rstatix`: providing pipe friendly R functions for easy statistical analyses.
3. `car`: providing variance tests.
```{r, install }
#| echo: true
#install.packages("ggstatplot")
#install.packages("car")
#install.packages("rstatix")
#install.packages(tidyVerse)
```
### Dataset
This example dataset sourced from [kaggle](https://www.kaggle.com/datasets/uciml/student-alcohol-consumption/versions/2?resource=download) was obtained from surveys of students in Math and Portuguese classes in secondary school. It contains demographic information on gender, social and study information.
```{r, load-data}
#| echo: true
# load the dataset
stu_math <- read.csv("student-mat.csv")
```
```{r, load-libraries}
#| echo: true
#| message: false
#| warning: false
# load relevant libraries
library(rcompanion)
library(car)
library (gt)
library(gtsummary)
library(ggpubr)
library(rstatix)
library(tidyverse)
```
**Checking the data**
```{r}
#| echo: true
# check the data
glimpse(stu_math)
```
In total there are 395 observations and 33 variables. We will drop the variables we do not need and keep the variables that will help us answer the following: **Is there a difference between boys and girls in math final grades?**
$H_0$: There is no statistical difference between the final grades between boys and girls.
$H_1$: There is a statistically significant difference in the final grades between the two groups.
```{r, creating-subset}
#| echo: true
# creating a subset of the data
math = subset(stu_math, select= c(sex,G3))
glimpse(math)
```
**Summary statistics**- the dependent variable is continuous *(grades=G3)* and the independent variable is character but binary *(sex)*.
```{r}
#| echo: true
# summarizing our data
summary(math)
```
We see that data ranges from 0-20 with 0 being people who were absent and could not take the test therefore missing data. We remove these 0 values before running the t test. However other models should be considered such as the zero inflated model to differentiate those who truly got a 0 and those who were not present to take test.
```{r}
#| echo: true
# creating a boxplot to visualize the data with no outliers
math2 = subset(math, G3>0)
boxplot(G3 ~ sex,data=math2)
```
**Visualizing the data**- we can use histograms and box lots to visualize the data to check for outliers and distribution thus checking for normality.
```{r}
#| echo: true
# Histograms for data by groups
male = math2$G3[math2$sex == "M"]
female = math2$G3[math2$sex == "F"]
# plotting distribution for males
plotNormalHistogram(male, breaks= 20,xlim=c(0,20),main="Distribution of the grades for males ", xlab= "Math Grades")
```
Final grades for males seem to be normally distributed from 0-20. Data is approximately normal because we have a large amount of bins.
```{r}
#| echo: true
# plotting distribution for females
plotNormalHistogram(female, breaks= 20,xlim=c(0,20),main="Distribution of the grades for females ", xlab= "Math Grades")
```
Final grades for females also appear to be normally distributed. The final score across both is almost evenly distributed. However there seem to be a significant number of individuals who failed the test (grade=0).
```{r}
#| echo: true
# plotting bar plot to see the distribution in sample size
sample_size = table(math2$sex)
barplot(sample_size,main= "Distribution of sample size by sex")
```
The bar graph shows that there are slightly more females in the sample than males.
**Identifying outliers**
```{r}
#| echo: true
# creating a boxplot to visualize the outliers (G3=0)
boxplot(G3 ~ sex,data=math2)
```
The box plot shows us that there are no outliers as these have been removed in terms of people who had a score of 0. This score is not truly reflective of the performance between boys and girls as a grade of 0 may represent absentia or other reasons for the test not been taken. Therefore we opt to drop the outliers. We will compare to see if this decision affects the mean which appears similar from the above plot.
```{r}
#| echo: true
# finding the mean for the groups with outliers
mean(math$G3[math$sex=="F"])
mean(math$G3[math$sex=="M"])
# finding the mean for the groups without outliers
mean(math2$G3[math2$sex=="F"])
mean(math2$G3[math2$sex=="M"])
```
The mean has increased slightly and the difference decreased after removing the outliers but the distribution is still the same.
**Check the equality of variances (homogeneity)**
We can use the *Levene's test* or the *Bartlett's test* to check for homogeneity of variances. The former is in the `car` library and the later in the `rstatix` library. If the variances are homogeneous the p value will be greater than 0.05.
Other tests include *F test 2 sided*, *Brown-Forsythe* and *O'Brien* but we shall not cover these.
```{r}
#| echo: true
#| message: false
#| warning: false
# running the Bartleet's test to check equal variance
bartlett.test(G3~sex, data=math2)
# running the Levene's test to check equal variance
math2 %>% levene_test(G3~sex)
```
The p value is greater than 0.05 suggesting there is no difference between the variances of the two groups.
### Assessment
1. Data is continuous(G3)
2. Data is normally distributed
3. Data is independent (males and females distinct not the same individual)
4. No significant outliers
5. There are equal variances
As the assumptions are met we go ahead to perform the **Student's t test**.
### **Performing the two-sample *t*-test**
Since the default is the **Welch t test** we use the $\color{blue}{\text{var.eqaul = TRUE }}$ code to signify a **Student's t test**. There is a `t.test()` function in `stats` package and a `t_test()` in the `rstatix` package. For this analysis we use the `rstatix` method as it comes out as a table.
```{r}
#| echo: true
# perfoming the two sample t test
stat.test <- math2 %>%
t_test(G3~ sex, var.equal=TRUE) %>%
add_significance()
stat.test
```
```{r}
stat.test$statistic
```
The results are represented as follows;
- **y** - dependent variable
- **group1, group 2** - compared groups(independent variables)
- **df** - degrees of freedom
- **p** - p value
`gtsummary` table of results
```{r}
#| echo: true
#|
math2 |>
tbl_summary(
by = sex,
statistic =
list(
all_continuous() ~ "{mean} ({sd})",
all_dichotomous() ~ "{p}%")
) |>
add_n() |>
add_overall() |>
add_difference()
```
**Interpretation of results**
For the two sample t test with t(355) = -1.940477, p \< 0.0531, the p value is greater than our alpha of 0.05 , we fail to reject the null hypothesis and conclude that there is no statistical difference between the means of the two groups. There is no difference in final grades between boys and girls. *(A significant \|t\| would be 1.96).*
**Effect size**
*Cohen's d* can be an used as an effect size statistic for the two sample t test. It is the difference between the means of each group divided by the pooled standard deviation.
$d= {m_A-m_B \over SD_pooled}$
It ranges from 0 to infinity, with 0 indicating no effect where the means are equal. 0.5 means that the means differ by half the standard deviation of the data and 1 means they differ by 1 standard deviation. It is divided into small, medium or large using the following cut off points.
- ***small*** 0.2-\<0.5
- ***medium*** 0.5-\<0.8
- ***large*** \>=0.8
For the above test the following is how we can find the effect size;
```{r}
#| echo: true
#perfoming cohen's d
math2 %>%
cohens_d(G3~sex,var.equal = TRUE)
```
The effect size is ***small*** d= -0.20.
In conclusion, a two-samples t-test showed that the difference was not statistically significant, t(355) = -1.940477, p \< 0.0531, d = -0.20; where, t(355) is shorthand notation for a t-statistic that has 355 degrees of freedom and d is *Cohen's d*. We can conclude that the females mean final grade is greater than males final grade (d= -0.20) but this result is not significant.
**What if it is one tailed t test?**
Use the $\color{blue}{\text{alternative =}}$ option to determine if one group is $\color{blue}{\text{"less"}}$ or $\color{blue}{\text{"greater"}}$. For example if we want to see whether the final grades for females are greater than males we can use the following code:
```{r}
#| echo: true
# perfoming the one tailed two sample t test
stat.test <- math2 %>%
t_test(G3~ sex, var.equal=TRUE, alternative = "greater") %>%
add_significance()
stat.test
```
The p value is greater than 0.05 (p=0.973), we fail to reject the null hypothesis. We conclude that the final grades for females are not significantly greater than for males.
**What about running the paired sample t test?**
We can simply add the syntax $\color{blue}{\text{paired= TRUE}}$ to our `t_test()` to run the analysis for matched pairs data.
## Conclusion
This article covers the **Student's t test** and how we run it in R. It also shows how we find the effect size and how we can conclude the results.
## **References**