-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathPS8.qmd
127 lines (79 loc) · 6.5 KB
/
PS8.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
---
title: "PS8 - Scatterplots and Correlation"
format: pdf
jupyter: python3
---
# Introduction
Please watch [Correlation Does Not Equal Causation](https://www.youtube.com/watch?v=GtV-VYdNt_g&list=PL8dPuuaLjXtNM_Y-bUAhblSAdWRnmBUcr&index=9&t=1s) before starting the problem set.
# Setup
For this problem set we are going to load several new packages: `scikit-learn`, `statsmodels`, and `pydataset`. We will use `scikit-learn`, and `statsmodels` for doing statistical analysis. Both of these packages contain much of the same functionality, but their focus is a little different. The `pydataset` contains datasets we will practice doing statistics with.
This problem set uses the Diabetes dataset from `scikit-learn` as an example.
```{python}
import pandas as pd
import matplotlib.pyplot as plt
from pydataset import data
from sklearn.datasets import load_diabetes
#set options to always display all columns in DataFrame
pd.set_option('display.max_columns', None)
# load diabetes data
diabetes_data = load_diabetes(as_frame=True)
# get documenation on diabetes data
print(diabetes_data.DESCR)
#extract data frame
diabetes = diabetes_data.frame
```
# What Question Are You Asking?
Whenever you sit down to do statistics, it is always a good idea to keep in mind what question you are trying to answer. In general, there are two types of questions statisticians can ask: questions of prediction and questions of inference.
## Questions of prediction
Questions of prediction are questions that ask us to predict the outcome of a situation we do not know. For example, we may want to know how a student will do on a final, based on their grades during the rest of the class. We may want to predict which basketball team match-ups will be most competitive, so we don't buy tickets to a boring game. A very common prediction question is what will the weather be like tomorrow, or a week from tomorrow. For the diabetes dataset, we are interested in predicting how quickly a person's diabetes will progress once diagnosed. This is particularly of interest because uncontrolled diabetes can have very severe health outcomes.
## Questions of inference
Questions of inference are questions about how the world works. For example, we may want to know what effect an increase in atmospheric CO~2~ has summer temperatures in the Central Valley, or the severity of fire season. This won't tell us anything about what to wear next week, but it will inform whether or not we decide to move to a different state. For the diabetes dataset, we want to know what are the most useful risk factors to look for when assessing whether a person is likely to develop type II diabetes.
## Spurious Inference
Flying Spaghetti Monster
# Summarizing the Dataset
Before we do anything too fancy, it can be helpful to get a sense of what type of data we have. Are the variables numerical or categorical? Are the numeric variables integers are or decimal numbers? What range of values do the variables take? These questions are because the they will affect type of analysis we can do. There are many more types of analysis we can do on numeric data than on categorical data, but we don't always have that option.
```{python}
print(diabetes.describe())
```
Looking at the data, we can see something a little bit peculiar about it. All of the variables except for `target` have the same mean and standard deviation. This is definitely not normal. Also the variables are not very well labeled. In particular, 6 of the variables are labelled `s1`-`s6`. Not very informative. Thankfully we can look at the documentation to get some explanations.
```{python}
# get documenation on diabetes data
print(diabetes_data.DESCR)
```
The documentation gives us a number of pieces of information. First, the variable `target` is a measure of disease severity one year after diagnosis. So we will probably focus on predicting that variable. Also, we have descriptions of all of the variables, so we can label them in a way that is more helpful. Finally, there is a note stating that all the variables except `target` have been "mean centered and scaled by the standard deviation". What that means is that the researchers subtracted the mean and divided by the standard deviation for each data point in each of the 10 'predictor' variables. Researchers do this to make it easier to compare effect sizes between variables.
```{python}
namedict = {'s1': 'totchol', 's2':'ldl', 's3':'hdl', 's4':'tch',
's5':'ltg', 's6':'glucose'}
diabetes=diabetes.rename(columns=namedict)
print(diabetes.head())
```
# Visualizing The Dataset
Whenever we have a new dataset to analyze, a good first step is visualizing the variables in the dataset. For the Diabetes dataset, we are specifically interested in disease severity so we will be using `target`, a measure of diabetes disease progression. Basically it tells us how bad a person's diabetes is. Higher numbers mean the diabetes is worse. Figure 1 shows every variable in the dataset plotted against our measure of disease progression in scatterplots.
```{python}
x_names = ['Age', 'Sex', 'BMI', 'Blood Pressure', 'Total Cholestral',
"LDL ('Bad Cholesteral')", "HDL ('Good Cholesteral')", 'Total/HDL',
'Log(Serum Triglycerides Level)', 'Blood Sugar Level']
ncol = 2
nrow = int(len(x_names)/ncol)
fig1, axs1 = plt.subplots(nrow, ncol, sharey=True, sharex=True,
figsize=(8,10))
for i, ax in enumerate(fig1.axes):
ax.scatter(diabetes.iloc[:, i], diabetes.target)
ax.set_xlabel(x_names[i])
if i % 2 == 0:
ax.set_ylabel('Disease Progression')
#Add Title
fig1.suptitle('Figure 1. Scatterplots of Diabetes Covariates')
#removing space between the plots horizontally
plt.subplots_adjust(wspace=0, top=0.95, bottom=0.15)
plt.show()
```
# Questions
For this problem set please use the `mtcars` dataset from the `pydatasets` package to answer the following questions.
1. Reading the Documentation: the `pydatasets` package does not have very good documetation for the the datasets it contains. Thankfully, `mtcars` is well documented elsewhere.
a. Find an alternate source of documetation for the `mtcars` dataset on the internet.
b. Describe each variable in the `mtcars` dataset, including the units and data range if it is a numeric variable.
2. Prediction vs. Inference
a. What is a question of prediction we could ask about fuel efficiency using the `mtcars` data set.
b. What is an inferential question we could ask about fuel efficiency using the `mtcars` data set.
3. Visualization: create a figure with at least 3 scatterplot subplots, showing how fuel efficiency (mpg) changes.