Commit e170fd9

Problem Sets 1-6 in rmarkdown
0 parents  commit e170fd9

10 files changed

+1312
-0
lines changed

PS0.Rmd

+71
@@ -0,0 +1,71 @@
1+
---
2+
title: "PS 0 - Summary Statistics"
3+
author: "Elise Hellwig"
4+
date: "`r Sys.Date()`"
5+
output: pdf_document
6+
urlcolor: blue
7+
---
8+
9+
```{r setup, include=FALSE}
10+
knitr::opts_chunk$set(echo = TRUE)
11+
```
12+
13+
# Housekeeping
14+
15+
Try to install the numpy and pandas modules at home. You won't need them for this problem set but you will need them for the next one.
16+
17+
# What is Statistics?
18+
19+
Watch [**What is Statistics**](https://www.youtube.com/watch?v=sxQaBpKfDRk&list=PL8dPuuaLjXtNM_Y-bUAhblSAdWRnmBUcr&index=2) and [**Mathematical Thinking**](https://www.youtube.com/watch?v=tN9Xl1AcSv8&list=PL8dPuuaLjXtNM_Y-bUAhblSAdWRnmBUcr&index=3) before answering the following questions.
20+
21+
1. What are the two meanings of the word statistics?
22+
2. What is the difference between descriptive and inferential statistics? Give an example of a scenario where you would use each.
23+
3. What do you think Mark Twain meant when he said "There are lies, damn lies, and statistics"?
24+
25+
# Summary Statistics
26+
27+
One purpose of statistics is to summarize large amounts of data with a few simple, easy-to-understand numbers. These numbers are called summary statistics, or sometimes descriptive statistics. You have probably already heard of some of them, such as the mean. Others, like the variance, may be less familiar. This problem set shows you how to calculate summary statistics and explains what each one is for.
28+
29+
You can use the following annual rainfall data for Sacramento, CA (in mm) from the past 12 years if you wish, or you can find your own.
30+
31+
rain = [503, 200, 668, 483, 690, 581, 208, 469, 156, 576, 445, 569]
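The list above is already valid Python. As a minimal sketch of a first look at it, using only built-ins, the snippet below counts the values and makes a sorted copy. The names `n_years` and `rain_sorted` are illustrative and not part of the assignment.

```python
# Annual rainfall totals (mm), copied from the problem set.
rain = [503, 200, 668, 483, 690, 581, 208, 469, 156, 576, 445, 569]

n_years = len(rain)          # how many annual totals the list holds
rain_sorted = sorted(rain)   # sorted copy; the original list keeps its order

print(n_years)
print(rain_sorted)
```

Checking the length and looking at a sorted copy is a quick way to confirm the data was entered as expected before computing anything from it.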
32+
33+
34+
## Measures of Central Tendency: Mean, Median and Mode
35+
36+
Watch [**Mean, Median and Mode: Measures of Central Tendency**](https://www.youtube.com/watch?v=kn83BA7cRNM&list=PL8dPuuaLjXtNM_Y-bUAhblSAdWRnmBUcr&index=4) before answering the following questions. Code for answering these questions should use only base Python, not the numpy or statistics modules.
37+
38+
4. The Mean
39+
a. Write out an equation, or procedure, for calculating the mean.
40+
b. Define and use a function that calculates the mean of a list of data.
41+
c. What types of data are well summarized by the mean?
42+
    d. What types of data can we not calculate the mean for?
43+
5. The Median
44+
a. Write out the procedure for finding the median of a list of numbers.
45+
    b. Define and use a function that calculates the median of a list of data.
46+
c. For what types of data is the median a useful summary statistic?
47+
d. When would we prefer to use the median instead of the mean? Give a specific example.
48+
6. The Mode
49+
a. Write out a procedure for finding the mode of a list of data.
50+
b. Define and use a function that calculates the mode of the list of data.
51+
c. What types of data is the mode most useful for?
52+
d. Are there any types of data where we can't use the mode?
53+
7. **EC:** Why would it be preferable to save data in millimeters rather than inches (or centimeters)?
54+
55+
## Measures of Spread
56+
57+
Watch [**Measures of Spread**](https://www.youtube.com/watch?v=R4yfNi_8Kqw&list=PL8dPuuaLjXtNM_Y-bUAhblSAdWRnmBUcr&index=6) before answering the following questions. Code for answering these questions should use only base Python, not the numpy or statistics modules.
58+
59+
8. What does a measure of spread tell us? What do we lose by leaving it out?
60+
61+
9. The Range
62+
a. Write a procedure to calculate the range.
63+
b. Define and use a function that calculates the range of a list of data.
64+
c. When is the range a good measure of spread to use? When would it be misleading?
65+
d. What is an alternative measure of spread that is less sensitive to this issue?
66+
67+
10. The Standard Deviation
68+
a. Write out the equation for the standard deviation for a list of numbers.
69+
b. Define and use a function to calculate the standard deviation of a list of data.
70+
c. Are there data types we can't calculate the standard deviation of?
71+
11. How might you measure the "spread" of a class of students' favorite colors?

PS1.Rmd

+71
@@ -0,0 +1,71 @@
1+
---
2+
title: "PS 1 - Introduction to Data in Python"
3+
author: "Elise Hellwig"
4+
date: "`r Sys.Date()`"
5+
output: pdf_document
6+
urlcolor: blue
7+
---
8+
9+
```{r setup, include=FALSE, echo=FALSE}
10+
11+
```
12+
13+
# Data Sources
14+
15+
The data you will be using for this problem set comes from the following sources.
16+
17+
* [**COVID-19 Time-Series Metrics by County and State**](https://data.chhs.ca.gov/dataset/covid-19-time-series-metrics-by-county-and-state) - Statewide COVID-19 Cases Deaths Tests
18+
19+
* [**NOAA Climate Data Online**](https://www.ncei.noaa.gov/cdo-web/search) - Daily Summaries - Station: Yosemite Park Headquarters, CA US (GHCND:USC00049855)
20+
21+
Citing your data sources is very important, because it is key for making your work reproducible.
22+
23+
24+
# Data Types
25+
As you saw in the last problem set, there is more than one type of data. Previously, we focused on the difference between numeric and non-numeric (categorical) data. However, depending on how you define them, there are four or five different types of data.
26+
27+
1. Define the following types of data and give an example of each
28+
a. nominal
29+
b. ordinal
30+
c. logical
31+
d. discrete
32+
e. continuous
33+
34+
2. Which of those 5 could be considered a special case of one of the others?
35+
36+
3. Under which data type does a date fall?
37+
38+
# Data Formats
39+
Data can come in a variety of formats, both in how the data is stored (e.g., file type) and in how the different variables are represented.
40+
41+
## File Types
42+
43+
Data is stored in a number of file formats. The most common one we will see is CSV, or comma-separated values. These are text files that use commas to separate the values of different variables. If you open your data in a text editor (like TextEdit or Notepad), as opposed to Excel or a similar program, you will see what this actually looks like.
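To make the "text file with commas" idea concrete, here is a minimal sketch that reads CSV text with Python's standard `csv` module. The column names and values in `csv_text` are made up for illustration and are not taken from the datasets above.

```python
import csv
import io

# Hypothetical CSV text: a header line, then one line per observation.
csv_text = "date,station,precip_mm\n2021-01-01,A,0.0\n2021-01-02,A,12.7\n"

# DictReader uses the first line as column names.
reader = csv.DictReader(io.StringIO(csv_text))
rows = list(reader)

print(reader.fieldnames)
print(rows[1]["precip_mm"])
```

Note that the `csv` module reads every value as a string, so numbers need an explicit conversion such as `float(...)` before you can do arithmetic with them.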
44+
45+
4. What are the benefits and drawbacks of CSVs?
46+
47+
5. List 4 other file formats that data could be stored in and any notable benefits or drawbacks of those.
48+
49+
## Tidy Data
50+
51+
### Rectangular Data
52+
Whenever you are working with data, you want to keep it in a "rectangular" format. In rectangular data, each column represents a variable (e.g., height, temperature, distance, color). Each row represents an observation, or the values of each of the variables under a particular set of conditions.
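One way to picture rectangular data in base Python is as a list of dictionaries, one per row. The sketch below assumes invented column names and values; it checks that every row has the same columns and then pulls out a single variable.

```python
# Hypothetical observations; names and values are invented for illustration.
observations = [
    {"date": "2021-01-01", "station": "A", "temp_c": 4.5, "precip_mm": 0.0},
    {"date": "2021-01-02", "station": "A", "temp_c": 6.0, "precip_mm": 12.7},
    {"date": "2021-01-01", "station": "B", "temp_c": 1.5, "precip_mm": 3.2},
]

# The data is rectangular if every row has exactly the same columns.
columns = list(observations[0].keys())
is_rectangular = all(list(row.keys()) == columns for row in observations)

# Reading down one column gives every observed value of one variable.
temps = [row["temp_c"] for row in observations]

print(is_rectangular)
print(temps)
```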
53+
54+
6. What are the variables in your dataset and what is each of their types?
55+
56+
7. What change in conditions does each observation represent?
57+
58+
### Types of Variables
59+
Generally speaking, there are two types of variables: key variables and value variables. Key variables tell you under what conditions the observation was taken. Value variables tell you what was actually observed. Both are independent of data type: a key variable can be any data type, as can a value variable.
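The distinction can be sketched in base Python as follows. The dataset, and the choice of `date` and `station` as key variables, are assumptions made for illustration only; your own dataset will have its own keys and values.

```python
# Hypothetical observations; names and values are invented for illustration.
observations = [
    {"date": "2021-01-01", "station": "A", "temp_c": 4.5, "precip_mm": 0.0},
    {"date": "2021-01-02", "station": "A", "temp_c": 6.0, "precip_mm": 12.7},
    {"date": "2021-01-01", "station": "B", "temp_c": 1.5, "precip_mm": 3.2},
]

key_vars = ["date", "station"]         # conditions of each observation
value_vars = ["temp_c", "precip_mm"]   # what was measured under them

# Each combination of key values should pick out exactly one observation.
keys = [tuple(row[k] for k in key_vars) for row in observations]
keys_are_unique = len(set(keys)) == len(keys)

# Look up the measured values by their conditions.
lookup = {
    tuple(row[k] for k in key_vars): {v: row[v] for v in value_vars}
    for row in observations
}

print(keys_are_unique)
print(lookup[("2021-01-02", "A")])
```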
60+
61+
62+
8. What are the key variables in your dataset?
63+
64+
9. What are the value variables in your dataset?
65+
66+
10. There is a special type of variable called an ID or identifier variable. What do you think the purpose of this variable is?
67+
68+
11. Does your dataset have an ID variable?
69+
70+
### A Note on (In)dependence
71+
In statistics we will be talking a lot about independent and dependent variables. Sometimes these correspond to key variables and value variables respectively, but not always. Additionally, with observational data (what we are using, as opposed to experimental data), it can be difficult to tell which way the arrow of causality is pointing. This means that while we may be able to say that one variable predicts another variable, knowing which one is the "independent" variable may be more difficult.