---
title: "p8105_hw1_tmc2184"
output: github_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

################################
# 09/10/2020
# Talea Cornelius
# Homework 1
################################

Problem .01: 
```{r}
# -create a public GitHub repo + local R Project; we suggest naming this repo / directory   
#  p8105_hw1_YOURUNI
# -create a single .Rmd file named p8105_hw1_YOURUNI.Rmd that renders to github_document
# -submit a link to your repo via Courseworks
```

https://github.com/taleac/p8105_hw1_tmc2184

###############################

Problem .02:

```{r}
# This “problem” focuses on correct styling for your solutions to Problems 1 and 2. We will look for:
# 
# -meaningful variable / object names
# -readable code (one command per line; adequate whitespace and indentation; etc)
# -clearly-written text to explain code and results
# -a lack of superfluous code (no unused variables are defined; no extra library calls; etc)
```


###############################

Problem 1:

```{r}
#First, load the tidyverse package:

library(tidyverse)
```

```{r}
# Next, create a data frame comprised of:
# 
# -a random sample of size 10 from a standard Normal distribution
# -a logical vector indicating whether elements of the sample are greater than 0
# -a character vector of length 10
# -a factor vector of length 10, with 3 different factor “levels”

p1_df = tibble(
  rand = rnorm(10, mean = 0, sd = 1),
  gt_zero = rand > 0,
  char = c("a", "b", "c", "d", "e",
           "f", "g", "h", "i", "j"),
  fact = factor(c("one", "two", "three", "one", "two",
                  "three", "one", "two", "three", "one"))
)

```

```{r}
#Try to take the mean of each variable in your dataframe. What works and what doesn't?

rand_mean = mean(pull(p1_df, rand))
log_mean = mean(pull(p1_df, gt_zero))
char_mean = mean(pull(p1_df, char))
fact_mean = mean(pull(p1_df, fact))
```
```{r}
rand_mean
log_mean
char_mean
fact_mean
```
The mean of the random sample from the standard Normal distribution is `r rand_mean`.
The mean of the logical vector indicating whether each value is greater than zero is `r log_mean`.

The means of the character and factor variables could not be computed: `mean()` returns `NA` with a warning because these variables are neither numeric nor logical. The logical vector works because R treats `FALSE` as 0 and `TRUE` as 1, so its mean is the proportion of values greater than zero.
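
A minimal sketch of this behavior, using small hypothetical vectors (the `toy_` names are illustrative and not part of the assignment data):

```{r}
# Hypothetical toy vectors, for illustration only
toy_logical = c(TRUE, FALSE, TRUE, TRUE)
toy_char = c("a", "b", "c", "d")
toy_fact = factor(c("one", "two", "three", "one"))

# Logical values are coerced to 1 (TRUE) and 0 (FALSE),
# so the mean is the proportion of TRUE values
mean(toy_logical)

# Character and factor input is neither numeric nor logical:
# mean() returns NA (the warning is suppressed here to keep the output short)
suppressWarnings(mean(toy_char))
suppressWarnings(mean(toy_fact))
```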


```{r, results = "hide", warning = FALSE}
#Write a code chunk that applies the as.numeric function to the logical, character, and factor variables (please show this chunk but not the output). What happens, and why? Does this help explain what happens when you try to take the mean?

p1_df$gt_zero_num <- as.numeric(p1_df$gt_zero)
p1_df$char_num <- as.numeric(p1_df$char)
p1_df$fact_num <- as.numeric(p1_df$fact)

```

This code converts the logical vector to 0 and 1 (representing FALSE and TRUE, respectively). It cannot convert the character variable to numbers and returns `NA` with a warning, which is consistent with the prior result (i.e., you cannot take the mean of something that is not a number). However, it does convert the factor to a numeric variable using the integer codes 1, 2, and 3. These codes follow the factor levels, which are sorted alphabetically by default (i.e., "one" = 1, "three" = 2, and "two" = 3). Note that `mean()` does not perform this conversion itself, which is why the mean of the factor is still `NA`. This shows that you need to check which numeric values are assigned when a factor has a meaningful order that differs from alphabetical order.
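
One way to control the integer codes is to pass explicit `levels` to `factor()`. The sketch below uses a hypothetical toy vector (`toy_words` and the other names are illustrative only) to compare the default alphabetical coding with an explicit level order:

```{r}
# Hypothetical toy vector, for illustration only
toy_words = c("one", "two", "three", "one")

# Default: levels are sorted alphabetically ("one", "three", "two")
default_fact = factor(toy_words)
as.numeric(default_fact)

# Explicit levels: the integer codes follow the order given in `levels`
explicit_fact = factor(toy_words, levels = c("one", "two", "three"))
as.numeric(explicit_fact)
```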


```{r}
# In a second code chunk:
# -convert the logical vector to numeric, and multiply the random sample by the result

p1_df$gt_zero_num * p1_df$rand

# -convert the logical vector to a factor, and multiply the random sample by the result

p1_df$gt_zero_fact <- as.factor(p1_df$gt_zero)
p1_df$gt_zero_fact * p1_df$rand

# -convert the logical vector to a factor and then convert the result to numeric, and multiply the random sample by the result

p1_df$gt_zero_fact_num <- as.numeric(p1_df$gt_zero_fact)
p1_df$gt_zero_fact_num * p1_df$rand

```
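
The three products differ because of how each conversion represents the logical values. As a rough sketch with hypothetical toy values (not the random sample above): the numeric conversion gives 0/1, arithmetic on a factor is not defined, and converting the factor to numeric returns the level codes 1 and 2 instead of 0 and 1.

```{r}
# Hypothetical toy values, for illustration only
toy_sample = c(-1.5, 0.5, 2)
toy_positive = toy_sample > 0

# Logical -> numeric gives 0/1, so non-positive values become 0
toy_sample * as.numeric(toy_positive)

# Factor: "*" is not meaningful for factors, so the result is NA
# (the warning is suppressed here to keep the output short)
suppressWarnings(toy_sample * as.factor(toy_positive))

# Factor -> numeric returns the level codes (FALSE = 1, TRUE = 2), not 0/1
toy_sample * as.numeric(as.factor(toy_positive))
```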

###############################

Problem 2:

```{r}
#install.packages("palmerpenguins")
data("penguins", package = "palmerpenguins")
```
```{r}
# Write a short description of the penguins dataset (not the penguins_raw dataset) using inline R code, including:
# 
# -the data in this dataset, including names / values of important variables
# -the size of the dataset (using nrow and ncol)
# -the mean flipper length

mean_flip <- mean(pull(penguins, flipper_length_mm), na.rm = TRUE)
```

The dataset "penguins" includes `r ncol(penguins)` variables and `r nrow(penguins)` observations. Specific variable names are `r names(penguins[,1])`, `r names(penguins[,2])`, `r names(penguins[,3])`, `r names(penguins[,4])`, `r names(penguins[,5])`, `r names(penguins[,6])`, `r names(penguins[,7])`, and `r names(penguins[,8])`.

`r names(penguins[,1])` is a factor with three levels: `r levels(penguins$species)`.
`r names(penguins[,2])` is a factor with three levels: `r levels(penguins$island)`.
`r names(penguins[,3])` is numeric, with a mean of `r mean(penguins$bill_length_mm, na.rm=TRUE)`.
`r names(penguins[,4])` is numeric, with a mean of `r mean(penguins$bill_depth_mm, na.rm=TRUE)`.
`r names(penguins[,5])` is numeric, with a mean of `r mean(penguins$flipper_length_mm, na.rm=TRUE)`.
`r names(penguins[,6])` is numeric, with a mean of `r mean(penguins$body_mass_g, na.rm=TRUE)`.
`r names(penguins[,7])` is a factor with two levels: `r levels(penguins$sex)`.
Finally, `r names(penguins[,8])` is numeric, with a mean of `r mean(penguins$year, na.rm=TRUE)`.

The mean flipper length in millimeters is `r round(mean_flip, digits=2)`.
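
The per-variable means above are written out one column at a time. As an alternative sketch, the numeric columns can be summarized in one step; the example below uses a hypothetical toy data frame (`toy_df` and its column names are made up for illustration), not the penguins data:

```{r}
# Hypothetical toy data frame standing in for a dataset with mixed column types
toy_df = data.frame(
  group = factor(c("a", "b", "a")),
  length_mm = c(10, NA, 14),
  mass_g = c(100, 200, 300)
)

# Flag the numeric columns, then take each mean while ignoring missing values
toy_is_numeric = sapply(toy_df, is.numeric)
toy_numeric_means = sapply(toy_df[toy_is_numeric], mean, na.rm = TRUE)
toy_numeric_means
```

The same pattern should carry over to any data frame, with `na.rm = TRUE` playing the same role as in the inline means above.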

```{r yx_scatter}
# Make a scatterplot of flipper_length_mm (y) vs bill_length_mm (x); color points using the species variable (adding color = ... inside of aes in your ggplot code should help).

ggplot(penguins, aes(x = bill_length_mm, y = flipper_length_mm, color = species)) + geom_point()

# Export your first scatterplot to your project directory using ggsave
#ggsave("scatter_plot.pdf", height = 4, width = 6) ##commented out to knit file
```