Skip to content

Latest commit

 

History

History
185 lines (140 loc) · 8.44 KB

p8105_hw1_zdz2101.md

File metadata and controls

185 lines (140 loc) · 8.44 KB

p8105_hw1_zdz2101

Zelos Zhu 9/18/2018

Problem 1

library(tidyverse)
problem1_df <- tibble(numeric_vec = runif(n = 10, min = 0, max = 5),
                      logical_vec = numeric_vec > 2,
                      char_vec = c("My", "name", "is", "Zelos", "Zhu", "and", "I", "like", "to", "eat"),
                      factor_vec = factor(  c( rep( c("small", "medium", "large"), 3), "x-large")  )
                      )
print(problem1_df)
## # A tibble: 10 x 4
##    numeric_vec logical_vec char_vec factor_vec
##          <dbl> <lgl>       <chr>    <fct>     
##  1       3.81  TRUE        My       small     
##  2       4.33  TRUE        name     medium    
##  3       1.17  FALSE       is       large     
##  4       0.417 FALSE       Zelos    small     
##  5       0.880 FALSE       Zhu      medium    
##  6       2.82  TRUE        and      large     
##  7       0.666 FALSE       I        small     
##  8       1.95  FALSE       like     medium    
##  9       4.87  TRUE        to       large     
## 10       2.83  TRUE        eat      x-large
str(problem1_df)
## Classes 'tbl_df', 'tbl' and 'data.frame':    10 obs. of  4 variables:
##  $ numeric_vec: num  3.813 4.332 1.172 0.417 0.88 ...
##  $ logical_vec: logi  TRUE TRUE FALSE FALSE FALSE TRUE ...
##  $ char_vec   : chr  "My" "name" "is" "Zelos" ...
##  $ factor_vec : Factor w/ 4 levels "large","medium",..: 3 2 1 3 2 1 3 2 1 4
#means <- apply(problem1_df, 2, mean) --- doesn't work
#Above line coerces all into NA, probably because there is non-numeric in it, good to know

means_prob1 <- c(mean(problem1_df$numeric_vec),
                 mean(problem1_df$logical_vec),
                 mean(problem1_df$char_vec),
                 mean(problem1_df$factor_vec))
## Warning in mean.default(problem1_df$char_vec): argument is not numeric or
## logical: returning NA

## Warning in mean.default(problem1_df$factor_vec): argument is not numeric or
## logical: returning NA
means_prob1
## [1] 2.374621 0.500000       NA       NA

The mean() function successfully worked for the numeric and logical vectors of the data frame while it produced NAs for the character and factor vectors. This makes sense: the mean of the first vector could be calculated since it is a numeric variable. It's exactly what we think it should be: the sum of the elements divided by the number of elements. As a result the mean is 2.37.

Interestingly enough, the second vector in the data frame could also have a mean and that is because R seems to automatically consider FALSE into 0 and TRUE into 1. So, taking a "mean" of a logical allows us to see the proportion of TRUE cases, similarly if we were to take the sum we would see the total counts of TRUE's.

As expected, a mean could not be taken from neither the character or factor varaibles and that is because character and factor variables cannot be used with traditional "mathematical" functions. As a result taking their means produce a warning and the value we get is NA.

as.numeric(problem1_df$logical_vec)
as.numeric(problem1_df$char_vec)
as.numeric(problem1_df$factor_vec)

The logical vector turned into a vector of 0's and 1's, the FALSEs being 0, the TRUEs being 1.

The character vector turned into a vector of NA's. Though I imagine if you wrote out numbers in character form, they can convert properly. For example, executing the line

as.numeric(c("1","2","3"))

would properly display 1,2,3.

The factor vector turns in a vector that consist of the numbers 1, 2, 3, 4. I'm assuming R assigns each level of a factor a number and when you use the as.numeric() function with it, it pulls these numbers.

as.numeric(as.factor(problem1_df$char_vec))
as.numeric(as.character(problem1_df$factor_vec))

When you convert the character vector into a factor, it turns into a vector of 10 levels because I had 10 unique values, in which case when as.numeric() is applied it turns into a scrambled mess of the numbers 1 to 10.

When you convert the factor vector into character and then attempt to use as.numeric() it displays a warning and we get a vector of NA's.

Problem 2

problem2_df <- tibble( x = rnorm(1000),
                       y = rnorm(1000),
                       is_sum_positive = (x + y) > 0,
                       num_coerce = as.numeric(is_sum_positive),
                       factor_coerce = as.factor(is_sum_positive))
print(head(problem2_df))
## # A tibble: 6 x 5
##         x      y is_sum_positive num_coerce factor_coerce
##     <dbl>  <dbl> <lgl>                <dbl> <fct>        
## 1 -0.0831 -0.128 FALSE                    0 FALSE        
## 2  0.218  -0.442 FALSE                    0 FALSE        
## 3 -0.361  -0.593 FALSE                    0 FALSE        
## 4  0.541  -0.171 TRUE                     1 TRUE         
## 5  0.986   1.40  TRUE                     1 TRUE         
## 6  0.0600 -0.452 FALSE                    0 FALSE
str(problem2_df)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1000 obs. of  5 variables:
##  $ x              : num  -0.0831 0.2181 -0.3608 0.5411 0.9862 ...
##  $ y              : num  -0.128 -0.442 -0.593 -0.171 1.4 ...
##  $ is_sum_positive: logi  FALSE FALSE FALSE TRUE TRUE FALSE ...
##  $ num_coerce     : num  0 0 0 1 1 0 1 0 0 1 ...
##  $ factor_coerce  : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 2 2 1 2 1 1 2 ...

The size of the dataset is 1000 rows (observations) and 5 columns (variables). The mean of x is 0. The median of x is 0.01. The proportion of cases for which the logical vector is TRUE is 0.5.

hw1plot_1 <- (ggplot(problem2_df, aes(x=x, y=y))
            + geom_point( aes(color = is_sum_positive))
            + ggtitle("Scatterplot of 1000 points based on normal distribution")
            + geom_abline(intercept = 0, slope = -1)
            + theme(plot.title = element_text(size = 10, face = "bold")))

hw1plot_2 <- (ggplot(problem2_df, aes(x=x, y=y))
            + geom_point( aes(color = num_coerce))
            + ggtitle("Scatterplot of 1000 points based on normal distribution")
            + geom_abline(intercept = 0, slope = -1)
            + theme(plot.title = element_text(size = 10, face = "bold")))

hw1plot_3 <- (ggplot(problem2_df, aes(x=x, y=y))
            + geom_point( aes(color = factor_coerce))
            + ggtitle("Scatterplot of 1000 points based on normal distribution")
            + geom_abline(intercept = 0, slope = -1)
            + theme(plot.title = element_text(size = 10, face = "bold")))

hw1plot_1

ggsave(filename = "hw1plot1.png")
## Saving 7 x 5 in image
hw1plot_2

hw1plot_3

#library(gridExtra) --- for my viewing pleasure when I was working on this so I don't have to scroll through in viewer
#grid.arrange(hw1plot_1, hw1plot_2, hw1plot_3, nrow=1)

The color arrangement seems to be the exact same for the logical and factor variables, each point has a distinct color allocated based on the discrete values TRUE and FALSE (in this case a light red and cyan).

MY hypothesis/guess on why this is, is that similar to how R automatically register logicals as 0/1's for mathematical functions, that in this setting, ggplot2 is written such that when logicals enter an aes, it automatically considers it a factor of 2 levels: TRUE/FALSE which have correspoding numbers to them, in this case 1 and 2. This would then allow R to have an index to pull certain "colors" from some sort of color vector. Otherwise with FALSE being 0, it would be difficult to use that as an index since R objects start indexing at 1. (Just my guess)

In regards to the second plot where we have the blue and black dots, it seems in this case our color is based on some sort of gradient, the range of which is defined as 0 to 1. In our case we only have exactly 0's and 1's because we coerced it from logical to numeric. As a result we have a concentration from each of the extremes of the gradient. Say if we plugged in x as our color we would see this entire black(low)-blue(high) gradient generated going left to right. The same could be said for if we plugged in y instead, but we would generate that gradient from bottom to top.