diff --git a/04-tidyr.md b/04-tidyr.md index 5bebd0efc..77ebb91ba 100644 --- a/04-tidyr.md +++ b/04-tidyr.md @@ -118,16 +118,16 @@ interviews %>% # A tibble: 10 × 4 key_ID village interview_date instanceID - 1 46 Chirodzo 2016-11-17 00:00:00 uuid:35f297e0-aa5d-4149-9b7b-4965004cfc37 - 2 43 Chirodzo 2016-11-17 00:00:00 uuid:b4dff49f-ef27-40e5-a9d1-acf287b47358 - 3 67 Chirodzo 2016-11-16 00:00:00 uuid:6c15d667-2860-47e3-a5e7-7f679271e419 - 4 199 Chirodzo 2017-06-04 00:00:00 uuid:ffc83162-ff24-4a87-8709-eff17abc0b3b - 5 9 Chirodzo 2016-11-16 00:00:00 uuid:846103d2-b1db-4055-b502-9cd510bb7b37 - 6 56 Chirodzo 2016-11-16 00:00:00 uuid:973c4ac6-f887-48e7-aeaf-4476f2cfab76 - 7 54 Chirodzo 2016-11-16 00:00:00 uuid:273ab27f-9be3-4f3b-83c9-d3e1592de919 - 8 45 Chirodzo 2016-11-17 00:00:00 uuid:e3554d22-35b1-4fb9-b386-dd5866ad5792 - 9 58 Chirodzo 2016-11-16 00:00:00 uuid:a7a3451f-cd0d-4027-82d9-8dcd1234fcca -10 57 Chirodzo 2016-11-16 00:00:00 uuid:a7184e55-0615-492d-9835-8f44f3b03a71 + 1 64 Chirodzo 2016-11-16 00:00:00 uuid:28cfd718-bf62-4d90-8100-55fafbe45d06 + 2 36 Chirodzo 2016-11-17 00:00:00 uuid:c90eade0-1148-4a12-8c0e-6387a36f45b1 + 3 34 Chirodzo 2016-11-17 00:00:00 uuid:14c78c45-a7cc-4b2a-b765-17c82b43feb4 + 4 21 Chirodzo 2016-11-16 00:00:00 uuid:cc7f75c5-d13e-43f3-97e5-4f4c03cb4b12 + 5 46 Chirodzo 2016-11-17 00:00:00 uuid:35f297e0-aa5d-4149-9b7b-4965004cfc37 + 6 54 Chirodzo 2016-11-16 00:00:00 uuid:273ab27f-9be3-4f3b-83c9-d3e1592de919 + 7 69 Chirodzo 2016-11-16 00:00:00 uuid:f86933a5-12b8-4427-b821-43c5b039401d + 8 66 Chirodzo 2016-11-16 00:00:00 uuid:a457eab8-971b-4417-a971-2e55b8702816 + 9 61 Chirodzo 2016-11-16 00:00:00 uuid:2401cf50-8859-44d9-bd14-1bf9128766f2 +10 200 Chirodzo 2017-06-04 00:00:00 uuid:aa77a0d7-7142-41c8-b494-483a5b68d8a7 ``` We notice that the layout or format of the `interviews` data is in a format that @@ -367,13 +367,13 @@ other with "solar panel" in the `items_owned` column. 
separate_rows(items_owned, sep = ";") %>% ``` -You may notice that one of the columns is called `´NA´`. This is because some -of the respondents did not own any of the items that was in the interviewer's -list. We can use the `replace_na()` function to change these `NA` values to -something more meaningful. The `replace_na()` function expects for you to give -it a `list()` of columns that you would like to replace the `NA` values in, -and the value that you would like to replace the `NA`s. This ends up looking -like this: +You may notice that the `items_owned` column contains `NA` values. +This is because some of the respondents did not own any of the items that were in +the interviewer's list. We can use the `replace_na()` function to change these +`NA` values to something more meaningful. The `replace_na()` function expects +you to give it a `list()` of the columns in which you would like to replace the `NA` +values, and the value that you would like to replace the `NA`s with. This ends up +looking like this: ```r diff --git a/data-visualisation-handout.md b/data-visualisation-handout.md new file mode 100644 index 000000000..d51c90a9b --- /dev/null +++ b/data-visualisation-handout.md @@ -0,0 +1,293 @@ +--- +title: Code Handout - Data Visualisation with ggplot2 +output: + html_document: + df_print: paged + code_download: yes +--- + +This document contains all of the functions that we have covered thus far in the +course. It will be updated every week, after we've added new skills. Each +function is presented alongside an example of how it is used. + +All of the examples below are in the context of the Palmer Penguins, found +[here (link)](https://allisonhorst.github.io/palmerpenguins/index.html). 
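The examples in this handout assume that the plotting functions and the `penguins` data are already available in the session. The block below is a minimal setup sketch, assuming the `tidyverse` and `palmerpenguins` packages are installed; it mirrors the setup shown in the data wrangling handout rather than anything stated in this one.

```r
## Assumed setup (not shown in this handout): load the plotting tools and the data
library(tidyverse)
library(palmerpenguins)

## Confirm the penguins data are available before plotting
glimpse(penguins)
```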
+ + + +## Foundations of `ggplot()` + +- `ggplot()` -- a function to create the shell of a visualization, where + specific variables are mapped to different aspects of the plot + + +```r +penguins %>% + ggplot(aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +``` + +- `aes()` -- aesthetics that can be used when creating a `ggplot()`, where the + aesthetics can either be hard coded (e.g. `color = "blue"`) or associated with + a variable (e.g. `color = sex`). + + - The following are the aesthetic options for *most* plots: + - `x` + - `y` + - `alpha` -- changes transparency + - `color` -- produces colored outline + - `fill` -- fills with color + - `group` -- used with categorical variables, similar to color + +- **`+`** -- an important aspect creating a `ggplot()` is to note that the + `geom_XXX()` function is separated from the `ggplot()` function with a plus + sign, `+`. + + - `ggplot()` plots are constructed in series of layers, where the plus sign + separates these layers. + - Generally, the `+` sign can be thought of as the end of a line, so you + should always hit enter/return after it. While it is not mandatory to move + to the next line for each layer, doing so makes the code a lot easier to + organize and read. + + +```r +penguins %>% + ggplot(aes(x = bill_length_mm, y = bill_depth_mm, color = species)) + + geom_point() +``` + +## Geometric Objects to Visualize the Data + +- `geom_histogram( )` -- adds a histogram to the plot, + where the observations are binned into ranges of values and then frequencies + of observations are plotted on the y-axis + - You can specify the number of bins you want with the `bins` argument + + +```r +penguins %>% + ggplot(aes(x = bill_length_mm)) + + geom_histogram(bins = 20) +``` + +- `geom_boxplot( )` -- adds a boxplot to the plot, where observations are + aggregated (summarized), the min, Q1, median, Q3, and maximum are plotted as the + box and whiskers, and "outliers" are plotted as points. 
+ - You can plot a horizontal boxplot by specifying the `x` variable, or a + vertical boxplot by specifying the `y` variable. + - Note: the min and max may not be included in the whiskers, if they are + deemed to be "outliers" based on the $1.5 \times \text{IQR}$ rule. + + +```r +## Horizontal boxplot +penguins %>% + ggplot(aes(x = bill_length_mm)) + + geom_boxplot() + +## Vertical boxplot +penguins %>% + ggplot(aes(y = bill_length_mm)) + + geom_boxplot() +``` + +- `geom_density()` -- adds a density curve to the plot, where the probability + density is plotted on the y-axis (so the density curve has a total area of one). + - By default this creates a density curve without shading. By specifying a + color in the `fill` argument, the density curve is shaded. + - Can be thought of as the "one group" violin plot! + + +```r +penguins %>% + ggplot(aes(x = bill_length_mm)) + + geom_density(fill = "tomato") +``` + +- `geom_violin()` -- plots violins for each level of a categorical variable + - Can be thought of as a hybrid mix of `geom_boxplot()` and `geom_density()`, + as the density is displayed, but it is reflected to provide a plot similar in + nature to a boxplot. + - To obtain violins stacked vertically, declare the categorical variable as `y`. + To obtain side-by-side violins, declare the categorical variable as `x`. + + +```r +## Stacked vertically +penguins %>% + ggplot(aes(x = bill_length_mm, y = species)) + + geom_violin() + +## Side-by-side +penguins %>% + ggplot(aes(y = bill_length_mm, x = species)) + + geom_violin() +``` + +- `geom_bar()` -- creates a barchart of a categorical variable + - Can produce stacked barcharts by specifying a variable as the `fill` + aesthetic. + - Can change from stacked barchart to a side-by-side barchart by specifying + `position = "dodge"`. + - If your data are already in counts (e.g. output from `count()`), then you + can specify the `stat = "identity"` argument inside `geom_bar()`. 
+ + +```r +## Stacked barchart +penguins %>% + ggplot(aes(x = species)) + + geom_bar(aes(fill = sex)) + +## Side-by-side barchart +penguins %>% + ggplot(aes(x = species)) + + geom_bar(aes(fill = sex), + position = "dodge") + +## If data are raw counts +penguins %>% + count(species, sex) %>% + ggplot(aes(x = species, y = n)) + + geom_bar(aes(fill = sex), + stat = "identity", + position = "dodge") +``` + +- `geom_point()` -- plots each observation as an (x, y) point, used to create + scatterplots + - Can use `alpha` to increase the transparency of the points, to reduce + overplotting. + - Can specify `aes`thetics inside of `geom_point()` for local aesthetics (point + level) or inside of `ggplot()` for global aesthetics (plot level) + + +```r +penguins %>% + ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) + + geom_point(aes(color = species)) +``` + +- `geom_jitter()` -- plots each observation as an (x, y) point and adds a small + amount of jitter around the point + - Useful so that we can see each point in the locations where there are + overlapping points. + - Can specify the `width` and `height` of the jittering using the optional + arguments. + + +```r +penguins %>% + ggplot(aes(y = body_mass_g, x = species)) + + geom_violin() + + geom_jitter(aes(color = sex), width = 0.25, height = 0.25) +``` + +- `geom_smooth()` -- plots a line over a set of points, draws the readers eye + to a specific trend + - The methods we will use are "lm" for a linear model (straight line), and + "loess" for a wiggly line + - By default, the smoother gives you gray SE bars, to remove these add + `se = FALSE` + + +```r +penguins %>% + ggplot(aes(x = bill_length_mm, y = bill_depth_mm, color = species)) + + geom_point() + + geom_smooth(method = "lm") +``` + +- `facet_wrap()` -- creates subplots of your original plot, based on the levels + of the variable you input + - To facet by one variable, use `~variable`. + - To facet by two variables, use `variable1 ~ variable2`. 
+ - If you prefer for your facets to be organized in rows or columns, use the + `nrow` and/or `ncol` arguments. + + +```r +penguins %>% + ggplot(aes(x = bill_length_mm, y = bill_depth_mm, color = species)) + + geom_point() + + geom_smooth(method = "lm") + + facet_wrap(~island, nrow = 1) +``` + +## Plot Characteristics + +- `labs()` -- specifies the plot labels, possible labels are: x, y, color, fill, + title, and subtitle + + +```r +penguins %>% + ggplot(aes(x = bill_length_mm, y = bill_depth_mm, color = species)) + + geom_point() + + geom_smooth(method = "lm") + + labs(x = "Bill Length (mm)", + y = "Bill Depth (mm)", + color = "Penguin Species") +``` + +- `theme_bw()` -- changes the plotting background to the classic dark-on-light + ggplot2 theme. + - This theme may work better for presentations displayed with a projector. + - Other theme options are `theme_minimal()`, `theme_light()`, and `theme_void()`. + + +```r +penguins %>% + ggplot(aes(x = bill_length_mm, y = bill_depth_mm, color = species)) + + geom_point() + + geom_smooth(method = "lm") + + labs(x = "Bill Length (mm)", + y = "Bill Depth (mm)", + color = "Penguin Species") + + theme_bw() +``` + +- `theme()` -- + - Possible options are: + - `panel.grid` -- controls the grid lines (`panel.grid = element_blank()` + removes grid lines) + - `text` -- specifies font size for the entire plot (e.g. 
+ `text = element_text(size = 16)`) + - `axis.text.x` -- specifies the font size for the x-axis text + - `axis.text.y` -- specifies the font size for the y-axis text + - `plot.title` -- specifies aspects of the plot title, can use + `plot.title = element_text(hjust = 0.5)` to centre the title + + +```r +penguins %>% + ggplot(aes(x = bill_length_mm, y = bill_depth_mm, color = species)) + + geom_point() + + geom_smooth(method = "lm") + + labs(x = "Bill Length (mm)", + y = "Bill Depth (mm)", + color = "Penguin Species") + + theme_bw() + + theme(axis.text.x = element_text(size = 12), + axis.text.y = element_text(size = 12)) +``` + +## Exporting Plots + +- `ggsave()` -- convenient function for saving a plot + - Unless specified, defaults to the last plot that was made. + - Uses the size of the current graphics device to determine the size of the + plot. + + +```r +plot1 <- penguins %>% + ggplot(aes(x = bill_length_mm, y = bill_depth_mm, color = species)) + + geom_point() + + geom_smooth(method = "lm") + + facet_wrap(~island, nrow = 1) + +## The file name (including its folder) goes in the filename argument +ggsave(filename = "images/faceted_plot.png", plot = plot1) +``` + + diff --git a/data-wrangling-handout.md b/data-wrangling-handout.md new file mode 100644 index 000000000..533da1cd2 --- /dev/null +++ b/data-wrangling-handout.md @@ -0,0 +1,284 @@ +--- +title: Code Handout - Data Wrangling with dplyr & tidyr +output: + html_document: + df_print: paged + code_download: yes +--- + + + +This document contains all of the functions that we have covered thus far in the +course. It will be updated every week, after we've added new skills. Each +function is presented alongside an example of how it is used. + +All of the examples below are in the context of the Palmer Penguins, found +[here (link)](https://allisonhorst.github.io/palmerpenguins/index.html). 
+ +## Packages + +- `library()` -- loads packages into your `R` session + + +```r +library(tidyverse) +library(palmerpenguins) +``` + +## Inspecting Data + +- `glimpse()` -- shows a summary of the dataset, the number of rows and columns, + variable names, and the first 10 entries of each variable + + +```r +glimpse(penguins) +``` + +## Working with Data + +- `<-` -- "assignment arrow", assigns a value (vector, dataframe, single value) + to the name of a variable + + +```r +penguins_2007 <- penguins %>% + filter(year == 2007) +``` + +- `c()` -- the "concatenate" function combines inputs to form a vector, the + values have to be the same data type. + + +```r +cat_variables <- c("Species", "Island", "Sex") +``` + +\\newpage + +## Verbs of Data Wrangling + +- `select()` -- selects variables (columns) from a dataframe + + +```r +penguins %>% +select(species) +``` + +- `filter()` -- filters observations (rows) out of / into a dataframe, where + the inputs (arguments) are the conditions to be satisfied in the data that are + kept + + +```r +## It's nice to have a new line for each condition, so your code is easier to read! +penguins %>% +filter(species == "Adelie", + body_mass_g > 3000, + year == 2008) +``` + +**Logical operators:** Filtering for certain observations (e.g. flights from a +particular airport) is often of interest in data frames where we might want to +examine observations with certain characteristics separately from the rest of +the data. To do so, you can use the `filter` function and a series of **logical +operators**. 
The most commonly used logical operators for data analysis are as +follows: + +- `==` means "equal to" + +- `!=` means "not equal to" + +- `>` or `<` means "greater than" or "less than" + +- `>=` or `<=` means "greater than or equal to" or "less than or equal to" + +- `mutate()` -- creates new variables or modifies existing variables + + +```r +penguins %>% + filter(is.na(bill_length_mm) != TRUE, + is.na(bill_depth_mm) != TRUE) %>% + mutate(body_mass_kg = body_mass_g / 1000) +``` + +- `group_by()` -- groups the dataframe based on levels of a categorical variable, + usually used alongside `summarize()` + + +```r +penguins %>% + group_by(island) +``` + +- `summarize()` -- creates data summaries of variables in a dataframe, for + grouped summaries use alongside `group_by()` + + +```r +penguins %>% + filter(is.na(body_mass_g) != TRUE) %>% + group_by(island) %>% + summarize(mean_mass = mean(body_mass_g)) +``` + +- `ungroup()` -- removes the grouping of a dataframe, typically used after group + summaries when additional ungrouped operations are required + + +```r +penguins %>% + filter(is.na(body_mass_g) != TRUE) %>% + group_by(island) %>% + summarize(mean_mass = mean(body_mass_g)) %>% + ungroup() +``` + +- `arrange()` -- orders a dataframe based on the values of a numerical variable, + paired with `desc()` to order in descending order + + +```r +penguins %>% + filter(is.na(body_mass_g) != TRUE) %>% + group_by(island) %>% + summarize(mean_mass = mean(body_mass_g)) %>% + arrange(desc(mean_mass)) +``` + +- `%>%` -- the "pipe" operator, joins sequences of data wrangling steps together, + works with any function that has `data = ` as the first argument + + +```r +penguins %>% + select(species, island, body_mass_g, sex, year) %>% + filter(island == "Torgersen", + is.na(body_mass_g) != TRUE) %>% + group_by(species, year) %>% + summarize(mean_mass = mean(body_mass_g), + median_mass = median(body_mass_g), + observations = n()) %>% + arrange(desc(mean_mass)) +``` + +## Other Data 
Wrangling Tools + +- `count()` -- counts the number of observations (rows) of the different levels + of a categorical variable + - can add `sort = TRUE` to sort the table in descending order (similar to + using `arrange(desc())` ) + + +```r +penguins %>% + count(species) +``` + +- `mean()` -- finds the mean of a numerical variable, not resistant to `NA` values, + so either filter them out beforehand or use the `na.rm = TRUE` argument + + - Other summary functions include: + - `var()` -- finds the variance of a numerical variable + - `sd()` -- finds the standard deviation of a numerical variable + - `IQR()` -- finds the interquartile range (Q3 - Q1) of a numerical variable + - `median()` -- finds the median of a numerical variable + +- `is.na()` -- returns a vector of `TRUE` and `FALSE` values corresponding to + whether a particular row of a variable was `NA` (missing) + + +```r +penguins %>% + mutate(missing_weight = is.na(body_mass_g)) +``` + +- `sample_n()` -- selects $n$ rows from the dataframe, based on the value of + `size` specified + + +```r +penguins %>% + sample_n(size = 10) +``` + +- `replace_na()` -- replaces NA values with the value specified + - The columns and their replacement values must be passed to the function + (input) as a `list()` object. + + +```r +## The replacement must match the column's data type, so sex (a factor) is +## converted to a character variable before its NA values are replaced +penguins %>% + mutate(sex = as.character(sex)) %>% + replace_na(list(sex = "unknown")) %>% + glimpse() +``` + +- `separate_rows()` -- separates a variable with multiple values based on the + delimiter specified. + + - Variables whose entries are stored as a list with commas or semicolons are + great candidates for this function! 
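The penguins data have no delimited column, so the sketch below illustrates `separate_rows()` on a small made-up table instead. The `households` tibble and its values are hypothetical and only loosely modeled on the `items_owned` column from the lesson.

```r
library(tidyverse)

## Hypothetical toy data (not from the lesson): one row per household, with
## every item owned stored in a single semicolon-delimited string
households <- tibble(key_ID = c(1, 2),
                     items_owned = c("bicycle;radio", "solar_panel"))

## Each delimited value gets its own row, and key_ID is repeated as needed
long_items <- households %>%
  separate_rows(items_owned, sep = ";")

long_items
```

The household with two items now occupies two rows, which is the shape that `count()` or `pivot_wider()` expects when working with one item at a time.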
+ +- `rowSums()` -- forms row sums for numeric variables + + - Note: In the lesson `rowSums()` was used on a `logical` variable, because + logical values can be numerically represented as 0 (FALSE) and 1 (TRUE) + + +```r +x <- tibble(x1 = 3, x2 = c(4:1, 2:5)) +rowSums(x) +``` + +## Pivoting Dataframes + +- `pivot_wider()` -- transforms a dataframe from long to wide format + - takes three principal arguments: + 1. the data + 2. the *names\_from* column variable whose values will become new column names + 3. the *values\_from* column variable whose values will fill the new column + variables. + - Further arguments include `values_fill` which, if set, fills in missing + values with the value provided. + + +```r +wide <- penguins %>% + mutate(island_logical = TRUE) %>% + pivot_wider(names_from = species, + values_from = island_logical, + values_fill = list(island_logical = FALSE)) + +glimpse(wide) +``` + +- `pivot_longer()` -- transforms a dataframe from wide to long format + - takes four principal arguments: + 1. the data + 2. *cols* are the names of the columns we use to fill the new values variable + (or to drop). + 3. the *names\_to* column variable we wish to create from the *cols* provided. + 4. the *values\_to* column variable we wish to create and fill with values + associated with the *cols* provided. + + +```r +## Name each species column explicitly, so the result does not depend on +## the order in which pivot_wider() created the columns +wide %>% + pivot_longer(cols = c(Adelie, Chinstrap, Gentoo), + names_to = "species", + values_to = "island_logical") +``` + +## Extracting Data + +- `write_csv()` -- writes a dataframe to a csv file, output into the file path + specified + + +```r +write_csv(wide, file = "data/penguins_wide.csv") +``` + + diff --git a/intro-R-handout.md b/intro-R-handout.md new file mode 100644 index 000000000..43d28f907 --- /dev/null +++ b/intro-R-handout.md @@ -0,0 +1,212 @@ +--- +title: Code Handout - Introduction to R +output: md_document +--- + + + +This document contains all of the functions that were covered in the +*Introduction to R* workshop. 
Each function is presented alongside an example of +how it can be used. + +## Creating Objects + +- `<-` -- "assignment arrow", assigns a value (vector, dataframe, single value) + to the name of a variable + + +```r +x <- 3 +y <- c(1, 2, 3) +z <- x + y +``` + +- `c()` -- the "concatenate" function combines inputs to form a vector, the + values have to be the same data type. + + +```r +animals <- c("bird", "cat", "dog") +numbers <- c(1, 14, 57, 89) +logicals <- c(TRUE, FALSE, TRUE, TRUE) +``` + +## Inspecting Objects + +- `str()` -- compact display of the structure of an R object + + +```r +str(animals) +``` + +- `class()` -- returns the type of element of any R object + + +```r +class(logicals) +``` + +- `typeof()` -- returns the data type or storage mode of any R object + + +```r +typeof(numbers) +``` + +## Functions in R + +- `args()` -- returns the arguments of a function + + +```r +args(round) +``` + +- named arguments -- the name of the argument the function expects + - You can choose to not name your arguments, **if** you know the **exact** + order they should be in! + - However, we generally discourage this. + + +```r +## Either of these work, since the digits argument is named explicitly. +round(3.14159, digits = 2) +round(digits = 2, 3.14159) + +## This runs but does not give the intended result, since the arguments are not +## named and are in the incorrect order (2 is rounded, not 3.14159). +round(2, 3.14159) +``` + +## Functions to Summarize Data + +- `sqrt()` -- returns the square root of a numeric variable + + +```r +sqrt(numbers) +``` + +- `mean()` -- returns the mean of a numeric variable + - You can add the `na.rm` argument, to remove `NA` values before calculating + the mean. + + +```r +mean(numbers) +``` + +- `max()` -- returns the maximum of a numeric variable + - You can add the `na.rm` argument, to remove `NA` values before calculating + the max. 
+ + +```r +max(numbers) +``` + +- `sum()` -- returns the sum of a numeric variable + - You can add the `na.rm` argument, to remove `NA` values before calculating + the sum. + + +```r +sum(numbers) +``` + +- `length()` -- returns the length of a vector (of any datatype) + + +```r +length(animals) +``` + +## Subsetting Data + +- `[]` -- used to subset elements from a vector + + +```r +animals[3] +## selects the third element + +animals[2:3] +## selects the second and third element + +animals[c(1, 3)] +## selects the first and third element +``` + +- relational operators -- return logical values indicating where a relation is + satisfied. The most commonly used relational operators for data analysis are as follows: + - `==` means "equal to" + - `!=` means "not equal to" + - `>` or `<` means "greater than" or "less than" + - `>=` or `<=` means "greater than or equal to" or "less than or equal to" + + +```r +animals == "dog" + +animals != "cat" + +numbers > 4 + +numbers <= 12 +``` + +- logical operators -- join subset criteria together + - `&` means "and" -- where two criteria must **both** be satisfied + - `|` means "or" -- where at least one criterion must be satisfied + + +```r +numbers > 4 & numbers < 20 + +animals == "dog" | animals == "cat" +``` + +- `%in%` -- the "inclusion operator", allows you to test if any of the elements + of a search vector (on the left hand side) are found in the target vector (on + the right hand side). + - The levels of the target vector must be included in a vector (`c()`). + + +```r +possessions <- c("car", "bicycle", "radio", "television", "mobile_phone") + +possessions %in% c("car", "bicycle", "motorcycle") +``` + +## Missing Data + +- `is.na()` -- returns a vector of logical values indicating which elements of + a vector have `NA` values + - Often combined with `!`, where the `!` negates the previous statement (e.g. + `!TRUE` is equal to `FALSE`). 
+ + +```r +missing <- c(1, 3, NA, 7, 12, NA) + +is.na(missing) + +!is.na(missing) +``` + +- `na.omit()` -- removes the observations with `NA` values + + +```r +na.omit(missing) +``` + +- `complete.cases()` -- returns a vector of logical values indicating which + elements of a vector **are not** missing (`NA`) values + + +```r +complete.cases(missing) +``` + + diff --git a/md5sum.txt b/md5sum.txt index 6618ddeba..28deb4b08 100644 --- a/md5sum.txt +++ b/md5sum.txt @@ -7,11 +7,15 @@ "episodes/01-intro-to-r.Rmd" "23d7135e2cc8d6412e87ca12e4a22446" "site/built/01-intro-to-r.md" "2023-07-10" "episodes/02-starting-with-data.Rmd" "1cc16a683b4fdbfa5a5c2e85f3bb324b" "site/built/02-starting-with-data.md" "2023-07-10" "episodes/03-dplyr.Rmd" "1afb9c880a904ce10789ec6a607922a6" "site/built/03-dplyr.md" "2023-07-10" -"episodes/04-tidyr.Rmd" "b630435fafa0871e975cee0cb0557594" "site/built/04-tidyr.md" "2023-07-10" +"episodes/04-tidyr.Rmd" "b4fe0fcd6f363a72ddd98492fe8e77e3" "site/built/04-tidyr.md" "2023-07-25" "episodes/05-ggplot2.Rmd" "f5eab52ad1a54fec7d1abaab1ba53bee" "site/built/05-ggplot2.md" "2023-07-10" "episodes/06-rmarkdown.Rmd" "f547d49b8f79fbcc6fbe8a78e512df61" "site/built/06-rmarkdown.md" "2023-07-10" "episodes/07-json.Rmd" "e5aeddc239ecdb8549653c16e38823f9" "site/built/07-json.md" "2023-07-10" +"instructors/data-visualisation-handout.Rmd" "af73107d4c87c96857e7d1a803f12013" "site/built/data-visualisation-handout.md" "2023-07-25" +"instructors/data-wrangling-handout.Rmd" "03ecc7a97f4207b7bc80c78ecbcfcb69" "site/built/data-wrangling-handout.md" "2023-07-25" "instructors/instructor-notes.md" "10d2ddec20a1ef8f2537f4002342f6ac" "site/built/instructor-notes.md" "2023-07-10" +"instructors/intro-R-handout.Rmd" "3648cf0b516f4c68c6729e188d8f3667" "site/built/intro-R-handout.md" "2023-07-25" +"instructors/starting-with-data-handout.Rmd" "b7deeaa71b74f89ae011c8159e77b36a" "site/built/starting-with-data-handout.md" "2023-07-25" "learners/reference.md" 
"84ec59e74313499cccfab60a73fd0b13" "site/built/reference.md" "2023-07-10" "learners/R-handout.Rmd" "3eb70f6b3d7998e1109e7ad91c822fd8" "site/built/R-handout.md" "2023-07-11" "learners/setup.md" "b879b810b382d3df882dc7d521398454" "site/built/setup.md" "2023-07-10" diff --git a/starting-with-data-handout.md b/starting-with-data-handout.md new file mode 100644 index 000000000..8d2c98944 --- /dev/null +++ b/starting-with-data-handout.md @@ -0,0 +1,240 @@ +--- +title: Code Handout - Starting with Data +output: + html_document: + df_print: paged + code_download: yes +--- + + + +This document contains all of the functions that were covered in the +*Introduction to R* workshop. Each function is presented alongside an example of +how it can be used. + +All of the examples below are in the context of the Palmer Penguins, found +[here (link)](https://allisonhorst.github.io/palmerpenguins/index.html). + +## Packages + +- `library()` -- loads packages into your `R` session + + +```r +library(tidyverse) +library(lubridate) +``` + +## Importing Data + +- `read_csv()` -- function to import a csv file. + - First argument is the path to the data, passed as a character + (inside quotations). + - You can specify what values should be considered missing, using the `na` + argument. 
+ + +```r +penguins <- read_csv("data/penguins.csv") +``` + +## Inspecting Data + +- `dim()` - returns a vector with the number of rows as the first element, + and the number of columns as the second element (the **dim**ensions of + the object) + + +```r +dim(penguins) +``` + +- `nrow()` - returns the number of rows +- `ncol()` - returns the number of columns + + +```r +nrow(penguins) +ncol(penguins) +``` + +- `head()` - displays the first 6 rows of the dataframe +- `tail()` - displays the last 6 rows of the dataframe + + +```r +head(penguins) +tail(penguins) +``` + +- `names()` - returns the names of an object (for a dataframe, these are the + column names) +- `colnames()` - returns column names for dataframes (without row names) + + +```r +names(penguins) +colnames(penguins) +``` + +- `glimpse()` - provides a preview of the data, where column names are presented + with their associated data types, and the entries from each column are printed + in each row + + +```r +glimpse(penguins) +``` + +- `str()` - returns the structure of the object and information about the class, + the names and data types of each column, and a preview of the first entries of + each column + + +```r +str(penguins) +``` + +- `summary()` - provides summary statistics for each column + - Note: summary statistics for character variables are not meaningful, as they + simply state the number of observations (length) of the variable + + +```r +summary(penguins) +``` + +## Subsetting Data + +- `[]` -- selects rows and columns from a dataframe + - The first entry is the row number, the second entry is the column number(s), + and they are separated with a comma. 
+ + +```r +## Selects the element in the first row, second column +penguins[1, 2] + +## Selects every element in the fourth row +penguins[4, ] + +## Selects every element in the third column +penguins[, 3] +``` + +- `[[]]` -- selects a column from a dataframe + - Inside the brackets you can pass either the number of the column or the + name of the column (in quotations) + + +```r +penguins[[1]] + +penguins[["island"]] +``` + +- `$` -- selects a column from a dataframe, where the name of the dataframe is + on the left and the name of the column is on the right + + +```r +penguins$body_mass_g +``` + +## Working with Different Data Types + +- `factor()` -- creates a categorical variable from a character or numeric + variable, variable has a factor datatype + - the values (level) of the factor levels is specified in the `levels` + argument, where the levels must be specified in a vector (using `c()`) + - Note: the order you wish for the levels to appear is how you should list + them in the `levels` argument, you can also specify `ordered = TRUE` to + ensure the levels remain in this order + + +```r +penguins$year_fct <- factor(penguins$year, + levels = c("2007", "2008", "2009"), + ordered = TRUE) +``` + +- `as.factor()` -- creates a categorical variable from a character or numeric + variable, variable has a factor datatype + - does not allow for you to specify the order of the levels + - defaults to alphabetical ordering for factor levels + + +```r +penguins$year_fct <- as.factor(penguins$year) +``` + +- `levels()` -- returns the levels of a variable with a factor datatype, in the + order they were stored + - Note: this function will not work for character datatypes + + +```r +levels(penguins$year_fct) +``` + +- `nlevels()` -- returns the number of levels of a variable with a factor + datatype + - Note: this function will not work for character datatypes + + +```r +nlevels(penguins$year_fct) +``` + +- `as.character()` -- creates a character variable from a numeric or 
factor + variable + + +```r +penguins$species_chr <- as.character(penguins$species) +``` + +- `ymd()` -- transforms dates stored as character or numeric variables to dates + - Note: to use this function, dates must be stored in year-month-day format + - The function does well with heterogeneous formats (as seen below), but + formats where some of the entries are not in double digits may not be parsed + correctly. + + +```r +x <- c("2009-01-01", "2009-01-02", "2009-01-03") +ymd(x) +``` + +- `day()` -- extracts the day (number) of a date variable + + +```r +day(x) +``` + +- `month()` -- extracts the month (number) of a date variable + + +```r +month(x) +``` + +- `year()` -- extracts the year of a date variable + + +```r +year(x) +``` + +## Visualizing Data + +- `plot()` -- a generic function for plotting R objects + - In this lesson `plot()` was used to create bargraphs of categorical + variables. + + +```r +plot(penguins$species) +``` + +